So you want to tell Google which web pages to index and which to ignore?
Why wouldn't you want Google to Index a Page?
- Security - it's far from ideal to have the wp-admin login page of your latest WordPress creation being indexed and served up to potential visitors or opportunist hackers. Keep that secure area as hidden, as encrypted and as wrapped up in security protocols as possible.
- User Journey - unless you're a creative genius, gunning for some kind of unique user experience, you don't want a potential customer landing on the 'thank you' page at the end of your checkout process.
It's also a good idea to have an understanding of robots.txt files and robots tags, in case you need to investigate why a given website isn't being indexed.
How do you stop Google from Indexing Specific Webpages?
There are a number of methods that can be used...
The most common method of telling Google not to 'crawl' and index a page is to use the robots.txt file.
The robots.txt sets "crawler directives".
For example, if you wanted to stop any spiders from crawling the login page for the backend of your WordPress site, then you would enter the following into your robots.txt:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
This tells all spiders (user-agents) that they're 'not allowed' to crawl the wp-admin area or the wp-login.php page of your site. Google also recognises two regular-expression-style characters in robots.txt: the wildcard (*) and the end-of-URL anchor ($).
You can enter the wildcard entry * to specify "all". For example:

User-agent: *
Disallow: /

will tell all spiders / robots (referred to as "user-agent" in robots.txt) not to crawl any of your site.
You can also use the dollar sign ($) to tell spiders whether or not to crawl pages ending with certain naming conventions, URL parameters or identifiers. For example:

User-agent: *
Disallow: /*.html$

will tell all spiders not to crawl any pages whose URLs end with .html.
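The wildcard and the dollar sign can also be combined in a bot-specific rule. As a sketch (the choice of Googlebot and PDFs here is just an illustrative assumption), this would keep Google's crawler, and only Google's crawler, away from every PDF on the site:

```
User-agent: Googlebot
Disallow: /*.pdf$
```

Other spiders, which match only the general User-agent: * rules elsewhere in the file, would be unaffected.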
Robots.txt is the least robust of all the methods for 'excluding' content & pages:
"...pages that search engines aren’t allowed to spider, can still show up in the search results, when they have enough links pointing at them. This basically means that if you want to really hide something from the search engines and thus from people using search, robots.txt just isn’t good enough."
The Meta Robots Tag
The meta robots tag is the preferred method of 'disallowing' specific pages (according to most experts).
The Meta robots tag sets "Indexer directives", rather than "crawler directives".
This is a more robust way of preventing a page from being indexed.
Place the meta robots tag in the <head> section of a web page's HTML, before the closing </head> tag.
"NOINDEX" tells spiders not to index the page in their index/listings.
"NOFOLLOW" tells the spiders not to follow or crawl the links on the page.
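Putting the two directives together: a page you want kept out of the index entirely, with none of its links crawled, would carry a tag like this in its <head> (the 'thank you' page shown is just an illustrative example):

```html
<!DOCTYPE html>
<html>
<head>
  <title>Thank you for your order</title>
  <!-- Don't index this page; don't follow or crawl its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
<body>
  <p>Thanks! Your order is on its way.</p>
</body>
</html>
```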
If you want to tell some spiders to index a page, and others not to, then you can add specific, and multiple, tags:
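For instance, a sketch using bot-specific meta names (googlebot and bingbot are the meta names those two crawlers respect; whether you'd actually want to split them this way depends on your site):

```html
<!-- Keep this page out of Google's index... -->
<meta name="googlebot" content="noindex">
<!-- ...but let Bing index it and follow its links -->
<meta name="bingbot" content="index, follow">
```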
The X-Robots-Tag
The X-Robots-Tag is an HTTP header.
It provides the most flexibility of all methods. Again this sets "indexer directives".
If you want to check if a site has X-Robots-Tag in place, use Screaming Frog to crawl the site, and then click the "Directives" Tab.
As an example of the specificity you can get down to with the X-Robots-Tag: if you want to prevent search engines from caching, or showing snippets of, any text files or PDFs on your site, you would set the following in your server configuration:
Header set X-Robots-Tag "index, noarchive, nosnippet"
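Note that the Header line on its own applies to every response the server sends. To limit it to text files and PDFs as described, on an Apache server you'd wrap it in a FilesMatch block, along these lines (a sketch for .htaccess or the vhost config; assumes mod_headers is enabled):

```apache
<FilesMatch "\.(txt|pdf)$">
  # Indexable, but no cached copy and no snippet in the SERPs
  Header set X-Robots-Tag "index, noarchive, nosnippet"
</FilesMatch>
```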
nosnippet prevents Google or other search engines from showing a preview of your page in the Search Engine Results Pages (SERPs).
noarchive prevents a cached link of your page showing in the SERPs.
notranslate tells Google not to offer a translation of this page in the SERPs (the older noodp directive, by contrast, told search engines not to use the Open Directory Project description of a page).
unavailable_after: [RFC-850 date/time] tells Google not to show a specific page in the SERPs after a specific date.
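As a sketch (the date, page and meta name here are illustrative; check Google's documentation for the exact accepted date formats), the same directive can also be set as a meta tag on the page itself:

```html
<!-- Drop this page from Google's results after the date below -->
<meta name="googlebot" content="unavailable_after: 25-Aug-2025 15:00:00 GMT">
```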
There's more specific information on how to configure X-Robots-Tags here.
And those are all the fun ways to manipulate how, and which, webpages are indexed by Google and other search engines.
As Google explained when it introduced the X-Robots-Tag:
"The REP META tags give you useful control over how each webpage on your site is indexed. But it only works for HTML pages. How can you control access to other types of documents, such as Adobe PDF files, video and audio files and other types? Well, now the same flexibility for specifying per-URL tags is available for all other file types. We've extended our support for META tags so they can now be associated with any file. Simply add any supported META tag to a new X-Robots-Tag directive in the HTTP Header used to serve the file."