So you want to tell Google which web pages to index and which to ignore?

Why wouldn't you want Google to Index a Page?

  • Security - it's far from ideal to have the wp-admin login page of your latest WordPress creation indexed and served up to potential visitors or opportunist hackers. Keep that sensitive data as hidden, as encrypted and as wrapped up in security protocols as possible.
  • User Journey - unless you're a creative genius, gunning for some kind of unique user experience, you don't want a potential customer landing on the 'thank you' page at the end of your checkout process.

It's also a good idea to have an understanding of robots.txt files and robots tags, in case you need to investigate why a given website isn't being indexed.

How do you stop Google from Indexing Specific Webpages?

There are a number of methods that can be used...

Robots.txt

The most common method of telling Google not to 'crawl' and index a page is to use the robots.txt file.

This is a text file which is uploaded to the root of a domain (e.g. https://www.example.com/robots.txt) and can be used to tell Google, and all other spiders (also called "robots", hence the name robots.txt), what to crawl.

The robots.txt sets "crawler directives". 

For example, if you wanted to stop any spiders from crawling the login pages for the backend of your WordPress site, then you would enter the following into your robots.txt:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php

This tells all spiders (user-agents) that they're not allowed to crawl the wp-admin directory or the wp-login.php page of your site. Google recognises two special characters in robots.txt rules, which work a little like regular expressions: the wildcard (*) and the end-of-URL anchor ($).

You can use the wildcard * to specify "all".

For example 

User-agent: *
Disallow: /

Will tell all spiders/robots (referred to as "user-agents" in robots.txt) not to crawl any of your site.

You can also use the dollar sign ($), which matches the end of a URL, to tell spiders whether or not to crawl pages ending with a certain file extension or naming convention.

For example

User-agent: *
Disallow: /*.html$

Will tell all spiders not to crawl any pages whose URLs end in .html.
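The wildcard also matches mid-URL, which is how you'd block the crawling of URL parameters. For example, this sketch blocks any URL containing a query string:

User-agent: *
Disallow: /*?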

Robots.txt is the least robust of all the methods for 'excluding' content & pages:

"...pages that search engines aren’t allowed to spider, can still show up in the search results, when they have enough links pointing at them. This basically means that if you want to really hide something from the search engines and thus from people using search,robots.txt just isn’t good enough."

http://sebastians-pamphlets.com/

The Meta Robots Tag

The meta robots tag is the preferred method of 'disallowing' specific pages (according to most experts).

The Meta robots tag sets "Indexer directives", rather than "crawler directives".

This is a more robust way of preventing a page from being indexed.

Place the meta robots tag in the <head> of a web page's HTML, before the closing </head> tag.

For example, here is a meta robots tag that applies both of the directives described below:
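<meta name="robots" content="noindex, nofollow">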

"NOINDEX" tells spiders not to index the page in their index/listings.

"NOFOLLOW" tells the spiders not to follow or crawl the links on the page.  

If you want to tell some spiders to crawl a page, and others not to, then you can add specific, and multiple, tags. For example, the following pair (a sketch using the googlebot and bingbot robot names) would stop Google from indexing a page while explicitly allowing Bing to index it:
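<meta name="googlebot" content="noindex">
<meta name="bingbot" content="index, follow">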

The X-Robots-Tag

The X-Robots-Tag is an HTTP response header.

It provides the most flexibility of all methods.  Again this sets "indexer directives".
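When set, it appears alongside the other HTTP response headers, something like this (the surrounding headers are purely illustrative):

HTTP/1.1 200 OK
Content-Type: text/html
X-Robots-Tag: noindex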

If you want to check whether a site has the X-Robots-Tag in place, use Screaming Frog to crawl the site, and then click the "Directives" tab.
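Alternatively, you can spot-check a single URL from the command line with curl's -I flag, which sends a HEAD request and prints just the response headers (the URL below is a placeholder):

curl -I https://www.example.com/somefile.pdf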

As an example of the specificity you can get down to with X-Robots-Tag: if you want to prevent search engines from caching or showing previews of the text files and PDFs on your site, you would set the following (shown here as an Apache mod_headers directive):

Header set X-Robots-Tag "index, noarchive, nosnippet"
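As written, that directive applies to every response the server sends. To scope it to just text files and PDFs on an Apache server, the usual approach is to wrap it in a FilesMatch block (a sketch, assuming Apache with mod_headers enabled):

<FilesMatch "\.(txt|pdf)$">
    Header set X-Robots-Tag "index, noarchive, nosnippet"
</FilesMatch>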

nosnippet prevents Google or other search engines from showing a preview of your page in the Search Engine Results Pages (SERPs).

noarchive prevents a cached link to your page from showing in the SERPs.

noodp tells search engines not to use the DMOZ (Open Directory Project) description of your page in the SERPs.

unavailable_after: [RFC-850 date/time] tells Google not to show a specific page in the SERPs after a specific date.
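For example, to have a page drop out of the results at the end of 2027, you could send the following (a sketch; the date follows the RFC-850 format mentioned above):

Header set X-Robots-Tag "unavailable_after: Friday, 31-Dec-27 23:59:59 GMT"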

There's more specific information on how to configure X-Robots-Tags in Google's developer documentation.

That's it! 

All the fun ways to manipulate how, and which, web pages are indexed by Google and other search engines.

Right laugh.