Indexability

Indexability relates to the technical configuration of URLs so that they are either Indexable or Not Indexable.

Search engines generally take the stance that any successful URLs (i.e. HTTP status 200) they find should be indexed by default - and they will, in the main, index everything they can find. However, there are certain signals and directives you can give to search engines that instruct them to NOT index certain URLs.

Setting URLs so that they are Not Indexable is a relatively common task, and straightforward to do in most modern CMSs. You might want to set a URL to noindex, for instance, if it is useful to website users, but is not a page that would represent a useful search result (e.g. a 'print' version of a page).

However, indexing signals are often misconfigured, which can result in important URLs not getting indexed. This matters because a page that is not indexed has no chance of generating any organic search traffic.

Sitebulb's Indexability Hints deal with the robots.txt file, meta robots tags, X-Robots-Tag and canonical tags, and how these directives may impact the way in which URLs are crawled and indexed by search engines.

What are robots directives?

Robots directives are lines of code that provide instructions on how search engines should treat content, from the perspective of crawling and indexing.

By default - or in the absence of any robots directive - search engines work on the basis that every URL they encounter is both crawlable and indexable. This does not mean that they necessarily will crawl and index the content, but that it is the default behaviour should they encounter the URL.

Thus, robots directives are essentially used to change this default behaviour - by instructing search engines to either not crawl, or not index, specific content.

How are robots directives presented to search engines?

There are 3 ways in which robots directives can be specified:

  • Robots meta directives (also called 'meta tags'), which work at a page level. Within the <head> of a page's HTML, you include a meta tag such as <meta name="robots" content="noindex, nofollow"> to control crawling and indexing of that specific URL.
  • X-Robots-Tag headers, which are added to a site's HTTP responses. They can set robots directives at a granular, page level, just like meta tags, but can also be used to specify directives across a whole site, via the use of regular expressions.
  • Robots.txt file, which normally lives at example.com/robots.txt, and is typically used to tell search engine crawlers which paths, folders or URLs you don't want them to crawl, through 'disallow' rules.
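
As a rough illustration of how these three sources might be read, the following Python sketch uses only the standard library. The helper names (split_directives, meta_robots, is_crawl_allowed) are hypothetical, and the parsing is deliberately simplified: it ignores crawler-specific meta tags and any user-agent prefix inside an X-Robots-Tag value.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def split_directives(value):
    """Turn a value such as 'noindex, nofollow' into {'noindex', 'nofollow'}."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= split_directives(attr.get("content", ""))


def meta_robots(html):
    """Return the set of page-level robots directives found in the HTML."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


def is_crawl_allowed(robots_txt, user_agent, url):
    """Apply robots.txt rules to a URL, using the file's text (no network access)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)
```

The value of an X-Robots-Tag header can be passed straight to split_directives, for example split_directives("noindex, nofollow"). A robots.txt containing "User-agent: *" followed by "Disallow: /private/" would make is_crawl_allowed return a false result for any URL under /private/.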

In the methods outlined above, if the 'nofollow' directive is used, it means that you do not wish for any of the links on the page to be followed. However, it is also possible to specify that individual links should not be followed, by adding the rel="nofollow" attribute to the link itself.
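
To make the link-level case concrete, here is a minimal Python sketch that lists the links on a page and flags those marked as nofollow. The LinkCollector class and collect_links function are hypothetical names used for illustration, and the sketch only looks at <a> tags.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records each <a href> together with whether it carries rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (href, is_nofollow) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel can hold several space-separated tokens, e.g. "nofollow sponsored"
        rel_tokens = (attr.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def collect_links(html):
    """Return (href, is_nofollow) pairs for every link in the HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```

For example, a page containing one plain link and one link with rel="nofollow sponsored" would yield a False flag for the first and a True flag for the second.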

What is a canonical?

In the field of SEO, a 'canonical' is a way of indicating to search engines the 'preferred' version of a URL. So if we have 2 URLs that have very similar content - Page A and Page B - we could put a canonical tag on Page A, which specifies Page B as the canonical URL.

To do this, we could add the rel=canonical element in the <head> section on Page A:

<link rel="canonical" href="https://example.com/page-b" />

If this were to happen, you would describe Page A as 'canonicalized' to Page B. In general, what this means is that Page A will not appear in search results, whereas Page B will. As such, it can be a very effective way of stopping duplicate content from getting indexed.

When you set up a canonical, you are effectively saying to search engines: 'This is the URL I want you to index.' People may refer to a canonical as 'a canonical tag', 'rel canonical' or even 'rel=canonical'.

In Sitebulb, if a URL is canonicalized, it is also classed as 'Not Indexable.' Conversely, if a URL has a self-referential canonical (i.e. a canonical that points back to itself) this URL would be Indexable.

Self-referential canonicals are a useful default configuration, and are typically set up to help avoid duplicate, parameterized versions of the same URL from getting indexed, for example:
https://example.com/page?utm_medium=email
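
The rules described so far can be sketched as a small Python function. This is not Sitebulb's actual implementation: classify_indexability is a hypothetical name, the function assumes that only HTTP 200 responses are candidates for indexing, and it compares URLs as exact strings, which a real crawler would likely normalize first.

```python
def classify_indexability(url, status_code, directives=(), canonical_url=None):
    """Return 'Indexable' or 'Not Indexable' for a single crawled URL."""
    normalized = {d.strip().lower() for d in directives}

    # Assumption: only successful (HTTP 200) responses can be indexed.
    if status_code != 200:
        return "Not Indexable"

    # A noindex directive (meta robots or X-Robots-Tag) blocks indexing.
    if "noindex" in normalized:
        return "Not Indexable"

    # A canonical pointing at a different URL means this URL is canonicalized.
    if canonical_url is not None and canonical_url != url:
        return "Not Indexable"

    # No canonical, or a self-referential canonical.
    return "Indexable"


page = "https://example.com/page"
tracked = "https://example.com/page?utm_medium=email"

print(classify_indexability(page, 200, canonical_url=page))
print(classify_indexability(tracked, 200, canonical_url=page))
```

Because both URLs serve the same HTML, they carry the same canonical. It is self-referential on the clean URL, which is classed as Indexable, while the parameterized URL is canonicalized to the clean one and classed as Not Indexable.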

How are canonicals implemented?

The most common way that canonicals are implemented is through a <link> tag in the <head> section of a page. So on Page A, we could specify that the canonical URL is Page B with the following:

<link rel="canonical" href="https://example.com/page-b" />

Canonicals can also be implemented through HTTP headers, where the header looks like this:

HTTP/... 200 OK
...
Link: <https://example.com/page-b>; rel="canonical"

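Typically, this method is used to add canonicals to non-HTML documents such as PDFs, although it can be used for any document.

Since an HTML page can carry a canonical in both its <head> and its HTTP header, the two can end up pointing at different URLs. As such, it is considered best practice to only ever use one method of assigning canonicals for each URL on a given website.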
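
As a minimal sketch of how both methods could be read and checked against each other, the Python below collects canonicals from the HTML and from a Link header value, then reports any disagreement. All function names are hypothetical, and the header pattern is a simplification that only handles entries where rel="canonical" directly follows the URL.

```python
import re
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Collects href values from <link rel="canonical"> tags."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        if "canonical" in attr.get("rel", "").lower().split() and attr.get("href"):
            self.canonicals.append(attr["href"])


def canonicals_from_html(html):
    """Return every canonical URL declared in the HTML."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonicals


# Simplified pattern for entries like: <https://example.com/page-b>; rel="canonical"
LINK_ENTRY = re.compile(
    r'<([^>]*)>\s*;\s*rel\s*=\s*"?canonical"?\s*(?:[;,]|$)', re.IGNORECASE
)


def canonicals_from_link_header(header_value):
    """Return every canonical URL declared in a Link header value."""
    return LINK_ENTRY.findall(header_value)


def find_canonical_conflict(html, link_header=""):
    """Return the sorted distinct canonicals if there is more than one, else []."""
    found = set(canonicals_from_html(html))
    found |= set(canonicals_from_link_header(link_header))
    return sorted(found) if len(found) > 1 else []
```

If the HTML and the header both name https://example.com/page-b, find_canonical_conflict returns an empty list. If they name different URLs, it returns both, which is the situation that sticking to a single method avoids.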
