Indexability refers to how a URL is technically configured, which determines whether it is Indexable or Not Indexable.
Search engines generally take the stance that any successful URLs (i.e. HTTP status 200) they find should be indexed by default - and they will, in the main, index everything they can find. However, there are certain signals and directives you can give to search engines that instruct them to NOT index certain URLs.
Setting URLs so that they are Not Indexable is a relatively common task, and straightforward to do in most modern CMSs. You might want to set a URL to noindex, for instance, if it is useful to website users, but is not a page that would represent a useful search result (e.g. a 'print' version of a page).
However, indexing signals are often misconfigured, which can result in important URLs not getting indexed. This matters because a page that is not indexed has no chance of generating any organic search traffic.
Sitebulb's Indexability Hints deal with the robots.txt file, meta robots tags, X-Robots-Tag and canonical tags, and how these directives may impact the way in which URLs are crawled and indexed by search engines.
What are robots directives?
Robots directives are lines of code that provide instruction on how search engines should treat content, from the perspective of crawling and indexing.
By default - or in the absence of any robots directive - search engines work on the basis that every URL they encounter is both crawlable and indexable. This does not mean that they necessarily will crawl and index the content, but that it is the default behaviour should they encounter the URL.
Thus, robots directives are essentially used to change this default behaviour - by instructing search engines to either not crawl, or not index, specific content.
How are robots directives presented to search engines?
There are 3 ways in which robots directives can be specified:
- Robots meta directives (also called 'meta tags'), which work at a page level. Within the <head> of a page's HTML, you include a meta tag like <meta name="robots" content="noindex, nofollow"> to control crawling and indexing on a specific URL.
- X-Robots-Tag headers, which are added to a site's HTTP responses. They can set robots directives at a granular, page level, just like meta tags, but can also be used to specify directives across a whole site, via the use of regular expressions.
- Robots.txt file, which normally lives at example.com/robots.txt, and is typically used to tell search engine crawlers which paths, folders or URLs you don't want them to crawl, through 'disallow' rules.
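The three mechanisms above can be read programmatically. The following is a minimal sketch using only the Python standard library, not a description of how Sitebulb or any search engine works internally. The names MetaRobotsParser, header_directives and is_disallowed are hypothetical, and the sketch assumes comma-separated directives with no user-agent prefix in the X-Robots-Tag value.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def header_directives(headers):
    """Reads directives from an X-Robots-Tag response header, if present.

    Assumption: the value is a plain comma-separated list such as
    "noindex, nofollow", without a user-agent prefix.
    """
    found = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            found.update(d.strip().lower() for d in value.split(",") if d.strip())
    return found


def is_disallowed(robots_txt, user_agent, url):
    """Returns True if the robots.txt text disallows crawling of the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Illustrative inputs only; the URLs and the "ExampleBot" name are made up.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
meta = MetaRobotsParser()
meta.feed(page)
page_directives = meta.directives | header_directives({"X-Robots-Tag": "noindex"})
blocked = is_disallowed(
    "User-agent: *\nDisallow: /private/\n",
    "ExampleBot",
    "https://example.com/private/report",
)
```

Note that the robots.txt check answers a different question from the other two: a disallow rule controls crawling, whereas noindex in a meta tag or header controls indexing.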
In the methods outlined above, if the 'nofollow' directive is used, it means that you do not wish for any of the links on the page to be followed. However, it is also possible to specify that individual links should not be followed, by adding the rel="nofollow" attribute to a specific link.
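To illustrate the difference between a page-level nofollow and a link-level one, here is a small sketch that separates the links on a page by whether they carry a nofollow value in their rel attribute. The class name LinkFollowParser and the sample markup are hypothetical, and the sketch only looks at <a> elements.

```python
from html.parser import HTMLParser


class LinkFollowParser(HTMLParser):
    """Splits <a href> links into followed and nofollowed lists."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated token list, e.g. rel="nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


sample = (
    '<a href="/pricing">Pricing</a>'
    '<a href="/login" rel="nofollow">Log in</a>'
    '<a href="https://example.org/" rel="noopener NoFollow">Partner</a>'
)
links = LinkFollowParser()
links.feed(sample)
```

If the page itself carried a nofollow directive in its meta robots tag or X-Robots-Tag header, every link in both lists would be treated as nofollowed.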
What is a canonical?
In the field of SEO, a 'canonical' is a way of indicating to search engines the 'preferred' version of a URL. So if we have 2 URLs that have very similar content - Page A and Page B - we could put a canonical tag on Page A, which specifies Page B as the canonical URL.
To do this, we could add a rel=canonical link element, such as <link rel="canonical" href="https://example.com/page-b">, to the <head> section of Page A.
If this were to happen, you would describe Page A as 'canonicalized' to Page B. In general, what this means is that Page A will not appear in search results, whereas Page B will. As such, it can be a very effective way of stopping duplicate content from getting indexed.
When you set up a canonical, you are effectively saying to search engines: 'This is the URL I want you to index.' People may refer to a canonical as 'a canonical tag', 'rel canonical' or even 'rel=canonical'.
In Sitebulb, if a URL is canonicalized, it is also classed as 'Not Indexable.' Conversely, if a URL has a self-referential canonical (i.e. a canonical that points back to itself) this URL would be Indexable.
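The rule described above can be sketched as a small function. This is a simplified reading of the text, not Sitebulb's actual logic: it considers the canonical alone and ignores every other signal (noindex, robots.txt, HTTP status). The function name classify_by_canonical is hypothetical.

```python
from urllib.parse import urljoin


def classify_by_canonical(url, canonical=None):
    """Classifies a URL using only its canonical, per the rule in the text."""
    if canonical is None:
        # No canonical at all: nothing here changes the default.
        return "Indexable"
    # Resolve relative canonicals such as "/page-b" against the page URL.
    target = urljoin(url, canonical)
    if target == url:
        return "Indexable"  # self-referential canonical
    return "Not Indexable"  # canonicalized to a different URL


page_a = classify_by_canonical("https://example.com/page-a", "https://example.com/page-b")
page_b = classify_by_canonical("https://example.com/page-b", "/page-b")
```

The comparison is an exact string match after resolution, so differences such as a trailing slash or letter case would count as a different URL in this sketch.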
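Self-referential canonicals are a useful default configuration, and are typically set up to help prevent duplicate, parameterized versions of the same URL from getting indexed. For example, if https://example.com/page carries a canonical that points to itself, a parameterized variant such as https://example.com/page?utm_source=newsletter serves the same canonical, and is therefore canonicalized to the clean URL.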
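As a rough illustration of the parameterized case, the sketch below derives the clean URL that a parameterized variant would point to. It assumes that every query parameter is irrelevant to the content (for instance tracking parameters), which will not hold on sites where parameters select different content. The function name strip_parameters is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_parameters(url):
    """Returns the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


variant = "https://example.com/page?utm_source=newsletter"
clean = strip_parameters(variant)
# The variant is canonicalized if its canonical target differs from its own URL.
variant_is_canonicalized = clean != variant
```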
How are canonicals implemented?
The most common way that canonicals are implemented is through a <link> tag in the <head> section of a URL. So on Page A, we could specify that the canonical URL is Page B with a tag such as <link rel="canonical" href="https://example.com/page-b">.
Canonicals can also be implemented through HTTP headers, where the header looks like this: Link: <https://example.com/page-b>; rel="canonical"
Typically, this method is used to add canonicals to non-HTML documents such as PDFs, although it can be used for any document.
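A <link> canonical can be read out of a page's HTML with a few lines of standard-library Python. The following is a sketch only: the class name CanonicalParser and the sample URLs are hypothetical, and it deliberately ignores any canonical found outside the <head>.

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Collects href values of <link rel="canonical"> found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            attr = dict(attrs)
            rel_tokens = (attr.get("rel") or "").lower().split()
            if "canonical" in rel_tokens and attr.get("href"):
                self.canonicals.append(attr["href"])

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


page_a_html = (
    "<html><head>"
    '<link rel="canonical" href="https://example.com/page-b">'
    "</head><body>"
    '<link rel="canonical" href="https://example.com/ignored">'
    "</body></html>"
)
canonical_parser = CanonicalParser()
canonical_parser.feed(page_a_html)
```

Collecting a list rather than a single value makes it easy to notice pages that declare more than one canonical.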
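For the HTTP header method, the canonical is carried in a Link response header with rel="canonical". The sketch below pulls the canonical target out of such a header value. The function name canonical_from_link_header is hypothetical, and the regular expression is a simplification that covers common cases rather than the full header grammar.

```python
import re

# Matches "<target>" followed by zero or more ";param" segments.
LINK_RE = re.compile(r"<([^>]*)>((?:\s*;\s*[^;,]+)*)")


def canonical_from_link_header(value):
    """Returns the target of the first rel="canonical" link, or None."""
    for target, params in LINK_RE.findall(value):
        for param in params.split(";"):
            name, _, raw = param.strip().partition("=")
            if name.strip().lower() != "rel":
                continue
            rel_tokens = raw.strip().strip('"').lower().split()
            if "canonical" in rel_tokens:
                return target
    return None


pdf_canonical = canonical_from_link_header(
    '<https://example.com/next>; rel="next", '
    '<https://example.com/white-paper.pdf>; rel="canonical"'
)
```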
Because the two methods can contradict each other if they specify different URLs, it is considered best practice to only ever use one method of assigning canonicals for each URL on a given website.
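A check for this situation might look like the sketch below, which compares the canonical declared in the HTML with the one declared in the HTTP header. The returned labels borrow the names of two Hints listed later in this article, but the function itself (compare_canonicals) is hypothetical and is not Sitebulb's implementation. The comparison is a plain string match.

```python
def compare_canonicals(html_canonical, header_canonical):
    """Compares canonicals from the two methods; None if only one is used."""
    if not html_canonical or not header_canonical:
        return None
    if html_canonical.strip() == header_canonical.strip():
        return "Canonical tag in HTML and HTTP header"
    return "Mismatched canonical tag in HTML and HTTP header"


duplicate = compare_canonicals(
    "https://example.com/page-b", "https://example.com/page-b"
)
conflict = compare_canonicals(
    "https://example.com/page-b", "https://example.com/page-c"
)
```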
Most of the Indexability Hints are Issues, which represent errors or problems that need to be fixed. They are additionally classified in terms of their importance - this should be taken into account when prioritizing implementation work, along with the number and type of URLs affected.
These Hints require immediate attention, as the issue may have a serious impact upon crawling, indexing or ranking.
- Disallowed image
- Disallowed Style Sheet
- <head> contains a <noscript> tag, which includes an image
- <head> contains invalid HTML elements
These Hints are very important, and definitely warrant attention.
- Canonical points to a noindex URL
- Canonical is malformed or empty
- Canonical loop
- Next/Prev Paginated URL is canonicalized to different URL
- Canonical points to a disallowed URL
- Canonicalized URL is noindex, nofollow
- Canonical points to a URL that is Error (5XX)
- Canonical points to a URL that is Not Found 404
- Canonical points to another canonicalized URL
- Canonical points to HTTP version
- Canonical points to HTTPS version
- Mismatched canonical tag in HTML and HTTP header
- Mismatched nofollow directives in HTML and header
- Mismatched noindex directives in HTML and header
- Multiple, mismatched canonical tags
- Meta robots found outside of <head>
- Canonical only found in rendered DOM
- Rendered canonical is different to HTML source
These Hints are worth investigating further, and may warrant further attention depending on the type and quantity of URLs affected.
These Hints are of the lowest significance, and should only be addressed once any more serious issues have been handled.
Indexability Potential Issues
Hints marked 'Potential Issue' describe a situation that might be an issue, or might cause an issue. In the case of Indexability Hints, they typically highlight configurations which are not damaging right now, but could cause issues further down the line.
- URL contains a form with a GET method
- Canonical outside of head
- Noindex found on rel Next/Prev Paginated URL
- Multiple nofollow directives
- Multiple noindex directives
- Nofollow in HTML and HTTP header
- Noindex in HTML and HTTP header
- Canonical URL has no incoming internal links
- Canonical tag in HTML and HTTP header
- Multiple canonical tags
Hints marked 'Opportunity' describe where you could optimize the site to potentially improve performance further.
Insights are neither issues nor opportunities, and often don't require any action at all - they are brought to your attention as they may provide a useful avenue of investigation.