How to audit canonical tags

Canonical tags signal to search engines which of several similar or duplicate URLs is the 'preferred version', and as such they are one of the primary technical solutions used to solve duplicate content issues.

On sites that have lots of pages with similar content, utilising canonical tags can be an important method of helping ensure the 'correct' URLs are indexed.

Issues with canonicals can have a profound impact upon the indexing of a site, so it is important that they are analysed as part of the 'indexability' portion of a website audit.

This guide explains how to use Sitebulb website audit software to audit canonical tags, and how to unearth issues with canonicals that may require attention.

How to find canonical tags

You can instruct Sitebulb to audit canonical tags by ensuring that the 'Search Engine Optimisation' option is checked during the audit setup (it is always checked by default on new projects):

SEO Audit Option

Once you have made your other audit data selections and set the crawler running, wait for it to complete the audit. Then, head to the 'Indexability' report using the left hand menu:

Indexability Report

In this report you will find all the data about robots and indexing signals, such as noindex and canonicals. In particular, you will see two pie charts which show the Indexability Status and Canonicals respectively. 

Indexability status pie charts

The right hand chart has only 4 possible options:

  1. Canonical to self (most common, this is typically a default in the page template).
  2. Canonical to internal URL (this means that a canonical has been explicitly set to another URL on the same domain).
  3. Canonical to external URL (least common, this means that a canonical has been explicitly set to a URL on a different domain).
  4. Missing canonical (typically means that there is no canonical field in the page template).
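To make the four statuses concrete, here is a minimal Python sketch of how a single page could be bucketed by hand. It is not how Sitebulb works internally; the names `CanonicalParser` and `classify_canonical` are hypothetical, and comparing hostnames exactly is a simplifying assumption for 'same domain' (so `www.example.com` and `example.com` would count as different).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class CanonicalParser(HTMLParser):
    """Collects the href of every <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel can hold several space-separated tokens, e.g. rel="canonical foo"
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonicals.append(attr["href"])


def classify_canonical(page_url, html):
    """Return one of the four statuses for a page (hypothetical helper)."""
    parser = CanonicalParser()
    parser.feed(html)
    if not parser.canonicals:
        return "Missing"
    # Resolve relative canonicals against the page URL before comparing.
    target = urljoin(page_url, parser.canonicals[0])
    if target == page_url:
        return "To Self"
    # Assumption: 'same domain' means an identical hostname.
    if urlsplit(target).netloc.lower() == urlsplit(page_url).netloc.lower():
        return "To Internal URL"
    return "To External URL"
```

For example, `classify_canonical("https://example.com/a", '<link rel="canonical" href="/b">')` resolves the relative href to `https://example.com/b` and reports it as a canonical to an internal URL.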

Clicking on any of these segments will bring you through to a URL List of the corresponding URLs. For example, in the right hand chart if I click the 'To Internal URL' segment, this brings me through to a URL List that shows all the URLs which have a canonical pointing at another internal URL:

URL List canonicals

In general, there is not much benefit to checking URLs that have a canonical to self, but all other statuses are worth checking. Where a canonical has been set, you need to ensure that it points to the correct URL.

For example, you may find a site that allows URLs with and without trailing slashes to return a 200 status (i.e. there is no redirect between them). If a canonical is used to indicate one of these options to be used for indexing, you want to ensure that the canonical is correctly identifying the right URL (e.g. /pages/page-a has a canonical to /pages/page-a/).

If you are unsure, certain things can help make it clear which option is preferred:

  • Google indexes the URL with the trailing slash
  • All/most of the internal links on the site point to the trailing slash option
  • The trailing slash URLs are included in the XML Sitemap, whereas URLs with no trailing slash are not
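The three checks above can be combined into a rough tally. The sketch below is only an illustration of that reasoning: `preferred_variant` is a hypothetical helper, and it assumes you have already gathered the inputs yourself (a set of URLs known to be indexed, a list of internal link targets, and the set of URLs in the XML Sitemap). Each signal is weighted equally, which is a simplification.

```python
from collections import Counter


def slash_variants(url):
    """Return the (no trailing slash, trailing slash) forms of a URL."""
    base = url.rstrip("/")
    return base, base + "/"


def preferred_variant(url, indexed_urls, link_targets, sitemap_urls):
    """Score both variants on the three signals; None means no clear winner."""
    no_slash, slash = slash_variants(url)
    link_counts = Counter(link_targets)
    scores = {no_slash: 0, slash: 0}
    for variant, other in ((no_slash, slash), (slash, no_slash)):
        # Signal 1: only this variant is indexed.
        if variant in indexed_urls and other not in indexed_urls:
            scores[variant] += 1
        # Signal 2: more internal links point at this variant.
        if link_counts[variant] > link_counts[other]:
            scores[variant] += 1
        # Signal 3: only this variant appears in the XML Sitemap.
        if variant in sitemap_urls and other not in sitemap_urls:
            scores[variant] += 1
    if scores[no_slash] == scores[slash]:
        return None
    return max(scores, key=scores.get)
```

A `None` result means the signals are absent or contradictory, which is exactly the situation where manual judgement about the site is needed.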

Auditing canonical tags in this way can require a level of experience or a deeper understanding of the specific website and how it is set up. 

How to find canonical tag issues

There can also be technical errors or inconsistencies in the way canonicals are set up, which may mean that search engines simply ignore them when making indexing decisions.

Sitebulb will automatically check every internal URL for a wide range of potential canonical issues, and if any issues are found, these will be presented via the 'Hints' tab:

Canonical Hints

While these Indexability Hints can also contain other indexability issues outside of canonicals, this is the place to go if you want to find out if the website has any canonical tag issues.

As you can see, each Hint is given an 'importance' rating (Critical/High/Medium/Low) and a percentage coverage, so you can quickly see at a glance how serious or widespread an issue is. As with all other Hints in Sitebulb, you can explore further by clicking 'View URLs' to see the list of affected URLs, or by clicking the 'Learn more about this hint' button (bottom left), which will take you to an explainer page on our website about the specific Hint (you can also learn more about Hints here).

How to understand the effects of rendering on canonical tags

Google's ability to render web pages has improved considerably over the last few years, and they now claim the following:

  1. They render pretty much every URL they encounter
  2. Indexing decisions are made after rendering has taken place

There are some more subtleties surrounding this, so if you want to learn more we recommend you check out our guide How JavaScript Rendering Affects Google Indexing. However, the two statements above are enough to conclude that it is important to consider the effects of rendering when auditing indexability signals, such as canonicals.

In particular, you need to know how to crawl the website to ensure you are working with the correct data. If you are using the HTML Crawler to audit your website, yet JavaScript is changing canonicals during rendering, you will not be looking at the same data as Google and may make incorrect assumptions.

To figure out the impact of rendering, make use of Sitebulb's response vs render comparison report. One of the elements this report will show you is the impact upon canonicals - in the right hand pie chart below:

Response vs render

The pie chart segments correspond to:

  • No Change - the canonical is identical in the response and rendered HTML
  • Created - the canonical element was not present in the response HTML, and is only present in the rendered HTML (therefore has been 'created' by JavaScript)
  • Modified - the canonical element was present in the response HTML, but the canonical URL is different in the rendered HTML (therefore has been 'modified' by JavaScript)
  • Duplicated - the canonical element was present in the response HTML, but is present twice in the rendered HTML (therefore has been 'duplicated' by JavaScript)
  • Deleted - the canonical element was present in the response HTML, but is not present in the rendered HTML (therefore has been 'deleted' by JavaScript)
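The five segments can be expressed as a small comparison between the two versions of a page. The Python sketch below is an illustration of the definitions above, not Sitebulb's implementation. `extract_canonicals` and `compare_canonicals` are hypothetical names, and it assumes you have already saved the response HTML and the rendered HTML as strings. Treating a page with no canonical in either version as 'No Change' is also an assumption, since that case is not covered by the list.

```python
from html.parser import HTMLParser


class _CanonicalCollector(HTMLParser):
    """Collects the href of every <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_tokens and attr.get("href"):
            self.hrefs.append(attr["href"])


def extract_canonicals(html):
    """Return all canonical hrefs found in an HTML string, in order."""
    collector = _CanonicalCollector()
    collector.feed(html)
    return collector.hrefs


def compare_canonicals(response_html, rendered_html):
    """Label the effect of rendering on the canonical element(s)."""
    before = extract_canonicals(response_html)
    after = extract_canonicals(rendered_html)
    if not before and after:
        return "Created"      # only present after JavaScript has run
    if before and not after:
        return "Deleted"      # removed by JavaScript
    if len(after) > len(before):
        return "Duplicated"   # extra canonical element(s) after rendering
    if before != after:
        return "Modified"     # same count, different canonical URL
    # Identical in both versions (assumption: this includes 'absent in both').
    return "No Change"
```

Running this over a sample of pages gives a quick sanity check on whether JavaScript is touching canonicals at all, before deciding how to crawl the full site.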

This report is intended as a diagnostic device - use it to explore the effects of JavaScript, and then dig in further if you see something that warrants attention.

The most straightforward outcome is of course that everything is listed as 'No Change.' This means you don't need to dig any further, and in fact means that the HTML Crawler is sufficient for future analyses, as the canonicals are not dependent on JavaScript, which effectively means that Response HTML = Rendered HTML (at least in terms of the canonicals).

However, if the canonical differs between the response and rendered HTML, you should always audit the website using the 'Chrome Crawler', as you will otherwise be working with inaccurate data.