How to find isolated pages on your website

This article explains how you can find isolated pages on your website, starting with some general theory and explanations about what isolated pages actually are.

"Isolated pages are HTML URLs that are not part of the internal link graph on the website, yet can be reached by crawlers."

Isolated pages are similar to orphan pages, but not quite the same. Orphan pages have no internal links pointing to them, so you can't find them with a straightforward crawl of the website. Isolated pages, on the other hand, can be found by the crawler, but only through links which do not contribute to the link graph (e.g. via a canonical link element).

We'll break down what this means and why isolated pages are important, starting with the link graph:

What is a link graph?

Crawler accessibility is one of the fundamental pillars of technical SEO - pages need to be findable by search engine crawlers as a precursor to being indexable.

In most cases, pages are linked together via internal links, which means that search engine crawlers can extract these links and find the location of other pages to crawl. Internal links are also essential for search engines to understand how your pages are inter-related, and for determining the relative importance of all the pages on your website.
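
For reference, a standard internal link is just an anchor element pointing at another URL on the same site. A minimal sketch (the URL here is a placeholder) might look like this:

  <!-- A normal, followed internal link on Page A pointing at Page B -->
  <!-- Crawlers extract links like this to build the link graph -->
  <a href="https://example.com/page-b/">Read more about Page B</a>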

Interlinked pages can be mapped to a 'link graph', which considers all the pages and how they are connected by links. A very simple version of a link graph could look like this:

Normal Page Connections

Where Page A and Page B link to one another, and Page B and Page C link to one another. Although Page C is not directly linked to by Page A, it can be reached via the link to Page B, and so all three pages exist within the same network.

If you have a page that lives on your website but is not linked to by any other pages, this is typically known as an orphan page.

What is an orphan page?

Let us consider another diagram, with an additional page that has no incoming links:

Page D is orphaned

We can see that Page D is not linked to the network, and is therefore not part of the link graph at all - so a website crawler that starts at Page A cannot traverse the links and eventually discover Page D - it is an orphan page.

"Orphan pages are HTML URLs that are not accessible through crawling the website."

Orphan URLs are reasonably well understood in the SEO community, and you encounter them when you deal with URL Sources that don't rely on crawling the website, such as XML Sitemaps or lists of external backlinks.

We have some guides and documentation on orphan pages.

How are orphan URLs different to isolated URLs?

If we return to our original definition we can see how this differs from orphan pages:

"Isolated pages are HTML URLs that are not part of the internal link graph on the website, yet can be reached by crawlers."

Isolated URLs can be reached and found by crawlers (such as Sitebulb), but are not part of the link graph - which makes them similar to, but not the same thing as orphan URLs.

Why do isolated pages exist?

Isolated pages can be accidentally created due to certain technical (mis)configurations, which we will explore below.

But first it is helpful to consider a deliberate configuration where we want to include links to a page, but we don't want search engines to follow those links.

For example, if we did not want Page C to be indexable or accessible to search engines, we could handle this as follows: 

Nofollow isolated page

The nofollow directive on Page B means that search engine crawlers are instructed to not follow the link through to Page C (crawlers like Sitebulb will also not follow this link by default, but can be configured to crawl links with nofollow directives).
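
For reference, a nofollow instruction can be applied to an individual link via the rel attribute, or to every link on the page via a robots meta tag. A minimal sketch (placeholder URL) of what this might look like on Page B:

  <!-- Option 1: nofollow applied to a single link pointing at Page C -->
  <a href="https://example.com/page-c/" rel="nofollow">Page C</a>

  <!-- Option 2: nofollow applied to every link on the page, via a meta tag in the <head> -->
  <meta name="robots" content="nofollow">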

This means that Page C is not accessible to search engine crawlers. In other words, it is isolated from the link graph:

Noindex nofollow page isolated

This is the case because Page C does not have any incoming links from other pages in the network (i.e. Page A) - so search engines have no way of discovering the page. In this case, Page C is intentionally isolated from the link graph.

We are not concerned about pages that are intentionally isolated; however, we are concerned about pages that are unintentionally isolated, which can happen due to one of these scenarios:

  • URLs that can only be found via a canonical link element
  • URLs that can only be found via a redirect
  • URLs that can only be found via noindex,follow
  • URLs that can only be found via an iframe
  • URLs that are only linked to by other isolated URLs (i.e. children of isolated pages)

Again, to clarify, the 'can only' part means that these URLs do not have any other incoming followed links from internal pages.

Different types of isolated pages

These five different types of isolated pages occur due to slightly different technical setups, so we will explore each in turn below.

Only found via a Canonical

This is a situation where a URL exists on the website and is actually reachable by crawlers - but only by following a link rel="canonical" tag:

Only linked via canonical

None of the other pages in the website link to Page C, which means that Page C ends up isolated from the link graph:

Canonicalised page isolated

Canonical tags exist to differentiate duplicate content: they allow website owners to say to search engines, 'hey, I know that Page B and Page C are duplicates, and I would like you to include Page C in the index - not Page B.'
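
In this scenario, Page B would declare Page C as the canonical version via a link element in its <head>. A minimal sketch (placeholder URL) might be:

  <!-- In the <head> of Page B, pointing search engines at Page C as the preferred version -->
  <link rel="canonical" href="https://example.com/page-c/">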

The thing is, Google do not always agree with the assessment of site owners, and often end up ignoring the canonical completely. This is an example where that might happen. The canonical tag effectively identifies Page C as more important than Page B, but the internal link signals do not support this, as they are pointing at Page B instead of Page C. Google will likely find this contradictory, and may simply ignore the canonical assignment.

Only found via a Redirect

This is a situation where a URL exists on the network and is actually reachable by crawlers - but only by following a redirect:

Via 301 redirect

This is not quite the same as the canonical situation above, as Page C actually does (at least theoretically) get assigned the link equity from any incoming links to Page B, and it is possible to end up at Page C simply by following links. However, the only way to get there is via a 301 hop, because Page C itself is isolated from the link graph:

Redirect page is isolated

Only found via a noindex,follow

This is a situation where a URL exists on the network and is actually reachable by crawlers - but only by following a robots directive with noindex,follow:

Noindex follow

This is a curious case because Page B is effectively saying to search engines, 'don't index this page, but please do follow all the links on it.'
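
On the page itself, this instruction is typically expressed as a robots meta tag in the <head>, along these lines:

  <!-- In the <head> of Page B: do not index this page, but do follow its links -->
  <meta name="robots" content="noindex, follow">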

In a Google Webmaster Hangout in 2017 John Mueller clarified how Google handle noindex, follow:

"It's tricky with noindex, which I think is somewhat of a misconception in general within the SEO community. With a noindex and follow it's still the case that we see the noindex. In the first step we say 'okay you don't want this page shown in the search results, we'll still keep it in our index, we just won't show it and then we can follow those links.'

If we see the noindex there for longer then we think this page REALLY doesn't want to be used in search so we will remove it completely. And then we won't follow the links anyway. So noindex and follow is essentially the same as a noindex, nofollow. There's no really big difference there in the long run."

This means that in the long run, Page B is considered to be noindex,nofollow, so Google will stop following the links. Since Page C has no incoming links from other pages, it ends up isolated from the link graph:

Long term nofollow

Only found via an iframe

This is a situation where a URL exists on the network and is actually reachable by crawlers - but only because the URL is embedded in an iframe on another URL:

Page embedded iframe

Since no other pages link to Page C, it is isolated from the link graph:

Iframe Isolated

This is not necessarily an issue, and could very well be a deliberate setup. For example, Page C might simply be a set of terms, such as a returns policy. The site owner is not interested in Page C being indexed or ranking in search, but wants to include the content from the page on other internal URLs - and using iframes is one way this could be achieved. However, even this setup could cause complications when it comes to child URLs (see below).
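
As a rough sketch, the embed on Page B might look something like this (the URL is a placeholder):

  <!-- On Page B: the returns policy page embedded via an iframe -->
  <!-- The iframe src is not a normal link, so it does not connect Page C to the link graph -->
  <iframe src="https://example.com/returns-policy/" title="Returns policy"></iframe>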

Linked from Isolated URL

This is a situation where a URL does have incoming internal links, but only from other isolated URLs:

Isolated child page

This situation means that, due to the noindex on Page B, Page C will end up isolated. And since Page E's only incoming link is from Page C, which is itself isolated, Page E also ends up cut off from the main link graph:

Child page ends up isolated

This can be difficult to recognise in practice, since Page E actually does have an incoming link (from Page C) - but it is not connected to the rest of the link graph.

This becomes more complicated still when you factor in the myriad complexities of a typical website. With lots of legacy rules and different content creators publishing pages, you can easily end up in a bit of a mess:

Myriad complexities

This can result in large chunks of pages becoming isolated, and easily slipping under the radar:

Lots of children pages isolated

Now, if we ramp up our consideration to websites with thousands of pages, it is clear we need a software solution that detects these issues automatically... enter Sitebulb.

Why are isolated pages a problem?

Everything we've covered so far explains the what and the how, but hasn't really tackled 'why you should care'. And the reason is quite simple - internal links are a really powerful signal to Google.

Internal link popularity helps Google understand which pages on your website are more important than others, and pages with more internal links have a better chance of ranking in the search results, as they accumulate more PageRank.

By definition, isolated pages have no followed internal links pointing at them from the main link graph, which means that they will struggle to rank in the search results, if Google even indexes them at all.

How to identify isolated pages in Sitebulb

Sitebulb will report on isolated URLs as long as 'Search Engine Optimization' is ticked as one of the audit options when you set up the audit (it is pre-checked by default). Isolated URLs show up as a section of the Indexability report:

Indexability report

Clicking on 'View URLs' will allow you to see the affected URLs, and dig in further to investigate the issues, or you can easily export the list to CSV or Sheets via the 'Export URLs' button.

How to investigate isolated pages in Sitebulb

As we have seen, isolated pages can be both complex and subtle, yet Sitebulb will surface them for you automatically. However, you still need to understand what's happening, which requires a bit of digging and investigating.

From the table, click 'View URLs' for the state you wish to investigate further:

Click View URLs

This will take you to a URL List, with an orange 'Crawl Path' button on the left - press this to investigate a specific URL:

Isolated canonical hint details

This will then show you how Sitebulb traversed the website to find the isolated URL:

Crawl Path canonicalized URL

In this case, the homepage (Depth 0) links via an anchor to the URL at Depth 1.  This URL is canonicalized to the URL at Depth 2. You can clearly see how the URL was found by tracing back the steps in the crawl path.

You can also verify that the URL is indeed isolated, and not connected to the link graph, by checking the incoming links:

No incoming links

How to resolve isolated pages

In addition to the 'isolated pages' panel on the Indexability report, you will also find these issues listed in the Indexability Hints.

<INSERT IMAGE>

The 'Learn More' pages for each of these Hints contain specific instructions for how to tackle each type of isolated page:

<INSERT UNORDERED LIST>