It is relatively common to find orphan pages when conducting website audits. They are the debris of constantly evolving websites, where content is changed, added or removed on a regular basis.
Orphan pages are generally not considered to be a good thing, although they are not inherently bad either. They typically represent old pages that have been forgotten about to some degree.
For background on what orphan pages are, why they're bad for SEO, and how to approach fixing them, check out our guide to orphan pages.
This article is focused on how to find orphan pages using Sitebulb, and breaks down into a few sections:
Orphan pages are URLs that are not linked to by other internal URLs on the same website. This means they are not part of the main website architecture, and website visitors would not be able to find them simply by browsing the website. Similarly, search engine crawlers would not be able to discover orphan pages simply by crawling the website.
This means that orphan URLs are discovered by some other means, for example:
The Sitebulb tool defines orphan pages as URLs that were discovered in the audit, but NOT by the crawler.
They will only appear in your audit if you connect other crawl sources, such as XML Sitemaps or Google Analytics.
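To make the definition concrete, here is a minimal sketch of the underlying idea: an orphan is any URL that turns up in another source but not in the crawl. This is only an illustration of the set logic, not Sitebulb's actual implementation, and the function name, source names and example URLs are all made up.

```python
def find_orphans(crawled_urls, other_sources):
    """Return URLs seen in any non-crawler source but not found by the crawler."""
    crawled = set(crawled_urls)
    seen_elsewhere = set()
    for urls in other_sources.values():
        seen_elsewhere.update(urls)
    # Orphans = everything found elsewhere, minus everything the crawler reached.
    return sorted(seen_elsewhere - crawled)


# Hypothetical data: two crawled URLs and two extra sources.
crawled = ["https://example.com/", "https://example.com/about/"]
sources = {
    "xml_sitemap": ["https://example.com/", "https://example.com/old-offer/"],
    "google_analytics": ["https://example.com/about/", "https://example.com/landing-2019/"],
}

orphans = find_orphans(crawled, sources)
print(orphans)
```

If no extra sources are supplied, the result is always empty, which is why orphan pages only appear once other sources are connected.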
Firstly, you'll need to crawl the website, so start a new website audit or project.
Then it is a case of connecting up the other URL sources, all of which is done on the main audit setup page.
Whilst the options below will show all the different sources, all of them are optional. So for instance, if you only want to identify orphan URLs that are contained in XML Sitemaps, only connect the XML Sitemaps in the audit setup.
Scroll down to the 'Google Analytics' tickbox, and then follow these 3 steps:
This third step is crucial, as it means that URLs which Sitebulb finds in Google Analytics but not in the crawl will also be included in the audit.
Scroll down to the 'Google Search Console' tickbox, and then follow these 3 steps:
There are a few different ways to add XML Sitemaps. To access the XML Sitemaps bit you need to scroll further down still to the section entitled 'Select URL sources to Audit.'
The most basic way is simply to tick the box, like this:
This works if the XML Sitemaps you wish to crawl are easy for Sitebulb to discover automatically. For instance, if they are listed in your robots.txt file, Sitebulb will go and grab these and add them to the list. Similarly, if you have connected Google Search Console, Sitebulb will also go and find any listed in there.
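If you want to check in advance which XML Sitemaps your robots.txt file advertises, something like the following sketch would do it. It uses Python's standard `urllib.robotparser` module (the `site_maps()` method needs Python 3.8 or later). The robots.txt content is a made-up example, and this is not a description of how Sitebulb does its own discovery.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice you would fetch your own file.
robots_txt = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns None when no Sitemap lines are present.
sitemaps = parser.site_maps() or []
for sitemap_url in sitemaps:
    print(sitemap_url)
```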
To see which XML Sitemaps Sitebulb has queued up, click the word 'XML Sitemaps' to open up the full options panel. This will show you which URLs Sitebulb has found and queued to crawl already (in some cases this might be empty, in which case Sitebulb will warn you):
In this case, we can see 2 listed. This is because we (stupidly) refer to the XML Sitemap URL without the trailing slash in the robots.txt file, but with the trailing slash when we submitted to GSC (doh!). Either way, if you want to delete any sitemaps you don't want included, just hit the red Delete button over on the right.
Similarly, you can manually add extra XML Sitemap URLs which Sitebulb did not automatically discover, either one by one in the entry box, or lots all at once by hitting the green Add Multiple XML Sitemap URLs button.
Finally, if you want to add XML Sitemaps in file format rather than URLs, you can drag and drop these into the panel at the bottom.
This one perhaps takes a little more explaining than any of the options already covered. A URL List is simply a list of URLs that Sitebulb will process during the audit, with the condition that they must be on the same root domain as the start URL.
In terms of finding orphan pages, you would use it as a means of confirming whether a pre-defined set of pages is orphaned or not, for instance:
You can add a URL List as a URL source from the same 'Select URL sources to Audit' section that you control XML Sitemaps from.
It is a more straightforward process, however: simply tick the box and drop in the file to import.
The file must be in .csv or .txt format and contain the list of URLs (either in the first column or in a column with a header of 'URL'). Only URLs that contain the same root domain as the start URL will be included.
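If you are preparing a URL List by hand, it can help to sanity-check it before importing. The sketch below mimics the rules described above: it reads the 'URL' column if there is one, falls back to the first column otherwise, and keeps only URLs on the given root domain. The function name is hypothetical and the root domain check is a deliberately simple hostname suffix match, so treat it as an approximation rather than Sitebulb's own logic.

```python
import csv
import io
from urllib.parse import urlparse


def load_url_list(csv_text, root_domain):
    """Extract URLs on root_domain from CSV text (illustrative only)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []

    # Use the column headed 'URL' if present, otherwise the first column.
    header = [cell.strip().lower() for cell in rows[0]]
    if "url" in header:
        column = header.index("url")
        data_rows = rows[1:]
    else:
        column = 0
        data_rows = rows

    urls = []
    for row in data_rows:
        if len(row) <= column:
            continue  # skip blank or short rows
        url = row[column].strip()
        host = urlparse(url).hostname or ""
        # Assumption: "same root domain" means the host or any subdomain of it.
        if host == root_domain or host.endswith("." + root_domain):
            urls.append(url)
    return urls


sample = (
    "Title,URL\n"
    "Home,https://example.com/\n"
    "Blog,https://blog.example.com/post/\n"
    "Other,https://other.org/page/\n"
)
print(load_url_list(sample, "example.com"))
```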
This is perhaps obvious, but bears repeating: in the URL Sources section, make sure to tick 'Crawler'. If you do not, then Sitebulb will not be able to tell you which URLs are orphaned, as it won't know which ones were not discoverable via internal links in the website.
An audit set up with the Crawler, XML Sitemaps and URL List all selected as sources will look like this:
Once the audit has finished running, you can locate orphaned pages from a few different places:
This caters for different workflows, allowing you to access the data visually, via the in-built 'Hints' system, or through a data table.
The first place you will find it is on the Audit Overview. If you scroll down to the chart 'HTML URL Sources', this shows you all the different sources and the URLs found in each:
For each source, this chart shows 3 datapoints:
In the case of orphan URLs, the segment we want is 'Crawler - Missing', AKA 'URLs not found by the Crawler.' If you click this segment in the chart, it will open up a URL List of the URLs in question:
Cool cool, this looks pretty handy.
However, you can see that it is more useful still by scrolling right, to see the 'source' columns themselves:
The row I have highlighted shows that this URL was found in the XML Sitemap, but not in GA, GSC or the URL List. So you can quickly get a good idea of where orphan URLs are coming from. As with all URL Lists, they can be sorted, filtered and exported to spreadsheet format.
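Conceptually, those source columns are just a found/not-found flag per source for each URL. A minimal sketch of that idea is below. The source names mirror the ones mentioned above, but the function, the data and the output layout are invented for illustration and do not reflect Sitebulb's internals or export format.

```python
def source_table(orphans, sources):
    """Map each orphan URL to a found/not-found flag per source."""
    return {
        url: {name: url in urls for name, urls in sources.items()}
        for url in orphans
    }


# Hypothetical data: one orphan URL that only appears in the XML Sitemap.
sources = {
    "XML Sitemap": {"https://example.com/old-offer/"},
    "Google Analytics": set(),
    "Google Search Console": set(),
    "URL List": set(),
}

table = source_table(["https://example.com/old-offer/"], sources)
for url, flags in table.items():
    found_in = [name for name, found in flags.items() if found]
    print(url, "found in:", ", ".join(found_in))
```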
You can also see this data in the single URL Details view (which you can access by right-clicking on the URL):
Many users make use of Sitebulb's Hint system as a core part of their website auditing workflow - taking each section in turn, and browsing through the prioritized Hints to understand all the issues within that particular section.
Since orphan URLs are to do with links (or lack thereof), if Sitebulb identifies any orphaned URLs you will find them in the Hints for the Links section:
You can click the blue View URLs button to see the URL data. This will result again in a URL List showing you all the orphaned URLs, along with the crawl sources.
As with all the Hints within Sitebulb, if you click the blue outlined button 'Learn more about this hint and how to fix this issue', it will open a browser window on your computer and take you to the 'Learn more' page on the Sitebulb website, in this case for the Hint 'URL is orphaned and was not found by the crawler.'
This page will give you further context on why orphan pages are important, and assistance in resolving the issue.
Whilst some Sitebulb users prefer visual workflows, and others like to work through the Hints, some users just like to look at big lists of data. Sitebulb has this preference covered too, with the URL Explorer - which is located in the top menu bar.
To find orphaned URLs in the URL Explorer, click the 'Internal' dropdown menu and select 'Orphaned':
This will again take you through to a list of the URL data for all the orphaned pages.
However, this time it will remain within the 'frame' of the URL Explorer, and won't take you off to another page:
All of the workflows above lead you to a 'big list of all the URL data', the idea being that you can dig into the data further in this view, and potentially filter or sort the list further.
For example, I could add a filter like this to my list:
Which might constitute 'the things I want to send to my client to fix.'
And then, to generate the spreadsheet to actually send them, all I need to do is hit the green Export button, and select either CSV or Google Sheets.
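If you prefer to do this kind of narrowing down after the export, the same idea can be sketched in a few lines of Python against the CSV file. Note that the column names ('Status Code', 'In XML Sitemap') and the filter itself are assumptions invented for this example, so check them against the headers in your own export.

```python
import csv
import io


def filter_rows(csv_text, predicate):
    """Return the header and the rows of a CSV export that satisfy predicate."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for row in reader if predicate(row)]
    return reader.fieldnames, rows


def to_csv(fieldnames, rows):
    """Serialise the filtered rows back to CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical export; real column names will differ.
export = (
    "URL,Status Code,In XML Sitemap\n"
    "https://example.com/old-offer/,200,Yes\n"
    "https://example.com/retired/,404,Yes\n"
    "https://example.com/landing-2019/,200,No\n"
)

# Example filter: orphaned pages that still return 200 and sit in the sitemap.
fieldnames, to_fix = filter_rows(
    export,
    lambda row: row["Status Code"] == "200" and row["In XML Sitemap"] == "Yes",
)
print(to_csv(fieldnames, to_fix))
```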
Please check out this guide to learn more about incorporating Google Sheets into your audit workflows.