Limiting the Sitebulb Crawler for Faster and Cleaner Audits
Sitebulb has an 'Advanced Settings' area which includes a number of options for limiting your crawl.
I know what SEOs are like though: 'I want to crawl all the pages, I want ALL THE DATA.'
Why would you want to limit the crawl?
While you can get away with the 'all the data' approach on smaller sites, as soon as you start trying to crawl large sites, you are going to start running into problems.
For example, I crawled an ecommerce site with 50,000 pages indexed in Google. The faceted navigation spawned so many parametrized pages that the total crawled URLs was over 800,000.
This is the obvious benefit of limiting the crawl: by only crawling the URLs you actually need, you can save a lot of time. It also makes the reports cleaner. For example, the Indexation report won't be overrun with canonicals for all the parametrized URLs, making it easier to see the 'real' data you want to look at.
Worked Example: https://shop.ee.co.uk
We'll work through an example from start to finish. Say EE was my client and I wanted to Audit their online store; I might start by doing a vanilla Audit, just to see what I get.
Subdomains and external links
First things first, we can see that there are quite a few external links:
On closer inspection, lots of these are subdomain URLs (which makes a lot of sense in this instance):
Let's assume we want to re-audit the site and adjust some of the Advanced Settings to clean this stuff up.
From the Project, we'll need to start a new Audit and choose 'Start with New Settings', then scroll down to find the Advanced Settings.
From here, we can turn off checking subdomain links and/or external links, both of which are on by default.
Since we're looking at an ecommerce store, it's not surprising to find lots of parametrized URLs. To find these, we just head over to the filtered URL List for 'Internal HTML URLs', then put a filter on the URL column: 'contains ?'.
From a total of 4690 internal HTML URLs, 4375 of them contain a query string.
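The 'contains ?' filter is just a query-string check. As a minimal sketch of what that filter is doing (using a hypothetical handful of URLs, not EE's real crawl data):

```python
from urllib.parse import urlparse

# Hypothetical sample of crawled URLs; in practice this would be the
# 'Internal HTML URLs' list from the Audit.
urls = [
    "https://shop.ee.co.uk/mobile-phones/",
    "https://shop.ee.co.uk/mobile-phones/?colour=black",
    "https://shop.ee.co.uk/mobile-phones/?colour=black&brand=apple",
    "https://shop.ee.co.uk/sim-only/",
]

# Equivalent of the URL column filter 'contains ?': keep only URLs
# that carry a query string.
parametrized = [u for u in urls if urlparse(u).query]

print(len(parametrized))  # 2 of the 4 sample URLs are parametrized
```

Running the same check over an exported URL list is a quick way to sanity-check the proportions you see in Sitebulb.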
These are caused by a typical faceted navigation, where every 'refinement' option down the left spawns more new URLs, which are picked up and crawled by Sitebulb.
At this point, it might actually be useful that these URLs have been crawled: there are no canonicals on any of them, so they might be getting crawled and indexed by Google, which could be a problem. Knowing about them allows you to do something about them.
But you might have already done something about them, by handling them in Google Search Console, for example. And having them included in the Audit just muddies the water for all the other reports. Consider the word counts:
The 1-25 word bucket has been enormously inflated by the parametrized URLs, making the graph itself less meaningful.
Fortunately, it's very easy to stop Sitebulb from crawling this type of URL: return to the Advanced Settings on our re-audit and select the Parameters tab.
To exclude parameters, we need to untick the 'Crawl Parameters' option.
This will stop Sitebulb from crawling any parametrized URLs at all, which might be a little over the top in some circumstances. So if there are some parameters which you do want to follow (e.g. pagination parameters), you'll need to enter them in the box below.
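To make the behaviour concrete, here is a small sketch of the decision rule just described: with 'Crawl Parameters' unticked, a URL is only scheduled if it has no query string or if every parameter it carries is on the whitelist. The `page` parameter and the URLs are hypothetical examples, not Sitebulb's internals.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical whitelist: parameters we still want to follow
# (e.g. pagination), mirroring the box below 'Crawl Parameters'.
ALLOWED_PARAMS = {"page"}

def should_crawl(url: str, crawl_parameters: bool = False) -> bool:
    """Sketch of the rule: with 'Crawl Parameters' unticked, only URLs
    with no query string, or whose parameters are all whitelisted,
    get scheduled."""
    params = parse_qs(urlparse(url).query)
    if not params:
        return True   # plain URLs are always crawled
    if crawl_parameters:
        return True   # 'Crawl Parameters' ticked: crawl everything
    return set(params) <= ALLOWED_PARAMS

print(should_crawl("https://shop.ee.co.uk/mobile-phones/"))             # True
print(should_crawl("https://shop.ee.co.uk/mobile-phones/?page=2"))      # True
print(should_crawl("https://shop.ee.co.uk/mobile-phones/?colour=red"))  # False
```

In this sketch a faceted-navigation URL like `?colour=red` is dropped, while a paginated URL with only the whitelisted `page` parameter is still followed.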
URLs and Paths
In addition to parameterized URLs, we can also find other patterns which are causing extra pages to be crawled.
In this case, we see some /basket/ URLs, all of which just redirect to /cart/.
While the scale isn't actually a problem in this instance, a similar scenario on a bigger site could be occurring on every single product page, inflating the total crawled numbers massively.
If this was the case, we could just exclude the path /basket/ from the Excluded URLs list in Advanced Settings.
This simply means that when the crawler encounters links that start shop.ee.co.uk/basket/ - it will not schedule these URLs to be crawled, so they won't appear in your reports at all.
You can achieve a similar result the other way around, by including URLs. Say we were only interested in crawling the mobile phones section of the site. To do this, we simply add /mobile-phones/ to the Included URLs list in Advanced Settings.
What this does is exclude everything that doesn't match the specified path. So when the crawler encounters links to shop.ee.co.uk/mobile-phones/ - these WILL be scheduled to crawl, but nothing else will be (unless you add more paths or URLs to the Included URL List).
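The two rules can be sketched together. This is an illustrative model of the scheduling logic described above (the path lists are the hypothetical ones from this example, and the prefix-matching is an assumption about how the lists behave, not Sitebulb's actual code):

```python
from urllib.parse import urlparse

# Hypothetical lists mirroring the Advanced Settings fields.
EXCLUDED_PATHS = ["/basket/"]
INCLUDED_PATHS = ["/mobile-phones/"]

def scheduled(url: str) -> bool:
    """Excluded path prefixes are never crawled; if an Included URL
    list is set, only URLs matching one of its paths are crawled."""
    path = urlparse(url).path
    if any(path.startswith(p) for p in EXCLUDED_PATHS):
        return False
    if INCLUDED_PATHS:
        return any(path.startswith(p) for p in INCLUDED_PATHS)
    return True  # no include list: everything not excluded is crawled

print(scheduled("https://shop.ee.co.uk/mobile-phones/iphone"))  # True
print(scheduled("https://shop.ee.co.uk/basket/"))               # False
print(scheduled("https://shop.ee.co.uk/sim-only/"))             # False
```

Note that /sim-only/ is dropped not because it is excluded, but because it fails to match the Included URLs list.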
Setting Hard Limits
Everything we've looked at so far concerns methods for limiting the crawl with a degree of specificity. They allow you to control which URLs are crawled and included in your Audit reports.
This final method is quite the opposite - a much more blunt approach to simply stopping the audit before it gets too big.
In the main Crawler Settings, you can set the Maximum URLs to Audit by simply adjusting this figure.
By default this value is set to 500,000 URLs but you can adjust this to be whatever you wish (up to a hard maximum of 2 million URLs). In the example above, this would mean Sitebulb would crawl 250,000 URLs in total (including resources and external links, but NOT including any disallowed or excluded URLs).
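The hard limit really is as blunt as it sounds; a minimal sketch (with a toy limit of 5 rather than hundreds of thousands, and a made-up URL frontier) shows the idea:

```python
# A minimal sketch of the hard limit: the crawler simply stops
# scheduling once the audited count reaches the configured maximum.
MAX_URLS = 5  # stand-in for the real figure (default 500,000)

frontier = [f"https://shop.ee.co.uk/page-{i}/" for i in range(20)]
audited = []

while frontier and len(audited) < MAX_URLS:
    audited.append(frontier.pop(0))

print(len(audited))  # 5: the crawl stops at the limit regardless of
                     # how many URLs remain undiscovered
```

Unlike the exclusion and parameter rules, nothing here is selective: whatever happens to be crawled first is what ends up in the report.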
Sitebulb has a wide range of options for limiting the URLs that it crawls, allowing you a great degree of control over the crawler. If you are crawling a particularly large website, these can become invaluable.
It might take a little trial and error to get the Project set up exactly as you want it, but once it is you'll be ready to go for all future Audits.