How to Crawl Large Websites
The aim of this post is to provide a methodology for crawling large websites with Sitebulb.
Crawling large websites is a tricky subject, primarily because of the number of unknowns. Until you actually crawl a website, you don't know if you're working with a 1,000 page website or a 100,000 page website. And that is before you start thinking about embedded resources, external links and subdomains.
But when you are dealing with a big website, crawl data can be incredibly valuable, as it can reveal patterns and issues that you'd struggle to detect otherwise.
So that leads us on to an important question:
How large is a large website?
This depends on perspective. An experienced enterprise SEO who is familiar with 1 million+ page websites might see a 5,000-page site as tiny, but to a solo in-house SEO in their first job, it can feel enormous.
For the sake of argument, we'll draw the line at 100,000 URLs. This is the point at which you might need to start thinking about which crawl and analysis options you have switched on in Sitebulb.
In general, for sites smaller than 100,000 URLs, you should be pretty safe turning on whichever crawl options you like.
Worked example: Patient
We'll work through an example: say I needed to crawl the Patient website, a UK-based health advice site for doctors and patients alike.
If I were actually working with the client, one of my initial Q & A questions would be to ask them about scale, but since I'm not, we'll lean on our friend Google and run a site: query, which returns around 302,000 results.
First things first, this does NOT mean that we will need to crawl exactly 302,000 pages. Not included in this total are noindexed URLs, disallowed URLs, canonicalized URLs, page resource URLs, external links or links to subdomains - yet all of this stuff could end up in the scope of your Audit if you are not careful.
If anything, what Google displays can only really be considered a lower bound, for the purposes of planning a crawl.
So at this point we know it's a pretty big site, so our first port of call is a Sample Audit.
Running a Sample Audit
The point of a Sample Audit is to crawl a small subset of a website, in order to get a feel for how a full Audit will go.
We first need to set up a new Project in Sitebulb.
Once we hit 'Save and Continue', Sitebulb will go off and perform a number of 'pre-audit checks', such as checking the robots.txt file to make sure we can actually crawl the website in the first place. One of the other things it does is a site: query just like we did above.
Due to the size of the website, Sitebulb recommends we start with a Sample Audit, which is exactly what we'll do.
Once you select the Sample Audit, you may notice that a few of the normal analysis options are not available (AMP, International, XML Sitemaps). This is simply because a lot of the data presented in these reports relates to checking internal consistency, which does not work well with sampled data.
In this instance we are not interested in Site Speed or Mobile Friendliness, but we will check Page Resources. This should give us an indication of how many extra pages we may need to crawl with Page Resources switched on.
Scroll down and you'll come to the Sample Crawl Settings, which is how we will limit the crawl. In this case, we are only going to crawl 10 levels deep, and a maximum of only 1500 URLs at each level (Sitebulb will choose 1500 random URLs to crawl at each level). The maximum URLs to Audit should stop the crawler at around 10,000 URLs.
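The way these sample limits interact is worth spelling out. The sketch below is purely illustrative arithmetic, not Sitebulb's actual implementation:

```python
# Illustrative sketch of how the Sample Crawl Settings cap the crawl size.
# These are the settings from the worked example above; this is NOT
# Sitebulb's implementation, just the arithmetic of the caps.

max_depth = 10          # crawl depth limit (levels)
urls_per_level = 1_500  # random URLs sampled at each level
max_urls = 10_000       # overall 'maximum URLs to Audit' limit

# With every level full, the depth/level caps alone would allow:
depth_cap = max_depth * urls_per_level   # 15,000 URLs

# ...but the smaller overall limit wins, so the crawler stops at:
effective_limit = min(depth_cap, max_urls)
print(effective_limit)   # 10000
```

In other words, the per-level sampling shapes *which* URLs get crawled, while the overall maximum is what actually ends the crawl at around 10,000 URLs.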
We'll ramp the speed up to 10 URLs/second, which should allow us to see how fast we can actually crawl this site when we do the main Audit.
Note that 'Save Disallowed URLs' is unchecked by default. For this sample audit we'll switch it on, to see what impact it may have on the crawl.
We'll leave all the Advanced Settings as default, and set the crawler going. If we keep an eye on the progress, we can see how the crawl is going. This one seems to have settled down to about 8 URLs per second.
Viewing the Sample Audit
Once the Sample Audit has finished running, we can use the data collected to make inferences about how a full Audit would work on the site.
One thing we can clearly see is that roughly one page resource URL was crawled for every internal URL.
This means that if we included Page Resources in our main Audit, we would roughly double the number of URLs we need to crawl, so 300,000 suddenly turns into 600,000.
Similarly, there was roughly 1 external URL crawled for every 6 internal ones, so across 300,000 pages this would translate to an extra 50,000 URLs.
The point of doing this is to build up a profile of the website crawl, to answer the question: 'what would it look like to do a full crawl of everything on the site?'
From that we can estimate how long a full Audit will take, and decide whether we are willing to wait that long for all the data.
We can also look at other areas in the report which would contribute more crawl 'budget', such as redirects:
Redirects are not strictly 'crawled', but they are followed, and this still takes time. With a ratio of around 1 in 10 (to Internal URLs), we might expect there to be about 30,000 redirects to follow in the main crawl.
The reason we look at redirects is because these are not indexable URLs, so are unlikely to be included in our baseline of 302,000 we got from Google. We can do the same thing with canonicalized and noindex URLs, which would still need to be crawled, but are unlikely to be indexed.
The Indexation Report includes these details:
The data in this report is potentially game-changing. Check out the two highlighted boxes above, which appear to indicate that for every indexable URL crawled, we also need to crawl a non-indexable one - which would inflate our crawl numbers dramatically.
In actual fact, in this case, most of the non-indexable URLs are already disallowed. Remember that when we set up the crawl, we checked the box to save disallowed URLs. When you do this, disallowed URLs are not scheduled for crawling, but they are saved and bucketed off so you can see what they are. The trade-off is that the tool has to work harder collecting the data and processing the reports at the end.
In this case, when we do the full Audit, I'd recommend unticking that option.
Since there are relatively few noindex and canonicalized URLs, they will have a negligible impact on our potential crawl size, so it's safe to ignore them.
We can now build up our profile of what a full Audit might look like:
- We have a baseline of around 300,000 URLs from Google site: search
- If we decide to crawl page resources, this will add another 300,000 URLs (1 to 1)
- External URLs add another 50,000 URLs (1 to 6)
- Redirects would add another 30,000 URLs (1 to 10)
So if we wanted to crawl everything, we'd be looking at crawling around 680,000 URLs. At a rate of 8 URLs/second, this would take about a day to complete.
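The arithmetic above can be captured in a quick back-of-the-envelope calculation. The ratios and crawl rate below are the ones measured in our Sample Audit; substitute your own numbers:

```python
# Back-of-the-envelope crawl estimate based on ratios from a Sample Audit.
# All the numbers here are assumptions taken from the worked example above.

baseline = 300_000                 # rough total from the Google site: query
page_resources = baseline * 1      # ~1 resource URL per internal URL
external = baseline // 6           # ~1 external URL per 6 internal URLs
redirects = baseline // 10         # ~1 redirect per 10 internal URLs

total = baseline + page_resources + external + redirects
crawl_rate = 8                     # URLs/second observed in the Sample Audit

hours = total / crawl_rate / 3600
print(f"Total URLs: {total:,}")                  # Total URLs: 680,000
print(f"Estimated duration: {hours:.1f} hours")  # ~23.6 hours
```

At 8 URLs/second, 680,000 URLs works out to just under 24 hours of continuous crawling, which is where the 'about a day' figure comes from.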
Limiting the Audit
At this point we have a fairly good estimate for how long a full Audit would take. If we wanted to move forward and carry out the full Audit, we could ask ourselves if we really want to wait this long, or if we'd be comfortable omitting some of the data.
There's no way to avoid crawling redirects, but we can exclude external URLs and page resources, which would keep the total down closer to the 300,000 we started with.
To do this, we'd start a new Project and this time select 'Standard Audit' from the dropdown instead of 'Sample Audit.'
Excluding Page Resources
In fact, page resources are not crawled by default, but it is easy to switch them on by accident.
If you don't want to crawl page resources, you need to leave all 3 of these options unticked:
Both the Site Speed and Mobile Friendly Analysis options require Page Resource data, so you can't have any of these options ticked.
Of course, if you really want access to these reports, then you'll have to bite the bullet and switch them on, just bear in mind how it will affect the Audit size and speed.
Excluding External Links
To find this setting you are going to have to scroll all the way to the bottom of the setup page, and hit the 'Advanced Settings' button on the far left.
Then on the first tab, untick the two boxes 'Check subdomain URLs' and 'Check External URLs' (which are both on by default).
This is the most effective way to limit the size of the Audit without deliberately excluding internal pages on the site.
Excluding Internal HTML URLs
Finally, you can also exclude internal URL paths and/or parameterized URLs through the Advanced Settings. We've written a separate guide which covers that topic, as it can be quite involved: Limiting the Crawler for Faster and Cleaner Audits.
Pause and Resume
This final note is nothing to do with how you arrange the crawl setup, but more to do with how to manage your computer. Ideally, you want to leave Sitebulb to run continuously, complete the crawl and generate all the reports.
However, with bigger Audits it may not be possible to leave your computer on for a long period (e.g. overnight), so the best option is to utilise the 'Pause' feature.
While an Audit is in progress, just hit the blue 'Pause' button in the top right.
Once you've done this, wait around 5 seconds for the purple button at the top left to change from 'Pausing' to 'Paused.'
At this point you'll notice that there is now an option to 'Resume' in the top right.
Once an Audit has been paused, you can close Sitebulb down and shut your machine down. Then when you reopen Sitebulb the next day, you'll see a message on the Dashboard informing you that you have a paused Audit.
Click through to view any paused or interrupted Audits you have.
Hit 'View Progress' to return to the progress screen, where you can then hit 'Resume' to set the Audit running again.
If pausing is not an option, and/or you want to crawl a large site continuously for several days, the best option might be to leave Sitebulb to work uninterrupted on a VPS or cloud instance. If you want instructions on how to do this with AWS, please read our guide.
A Word on Disk Space
A final consideration when crawling large websites is to make sure you have enough space on your hard drive for the Audit. One of the reasons Sitebulb can crawl so many pages is that it writes data to disk instead of holding it all in RAM.
A site with 100,000+ URLs could take up anything from 250 MB to around 2 GB. There's no easy way to know how much space you'll need before you start the Audit, but bear in mind that the more data you collect, the more disk space will be required.
In particular, 'Link Analysis' can take up a lot of disk space, as the numbers can grow huge very quickly.
To put this into perspective, I crawled a site recently with 1.6 million internal URLs. I crawled it once with 'Link Analysis' switched off, and this took up 6 GB of space on my hard drive. I crawled it again and switched on 'Link Analysis', and it was 36 GB!
Why so large? 142,600,000 links is why.
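You can turn those figures into a rough rule of thumb for estimating disk usage. The per-URL and per-link costs below are derived from this single example site, so treat them as ballpark assumptions only:

```python
# Rough disk-space estimate derived from the 1.6M-URL example above.
# These per-unit costs come from one site; real usage will vary.

urls = 1_600_000
links = 142_600_000

base_gb = 6        # observed Audit size with 'Link Analysis' off
link_gb = 36 - 6   # extra space attributable to 'Link Analysis'

kb_per_url = base_gb * 1024**2 / urls     # ~3.9 KB per crawled URL
kb_per_link = link_gb * 1024**2 / links   # ~0.22 KB per stored link

print(f"~{kb_per_url:.1f} KB per URL, ~{kb_per_link:.2f} KB per link")
```

Multiplying your estimated URL count (and link count, if you enable 'Link Analysis') by figures like these gives a sanity check on whether your drive can handle the Audit before you start it.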
View our documentation to learn how to see how much space your Audits are taking up.
To summarize, when crawling a large website:
- Do a Sample Audit.
- Adjust your crawl settings to limit the Audit, based on said Sample Audit.
- Pause and resume if you need to.
- Make sure you've got enough disk space.