How to configure your machine for crawling large websites

In most circumstances, your computer should be able to handle sites with fewer than 500,000 pages - but going larger than this requires a change of mindset.

This is the sort of task which, in the past, would have only been available with a very expensive cloud crawler, costing hundreds or even thousands of dollars. Achieving the same thing with a desktop crawler is perfectly possible, but you need to be aware of the limitations.

Using your machine for other tasks

A desktop crawler like Sitebulb uses up the resources on your local machine - processing power, RAM and threads (which are used for processing specific tasks). The more you ask Sitebulb to do, the fewer resources will be available for you to do other tasks.

Ideally, if you are crawling a site over 500,000 pages, you would not use your computer for anything else - just leave it to crawl (perhaps overnight). If you do need to use it, then try and stick to programs that are less memory intensive - so shut down Photoshop and at least half of the Chrome tabs you didn't even need to be open in the first place.

In reality, it almost always works best when you have a spare machine you can just stick Sitebulb on to chug away on the audit, freeing up your own machine for your other work. If this isn't an option, then a cheap VPS would be fine, or you can spin up an AWS instance and run it there instead.

The main point I am trying to make here is that you can't really expect to crawl a massive site in the background and at the same time use your computer as you normally would.

Crawling faster

When crawling bigger sites, you typically also want to crawl faster, because the audit will take a lot longer to complete. It is straightforward to increase the speed at which Sitebulb will try to crawl. Go to Advanced Settings when setting up the audit, then select 'Speed'.

Increasing the number of threads is the main driver of speed increases. You can also increase the URL/second limit, or remove it completely by unticking the box. When crawling with the Chrome Crawler, the variable is simply the number of Instances you wish to use.
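To get a feel for why speed matters at this scale, it helps to do the arithmetic. The snippet below is a rough sketch in Python, not anything built into Sitebulb: the function name and the example rates are made up for illustration, and it assumes the crawl holds a steady rate from start to finish, which real audits rarely do.

```python
def estimated_crawl_hours(total_urls: int, urls_per_second: float) -> float:
    """Return the hours needed to crawl total_urls at a steady rate.

    Hypothetical helper for back-of-envelope planning only.
    """
    if urls_per_second <= 0:
        raise ValueError("urls_per_second must be positive")
    return total_urls / urls_per_second / 3600


# Illustrative rates (assumptions, not Sitebulb defaults) for a 500,000 page site.
for rate in (5, 10, 25):
    hours = estimated_crawl_hours(500_000, rate)
    print(f"{rate:>3} URL/s -> {hours:.1f} hours")

# Prints:
#   5 URL/s -> 27.8 hours
#  10 URL/s -> 13.9 hours
#  25 URL/s -> 5.6 hours
```

Treat the result as a best case. External links, retries and throttling all add to the total.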

Increase audit speed

There are 2 main factors to consider when ramping up the speed of the crawler:

  1. What can your computer handle?
  2. What can the website handle?

What can your computer handle?

As per the above, when Sitebulb uses up computer resources, it will slow down other programs on your machine and make it more difficult for you to do other work. The reverse is also true - if you are using lots of other programs, this will slow down Sitebulb, as it will have fewer resources available to use.

The other factors in play include:

  • Number of cores in your machine - Sitebulb uses threads for carrying out different processes. The more cores you have in your machine, the more threads are available for crawling tasks. 
  • Which audit options you select when setting up the audit - tick more boxes, and you ask it to do more work. In particular, crawling with the Chrome Crawler is MUCH more resource heavy than the HTML Crawler.
  • Hard disk - Sitebulb writes data to disk as it is collected, so read/write speed becomes a factor when you want to crawl faster. A solid state drive (SSD) is much faster than a hard disk drive (HDD), and is definitely recommended when crawling big sites.
  • Available RAM - although Sitebulb writes to disk, RAM is still used in the process. As you crawl faster, this pushes more data into RAM before it gets written to disk.
  • Connection - the speed and stability of the audit is affected by how your computer is connected to the internet. A hard wired LAN connection is best, WiFi is acceptable, and obviously a mobile connection is the worst.
  • Bandwidth - the available bandwidth clearly has a big impact on speed. If you are on a 2 Mbps line then you can't expect a blazing fast crawl speed. Similarly, bear in mind the actions of your colleagues. Are they all themselves crawling websites, or uploading a bunch of website design changes, or streaming the latest Peaky Blinders episode?
  • Anti-virus/firewall - some anti-virus software will literally check every URL that is downloaded by your system. This adds an additional 'wait' time to every URL crawled, and can easily impact the crawl speed.
  • Proxy connection - if your company forces all internet connections to route through a proxy or VPN, you are adding a whole extra HTTP layer for every single outgoing and incoming connection, which can vastly increase the overall crawling time.
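To put the bandwidth point above into numbers, here is a small Python sketch of the ceiling your connection puts on crawl speed. It is a simplification under stated assumptions: the function name is hypothetical, the 100 KB average page size is a guess you should replace with a figure for your own site, and it ignores protocol overhead, compression and anyone else sharing the line.

```python
def max_urls_per_second(bandwidth_mbps: float, avg_page_kb: float) -> float:
    """Upper bound on crawl rate if bandwidth were the only limit.

    bandwidth_mbps: line speed in megabits per second.
    avg_page_kb:    assumed average response size in kilobytes (1 KB = 1000 bytes).
    """
    if bandwidth_mbps <= 0 or avg_page_kb <= 0:
        raise ValueError("both arguments must be positive")
    bytes_per_second = bandwidth_mbps * 1_000_000 / 8  # bits -> bytes
    return bytes_per_second / (avg_page_kb * 1000)


# Assumed average page size of 100 KB, purely for illustration.
print(max_urls_per_second(2, 100))    # 2.5
print(max_urls_per_second(100, 100))  # 125.0
```

On the 2 Mbps line, no amount of extra threads will get you past a couple of URLs per second, so it is worth checking this before blaming the machine or the website.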

What can the website handle?

We have written at length about crawling responsibly, and this is even more important when you are ratcheting the speed up.

When trying to determine how fast you can push it on any given website, some trial and improvement is required. You need to become familiar with the crawl progress screen and the data it is showing you.

Crawl in progress

In particular:

  • Average speed - is this staying steady, or fluctuating wildly? Does it appear to be slowing down as you progress through the audit?
  • TTFB - are these values consistent, or do they creep up over time?
  • Timeouts/errors - are you seeing lots of timeouts and errors coming through?   

You are looking for a steady, consistent, predictable audit with few errors. If this is what you are seeing, you can start to slowly increase the threads and speed limits. Do this by pausing the audit and selecting to 'Update audit settings'. Once you resume the audit, re-check the crawl progress and make sure it remains steady, and check to see if the speed has increased. Iterate through this process a number of times, incrementally increasing the threads each time, until you are comfortable with the speed.
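If you prefer a checklist to a gut feeling, the three signals above can be turned into a simple yes/no rule. The Python sketch below assumes you jot down a few readings of average speed and TTFB from the crawl progress screen by hand. The function name and all three thresholds are assumptions chosen for illustration, not values recommended by Sitebulb, so tune them to the site you are crawling.

```python
from statistics import mean, pstdev


def safe_to_speed_up(speeds, ttfbs_ms, error_count, urls_crawled,
                     max_speed_cv=0.2, max_ttfb_growth=1.25,
                     max_error_rate=0.01):
    """Return True if the crawl looks steady enough to try more threads.

    speeds:    average-speed readings (URL/s), oldest first.
    ttfbs_ms:  TTFB readings in milliseconds, oldest first.
    All thresholds are illustrative assumptions.
    """
    if len(speeds) < 4 or len(ttfbs_ms) < 4 or urls_crawled <= 0:
        return False  # not enough data to judge yet

    avg_speed = mean(speeds)
    half = len(ttfbs_ms) // 2
    early_ttfb = mean(ttfbs_ms[:half])
    if avg_speed <= 0 or early_ttfb <= 0:
        return False

    speed_cv = pstdev(speeds) / avg_speed            # is speed fluctuating?
    ttfb_growth = mean(ttfbs_ms[half:]) / early_ttfb  # is TTFB creeping up?
    error_rate = error_count / urls_crawled           # timeouts and errors

    return (speed_cv <= max_speed_cv
            and ttfb_growth <= max_ttfb_growth
            and error_rate <= max_error_rate)


# A steady crawl versus one where the site is starting to struggle.
print(safe_to_speed_up([9.8, 10.1, 10.0, 9.9], [210, 220, 215, 225], 3, 2000))  # True
print(safe_to_speed_up([10, 6, 11, 4], [200, 260, 400, 650], 90, 2000))         # False
```

When the answer is False, hold the current settings or back off rather than adding more threads.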

Some other factors can also affect the speed:

  • Crawling external links - this is particularly relevant if you have a site that links out to the same external site on every single page with unique URLs. This would mean that you are pinging this site at almost the same speed you are crawling your own site. Just because your site can take the speed does not mean that the external site can as well. Consider switching off External Links in the Advanced Settings.
  • The website/server - sometimes, as it notices the barrage of requests coming from you, the server starts to throttle you, or something in the network starts to throttle the connection. You are unlikely to be able to resolve this without talking to the system admins for the site, as they may be able to whitelist you to crawl without throttling.
  • CDNs - similarly, a CDN like Cloudflare can slow down an audit by throttling connections or holding onto them longer than usual. Again, you might be able to get whitelisted.
  • Hosting - the geolocation of the server has an impact on speed. If the website is hosted in the USA, yet you are crawling from the UK, you are making a much longer round-trip to collect data every time, which inevitably slows down the crawl.
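The hosting point is easy to underestimate, so here is a rough Python sketch of how extra round-trip time eats into crawl speed. It assumes each thread handles one request at a time and waits for the full response before starting the next, which is a simplified model and not a description of how Sitebulb works internally. The function name and the timings (200 ms per request locally, an extra 150 ms for a long-distance round trip) are illustrative assumptions.

```python
def max_crawl_rate(threads: int, ms_per_request: float) -> float:
    """Upper bound on URL/s when each thread makes one request at a time."""
    if threads <= 0 or ms_per_request <= 0:
        raise ValueError("both arguments must be positive")
    return threads * 1000 / ms_per_request


nearby = max_crawl_rate(threads=4, ms_per_request=200)
far_away = max_crawl_rate(threads=4, ms_per_request=200 + 150)

print(f"Nearby server:  {nearby:.1f} URL/s")    # Nearby server:  20.0 URL/s
print(f"Distant server: {far_away:.1f} URL/s")  # Distant server: 11.4 URL/s
```

Under this model the same four threads lose over 40% of their throughput purely to distance, which is one reason a cloud machine close to the server can be worth the trouble.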

The main advice on offer here is to treat the process as an experiment, and learn over time the best way to crawl each website with the machine(s) you have available.