How to crawl really fast

Sitebulb is designed to be a responsible crawler, with rate limiting in place that slows the tool down if the server appears to be struggling.

However, this can be too limiting for some users, who want (or need) to crawl a lot faster than the default settings allow. This document explains how you can override these settings to crawl as fast as you can.

HTML Crawler > Chrome Crawler

The first thing to note, when it comes to speed, is that crawling with the HTML Crawler is much quicker than crawling with the Chrome Crawler, so if speed is a concern, use the HTML Crawler where possible.

HTML Crawler is faster

The second thing to note is that all the other 'speed related' settings are in the Advanced Settings, which are down here:

Advanced Settings

HTML Crawler: Speed Settings

The speed settings are under the Crawler -> Speed tab in the Advanced Settings:

Speed Settings

You can select the number of threads you wish to use when crawling. How fast you can crawl, and the effect this has on your machine, both depend on the number of logical processors (cores) that your machine has.

Threads HTML Crawler

In the example above, the machine only has 4 logical processors, so increasing the threads above 4 will start to hammer your CPU. You can increase this up to 16 threads; however, auditing with far more threads than the machine has logical processors can lead to thread starvation, which causes your computer to slow down and sometimes crash.
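If you are unsure how many logical processors your machine has, a short script can report the number. The sketch below is illustrative only: the function name suggested_thread_count and the rule of capping threads at the logical processor count are assumptions based on the guidance above, not Sitebulb's own logic. The 16-thread ceiling is the maximum mentioned above.

```python
import os


def suggested_thread_count(logical_processors: int, tool_maximum: int = 16) -> int:
    """Return a conservative thread count for a given machine.

    Assumption: a safe starting point is one thread per logical processor,
    never exceeding the tool's maximum (16, as described in this guide).
    This is a rule of thumb, not how Sitebulb itself chooses a value.
    """
    if logical_processors < 1:
        raise ValueError("logical_processors must be at least 1")
    return min(logical_processors, tool_maximum)


if __name__ == "__main__":
    # os.cpu_count() reports logical processors and can return None
    # when the count cannot be determined, so fall back to 1.
    detected = os.cpu_count() or 1
    print(f"Logical processors detected: {detected}")
    print(f"Conservative starting point: {suggested_thread_count(detected)} threads")
```

On a machine with 4 logical processors this suggests 4 threads. Going above that number is still possible in the settings, with the trade-offs described above.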

There is also a default limit applied via the tickbox Limit URL Speed, which you can override either by unticking the box or by changing the dropdown value for Max HTML URLs per Second.

Untick thread limits
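To get a feel for how much a URLs-per-second cap affects total crawl time, some quick arithmetic helps. The sketch below is a rough estimate under stated assumptions: the function names, the URL count and the speed values are all hypothetical and are not Sitebulb defaults. It also ignores server response times and thread counts, so it gives a lower bound rather than a prediction.

```python
import math


def minimum_crawl_seconds(url_count: int, max_urls_per_second: float) -> float:
    """Lower bound on crawl time when a URLs-per-second cap is the only limit."""
    if url_count < 0:
        raise ValueError("url_count must not be negative")
    if max_urls_per_second <= 0:
        raise ValueError("max_urls_per_second must be greater than zero")
    return url_count / max_urls_per_second


def format_duration(seconds: float) -> str:
    """Format a number of seconds as hours, minutes and seconds."""
    total = math.ceil(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    urls = 100_000  # hypothetical site size
    for speed in (5, 25):  # hypothetical caps, in URLs per second
        estimate = format_duration(minimum_crawl_seconds(urls, speed))
        print(f"{urls} URLs at {speed} URLs/sec: at least {estimate}")
```

With these example numbers, a cap of 5 URLs per second means at least 5h 33m 20s for 100,000 URLs, while a cap of 25 URLs per second brings the lower bound down to 1h 06m 40s.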

This limitation exists to help you crawl responsibly, and if you want to learn more about that we suggest you read our guide on crawling responsibly.

However, if you are looking for the fastest crawl you can do with Sitebulb, do the following:

  1. Select the HTML Crawler
  2. Push 'Number of Threads' up to the maximum
  3. Untick 'Limit URL Speed'

Please note that this is still limited by the machine itself. A computer with 16 cores will be able to crawl faster than one with 8 cores, all else being equal.

Chrome Crawler: Speed Settings

If you selected the Chrome Crawler from the Crawler Settings, the first screen of the Advanced Settings will look slightly different.

There are no thread options or URL speed limiting options (as neither is applicable when crawling with Chrome). Instead, there is simply the option to select how many Chrome instances you wish to use for crawling. Again, this is dependent upon the number of logical processors you have on your machine, and pushing the value up may have adverse effects on your machine while crawling.

Chrome Instances

Word of Warning

Please note that we only recommend pushing up the speed options if you have permission to crawl and the website owner is comfortable with you crawling the website fast. Ideally, this would be a site you know can handle a high number of connections at once.

If you want to learn more about this subject, we suggest you read our guide on crawling responsibly.