How to crawl really fast

Sitebulb is designed to be a responsible crawler, with rate limiting in place that slows the tool down if the server appears to be struggling.

However, this can be too limiting for some users, who want (or need) to crawl a lot faster than the default settings allow. This document explains how you can override these settings to crawl as fast as possible.

HTML Crawler > Chrome Crawler

The first thing to note about speed is that it is much quicker to crawl using the HTML Crawler than the Chrome Crawler, so if speed is a concern, use the HTML Crawler where possible.

Speed Settings HTML Crawler

You can select the crawler you wish to use in the Crawler Configuration settings from the left-hand menu. All the other speed-related settings are also in these Crawler Settings:

Crawler Settings

HTML Crawler: Speed Settings

You can adjust the speed by updating these settings:

Speed settings for the HTML Crawler

The biggest difference-maker is the number of threads you use when crawling. How fast you can crawl, and the effect it has on your machine, depends on the number of logical processors (cores) your machine has.

Threads HTML Crawler

In the example above, the machine has only 4 logical processors, so increasing the threads above 4 will start to hammer the CPU. You can increase this up to 16 threads. However, crawling with far more threads than the machine has logical processors can lead to thread starvation, which causes your computer to slow down and sometimes crash.
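If you are not sure how many logical processors your machine has, a quick check outside Sitebulb can help you pick a sensible thread value. The following is a minimal Python sketch, not part of Sitebulb: the helper name `suggested_thread_count` is hypothetical, and the only figure taken from this guide is the 16-thread maximum.

```python
import os

# Upper limit for 'Number of Threads' as described in this guide.
SITEBULB_MAX_THREADS = 16


def suggested_thread_count(logical_processors=None, max_threads=SITEBULB_MAX_THREADS):
    """Return a conservative thread value (hypothetical helper).

    The value never exceeds the number of logical processors,
    and never exceeds the 16-thread maximum.
    """
    if logical_processors is None:
        # os.cpu_count() reports logical processors and may return None.
        logical_processors = os.cpu_count() or 1
    return max(1, min(logical_processors, max_threads))


if __name__ == "__main__":
    print("Logical processors:", os.cpu_count())
    print("Conservative thread setting:", suggested_thread_count())
```

This gives a cautious starting point. You can still push the setting higher in Sitebulb, with the risks described above.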

There is also a default limit applied via the Limit URL Speed tickbox, which you can override either by unticking the box or by changing the dropdown value for Max HTML URLs per Second.

Untick limit URL speed

This limitation exists to help you crawl responsibly, and if you want to learn more about that we suggest you read our guide on crawling responsibly.
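To see what a URLs-per-second cap means in practice, it can help to work out the shortest possible crawl time for a site of a given size. The sketch below is plain arithmetic in Python. The function names and the example figures (100,000 URLs at 5 URLs per second) are illustrative assumptions, not Sitebulb defaults.

```python
def min_crawl_seconds(total_urls, max_urls_per_second):
    """Lower bound on crawl time when speed is capped at max_urls_per_second."""
    if max_urls_per_second <= 0:
        raise ValueError("max_urls_per_second must be positive")
    return total_urls / max_urls_per_second


def format_duration(seconds):
    """Format a number of seconds as hours, minutes and seconds."""
    hours, remainder = divmod(int(round(seconds)), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    # Illustrative figures only: 100,000 URLs capped at 5 URLs per second.
    print(format_duration(min_crawl_seconds(100_000, 5)))  # 5h 33m 20s
```

The real crawl will take at least this long, since the cap is a ceiling and the server and your machine may be slower than the cap allows.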

However, if you are looking for the fastest crawl you can do with Sitebulb, do the following:

  1. Select the HTML Crawler
  2. Push 'Number of Threads' up to the maximum
  3. Untick 'Limit URL Speed'

Please note that crawl speed is still limited by the machine itself. All else being equal, a computer with 16 cores will be able to crawl faster than one with 8 cores.

Chrome Crawler: Speed Settings

If you selected the Chrome Crawler from the Crawler Type dropdown, the Crawler Configuration page will look slightly different.

There is the option to select how many Chrome instances you wish to use for crawling. Again, this is dependent upon the number of logical processors you have on your machine, and pushing the value up may have adverse effects on your machine while crawling.

Adjusting these values will affect how fast Sitebulb is able to crawl:

  • Render Timeout - this determines how long Sitebulb will pause to wait for content to render, before parsing the HTML. The lower the value you use, the faster Sitebulb will crawl.
  • Instances of Chrome - this determines how many logical processors will be used for rendering with headless Chrome, and is dependent upon the number of logical processors available on your machine (just like threads with the HTML Crawler). The higher the value you use, the faster Sitebulb will crawl, within the limitations of your machine.
  • Limit URL Speed - if ticked, the crawler will not exceed the maximum number of URLs per second set below.
  • Max HTML URLs per Second - in addition to the instances of Chrome selected above, you can further limit the speed of the crawler by capping the number of URLs crawled per second. Lower speeds limit the number of concurrent connections, which helps prevent server slowdown for website users.
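To get a feel for how these settings interact, here is a rough back-of-the-envelope model in Python. It is a sketch built on simplifying assumptions, not a description of how Sitebulb actually schedules work: it assumes each Chrome instance handles one URL at a time, waits the full Render Timeout for each one, and ignores network and parsing time. The function name and example values are hypothetical.

```python
def estimated_chrome_urls_per_second(instances, render_timeout_seconds, url_cap=None):
    """Rough ceiling on Chrome Crawler throughput (assumed model).

    Assumes each instance renders one URL at a time and waits the
    full render timeout, so throughput cannot exceed
    instances / render_timeout_seconds. An optional URL cap
    (Max HTML URLs per Second) lowers the ceiling further.
    """
    if instances < 1 or render_timeout_seconds <= 0:
        raise ValueError("instances must be >= 1 and render timeout must be positive")
    ceiling = instances / render_timeout_seconds
    if url_cap is not None:
        ceiling = min(ceiling, url_cap)
    return ceiling


if __name__ == "__main__":
    # Illustrative values only.
    print(estimated_chrome_urls_per_second(4, 2.0))             # 2.0
    print(estimated_chrome_urls_per_second(8, 1.0, url_cap=5))  # 5
```

Under this model, lowering the Render Timeout or adding instances raises the ceiling, while the URL cap always wins when it is the smaller number. Real-world speeds will be lower.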

If you wish to learn more about adjusting the crawling speed, we suggest you read our documentation on how to control URLs/second for the Chrome Crawler.

Word of Warning

Please note that we only recommend pushing up the speed options if you have permission to crawl and the website owner is comfortable with you crawling the website fast. Ideally, this would be a site you know can handle a high number of connections at once.

If you want to learn more about this subject, we suggest you read our guide on crawling responsibly.