Advanced Settings: Crawler

To access the Advanced Settings, scroll to the bottom of the main settings page and click the 'Advanced Settings' button on the left.

The Advanced Settings are organised using a double-tabbed system. The tabs at the top are: 'Crawler', 'Authorization' and 'Robots'.

This article will focus only on the 'Crawler' settings, starting with the first tab, 'Limits':

Crawler: Limits

These options give you various ways to limit the crawler so that it crawls fewer pages and less content.

  • Maximum URLs to Audit - The total number of URLs Sitebulb will Audit - note that this includes external URLs and page resource URLs. Once it hits this limit, Sitebulb will stop crawling and generate the reports.
  • Maximum Download Size - The maximum size of the data Sitebulb will download for each URL. If you have some pages with a lot of HTML content, some of them may be limited by this setting.
  • Maximum Crawl Depth - The number of levels deep Sitebulb will crawl (where the homepage is 0 deep, all URLs linked from the homepage are 1 deep, and so on) - see the sketch after this list.
  • HTTP Response Timeout - The maximum time for a resource to respond before Sitebulb times out and moves on to the next one.
  • Redirects to Follow - For instances where Sitebulb discovers chained redirects, this determines how many it will follow before stopping.
  • Analyse Links - This is on by default, and means that Sitebulb will save incoming and outgoing links, along with anchor text, for all URLs. If you are crawling a particularly big site, this is one option you can switch off to save massively on resources.
  • Check subdomain URLs - This is on by default, and means that Sitebulb will check the HTTP status of any linked URLs that are on a subdomain of the main root domain. Note that it does not actually crawl them, just checks HTTP status.
  • Check External URLs - This is on by default, and means that Sitebulb will check the HTTP status of any linked external URLs. Note that it does not actually crawl them, just checks HTTP status.
  • Enable Cookies - This will persist cookies throughout the crawl, which is necessary for crawling some websites.
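
To make these limits concrete, here is a minimal sketch in Python of a breadth-first crawler that honours the same kinds of caps. This is purely illustrative - it is not Sitebulb's actual implementation, and the function name, limit values and libraries used (requests, BeautifulSoup) are all assumptions for the example:

    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    MAX_URLS = 10_000                # Maximum URLs to Audit
    MAX_DOWNLOAD_BYTES = 512 * 1024  # Maximum Download Size
    MAX_DEPTH = 10                   # Maximum Crawl Depth (homepage = 0)
    TIMEOUT_SECONDS = 30             # HTTP Response Timeout
    MAX_REDIRECTS = 5                # Redirects to Follow

    def crawl(start_url):
        session = requests.Session()          # a session also persists cookies, as per 'Enable Cookies'
        session.max_redirects = MAX_REDIRECTS
        seen = {start_url}
        queue = deque([(start_url, 0)])       # (url, depth) pairs
        audited = 0
        while queue and audited < MAX_URLS:   # stop once the URL cap is hit
            url, depth = queue.popleft()
            try:
                response = session.get(url, timeout=TIMEOUT_SECONDS, stream=True)
                html = response.raw.read(MAX_DOWNLOAD_BYTES, decode_content=True)
            except requests.RequestException:
                continue                      # timed out or too many redirects; move on
            audited += 1
            if depth >= MAX_DEPTH:
                continue                      # audit this URL, but go no deeper
            for link in BeautifulSoup(html, 'html.parser').find_all('a', href=True):
                absolute = urljoin(url, link['href']).split('#')[0]
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))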

Crawler: Languages

This setting is only necessary if you're crawling an international site (i.e. a site that serves different content for different regions/languages) - and even more specifically, only a site that redirects the user based on the Accept-Language request header.

When this does apply, you will need to do some investigation of the site first, to understand what language patterns to set up. Essentially, you can tell Sitebulb to use different request headers when crawling particular URLs.

You need to set up the default language, which will be the language used in the Accept-Language header by default, and then set up the language variations using 'Add New Language.'
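
To illustrate what the header itself does, here is a short Python sketch (again, not Sitebulb's own code - the URL and language values are placeholders): the same URL is requested with different Accept-Language values, and a site of this kind will respond differently to each.

    import requests

    url = 'https://example.com/'  # placeholder URL

    # Request with the default language header
    default = requests.get(url, headers={'Accept-Language': 'en-US,en;q=0.9'})

    # Request with a language variation, as added via 'Add New Language'
    variant = requests.get(url, headers={'Accept-Language': 'de-DE,de;q=0.9'})

    # A site that redirects on Accept-Language may resolve each
    # request to a different regional URL
    print(default.url)
    print(variant.url)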

Excluded URLs

Using Excluded URLs is another method for restricting the crawler, and this method allows you to specify URLs or entire directories to avoid.

Any URL that matches the excluded list will not be crawled at all. This also means that any URL only reachable via an excluded URL will also not be crawled, even if it does not match the excluded list.

As an example, if I were crawling this website and wanted to avoid all the 'Product' pages, I would simply add the line:
/product/
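
Conceptually, each line acts as a pattern matched against the URL. Here is a minimal sketch of the idea in Python - the exact matching rules below are an assumption for illustration, not Sitebulb's documented behaviour:

    from urllib.parse import urlparse

    EXCLUDED = ['/product/']  # one pattern per line in the settings box

    def is_excluded(url):
        # Assumed behaviour: a URL is excluded if any pattern
        # appears in its path (not Sitebulb's specified rules)
        path = urlparse(url).path
        return any(pattern in path for pattern in EXCLUDED)

    print(is_excluded('https://example.com/product/blue-widget'))  # True
    print(is_excluded('https://example.com/blog/latest-news'))     # False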

Included URLs

Using Included URLs is another method for restricting the crawler, and this method allows you to restrict the crawl to only the URLs or directories specified.

As an example, if I were crawling this website and only wanted to crawl the 'Product' pages, I would simply add the line:
/product/

It is worth noting a couple of things:

  • Excluded URLs override Included URLs, so ensure your rules do not clash (see the sketch after this list).
  • Your Start URL must contain at least one link to an Included URL, otherwise the crawler will simply crawl one URL and then stop.
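
To make the precedence rule concrete, here is a small Python sketch (with the same caveat as above: the matching logic is an assumed simplification) showing an exclusion winning even when a URL also matches the included list:

    def should_crawl(url, included, excluded):
        # Assumed matching: simple substring checks against the URL.
        # Excluded URLs override Included URLs.
        if any(pattern in url for pattern in excluded):
            return False
        return any(pattern in url for pattern in included)

    # This URL matches both lists, so the exclusion wins: not crawled
    print(should_crawl('https://example.com/product/sale/item-1',
                       included=['/product/'],
                       excluded=['/product/sale/']))  # False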

Proxy

There's no requirement to use a proxy to crawl with Sitebulb, but in certain circumstances it can be necessary.

To configure Sitebulb to crawl via a proxy server, simply add the host and port. If the proxy needs username/password authentication, you will need to add these below.
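
For context, a host/port plus optional credentials maps onto the standard proxy URL form, http://user:pass@host:port. The Python sketch below shows the equivalent configuration outside Sitebulb - the host, port and credentials are placeholders:

    import requests

    # Placeholder host, port and credentials
    proxy_url = 'http://myuser:mypass@proxy.example.com:8080'

    session = requests.Session()
    session.proxies = {'http': proxy_url, 'https': proxy_url}

    # Every request is now routed through the proxy
    response = session.get('https://example.com/')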

Note that using a proxy adds an additional HTTP layer to every request, which will make the crawler a bit slower and make data such as Site Speed less reliable.