Crawler Configuration

For certain websites or audits, you may wish to have more control over how Sitebulb crawls, which you can do via the Advanced Settings.

To get to the Advanced Settings, scroll to the bottom of the main Audit setup page and hit the grey Advanced Settings button.

The Configuration section is under Crawler -> Configuration.


These options give you various ways to control some of the checks that Sitebulb does by default:

  • HTTP Response Timeout - The maximum time Sitebulb will wait for a resource to respond before timing out and moving on to the next one.
  • Redirects to Follow - When Sitebulb discovers chained redirects, this determines how many it will follow before stopping (both settings are sketched in the first code example after this list).
  • Check Duplicate Content - This is on by default, and means that Sitebulb will perform duplicate content analysis across all indexable HTML URLs on the website. If you are crawling a particularly big site, this is one option you can switch off to massively save on resources.
  • Check Readability - This is on by default, and means that Sitebulb will perform readability and sentiment analysis across all indexable HTML URLs on the website. If you are crawling a particularly big site, this is one option you can switch off to save on resources. In particular, you may wish to do this if the website is in a language other than English, since the readability and sentiment checks only work in English.
  • Analyse Links - This is on by default, and means that Sitebulb will save incoming and outgoing links, along with anchor text, for all URLs. If you are crawling a particularly big site, this is one option you can switch off to massively save on resources.
  • Check Subdomain URLs - This is on by default, and means that Sitebulb will check the HTTP status of any linked URLs that are on a subdomain of the main root domain. Note that it does not actually crawl them, it just checks the HTTP status.
  • Check External URLs - This is on by default, and means that Sitebulb will check the HTTP status of any linked external URLs. Again, it does not actually crawl them, it just checks the HTTP status (see the status-check sketch after this list).
  • Enable Cookies - Checking this option will persist cookies throughout the crawl, which is necessary for crawling some websites (see the cookie sketch at the end of this list).
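
To make the timeout and redirect settings more concrete, here is a minimal Python sketch of the two ideas using the `requests` library. This is not Sitebulb's implementation, and the timeout and redirect values are illustrative rather than Sitebulb's defaults:

```python
from urllib.parse import urljoin

import requests

# Illustrative values, not Sitebulb's defaults.
RESPONSE_TIMEOUT = 10  # seconds to wait for a resource before giving up
MAX_REDIRECTS = 5      # chained redirects to follow before stopping

def fetch_with_limits(url):
    """Fetch a URL, enforcing a response timeout and a redirect cap."""
    for _ in range(MAX_REDIRECTS + 1):
        response = requests.get(url, timeout=RESPONSE_TIMEOUT,
                                allow_redirects=False)
        if response.status_code not in (301, 302, 303, 307, 308):
            return response  # a final, non-redirect response
        # Follow the redirect chain manually so the hops can be counted.
        url = urljoin(url, response.headers["Location"])
    return None  # chain longer than MAX_REDIRECTS, so stop following it
```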
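
The distinction between "checking" and "crawling" a URL in the subdomain and external URL options can be illustrated as follows: a status check only asks for the response code, and never parses the page or discovers further links from it. A hedged sketch, not Sitebulb's code; the function name is hypothetical, and note that some servers mishandle HEAD requests, so a real crawler may fall back to GET:

```python
import requests

def check_status(url, timeout=10):
    """Return a URL's HTTP status code without crawling the page.

    Only the status is requested (via HEAD); the body is never parsed,
    so no further links are discovered from the target URL.
    """
    try:
        # allow_redirects=False reports the redirect status itself
        # (e.g. 301) rather than silently following it.
        response = requests.head(url, timeout=timeout, allow_redirects=False)
        return response.status_code
    except requests.RequestException:
        return None  # DNS failure, timeout, connection refused, etc.

# Usage, with a hypothetical subdomain of the site being audited:
# check_status("https://blog.example.com/")  # -> 200, 404, 301, ...
```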
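
Finally, cookie persistence is the difference between sending every request "cold" and carrying session state forward, which some sites (for example, those behind a login or a consent wall) require. A minimal sketch of the concept using `requests.Session`, which stores cookies from each response and sends them on subsequent requests; the URLs are hypothetical:

```python
import requests

# Without a session, each request is independent: any cookie the server
# sets on the first response is discarded before the second request.
requests.get("https://example.com/accept-cookies")  # hypothetical URL
requests.get("https://example.com/members-page")    # arrives cookie-less

# With a session, cookies persist across requests for the whole crawl,
# which is the behaviour the Enable Cookies option describes.
session = requests.Session()
session.get("https://example.com/accept-cookies")  # server sets a cookie
session.get("https://example.com/members-page")    # cookie sent automatically
```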