Advanced Robots Settings

To crawl certain websites, you may need to adjust some of the default robots settings, which you can do via Advanced Settings.

To get to Advanced Settings, scroll to the bottom of the main Audit setup page and click the grey Advanced Settings button.

The robots settings are found under Crawler -> Robots.

Robots Directives

By default, the Sitebulb crawler will respect robots directives, but you can override this by unticking the 'Respect Robots Directives' box.

Unticking it reveals three new options, which allow you to control more precisely which robots directives are ignored.

  • Crawl Disallowed URLs - The crawler will ignore Disallow directives in robots.txt.
  • Crawl Internal Nofollow - The crawler will ignore any nofollow directives on internal links.
  • Crawl External Nofollow - The crawler will ignore any nofollow directives on external links (see the examples below).
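
For reference, this is what the directives in question look like in practice. The domain and paths here are placeholders, not anything specific to Sitebulb:

    # robots.txt - Disallow blocks matching URLs for all user agents
    User-agent: *
    Disallow: /private/

    <!-- an internal link carrying a nofollow directive -->
    <a href="/private/report.html" rel="nofollow">Quarterly report</a>

With 'Respect Robots Directives' ticked, Sitebulb would crawl neither URL; ticking the options above tells it to crawl them anyway.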

Additionally, this section will allow you to specify the following:

  • Save Disallowed URLs - Off by default, ticking this will cause Sitebulb to save any disallowed URLs as 'Uncrawled'.
  • Don't Crawl Canonical URLs - Off by default, ticking this will stop Sitebulb scheduling and crawling any URLs found via the canonical link element.
  • Don't Crawl Alternate URLs - On by default, this stops Sitebulb scheduling and crawling any URLs found via the alternate link element.
  • Don't Crawl Pagination URLs - Off by default, ticking this will stop Sitebulb scheduling and crawling any URLs found via rel="next" or rel="prev", in the HTML or in the HTTP header (all three link elements are illustrated below).
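
For context, these link elements typically sit in the page <head>. All URLs here are illustrative:

    <!-- canonical link element -->
    <link rel="canonical" href="https://example.com/page/" />

    <!-- alternate link element, e.g. a language variant -->
    <link rel="alternate" hreflang="de" href="https://example.com/de/page/" />

    <!-- pagination link elements -->
    <link rel="prev" href="https://example.com/page/?p=1" />
    <link rel="next" href="https://example.com/page/?p=3" />

The pagination directive can equally be delivered as an HTTP header:

    Link: <https://example.com/page/?p=3>; rel="next"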

User Agent

By default, Sitebulb will crawl using the Sitebulb user agent, but you can change this by selecting a different one from the dropdown, which contains a number of preset options.
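
As a rough illustration, selecting a Googlebot preset would change the User-Agent header sent with every request to something like the following (the exact preset names and strings depend on your Sitebulb version):

    GET /page/ HTTP/1.1
    Host: example.com
    User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This interacts with the robots settings above, because robots.txt directives can target specific user agents (e.g. 'User-agent: Googlebot').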

Virtual robots.txt

This setting allows you to override the website's robots.txt file and instead use a 'virtual robots.txt' file.

To use it, click the green 'Fetch Current Robots.txt' button, which will populate the text box with the website's current robots.txt directives.

Then delete or adjust the existing directives, or add new lines underneath. Sitebulb will follow these directives instead of the original ones; the live robots.txt on the website itself is not changed.
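
For example, if the fetched robots.txt blocked a section you want to audit, you could loosen the virtual copy like this (the paths are illustrative):

    # fetched robots.txt - blocks both sections
    User-agent: *
    Disallow: /blog/
    Disallow: /admin/

    # edited virtual robots.txt - /blog/ can now be crawled,
    # /admin/ stays blocked
    User-agent: *
    Disallow: /admin/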