Sometimes you may find that you want (or need) to restrict the crawler in order for it to crawl the website as you want it to be crawled.
For example:
In all of these cases, you will need to configure the crawler to exclude certain URLs so that they do not end up being added to the crawl queue, which you can do via the Include & Exclude URLs option from the left-hand menu of the audit setup.
As you navigate through the tabs in this window, you will see that there are 6 different ways to limit the crawl, each of which is covered below.
Using Internal URL Exclusions is a method for restricting the crawler that allows you to specify URLs or entire directories to avoid. This tab also allows you to select URL Query String Parameters exclusion settings.
Any URL that matches the excluded list will not be crawled at all. This also means that any URL only reachable via an excluded URL will not be crawled, even if it does not match the excluded list.
As an example, if I were crawling the Sitebulb website and wanted to avoid all the 'Product' pages, I would simply add the line:
/product/
By default, Sitebulb will crawl all internal URLs with query string parameters. However, on some sites, you may wish to avoid this. For example, on sites with a large, crawlable, faceted search system.
To stop Sitebulb from crawling all URLs with (any) parameters at all, untick the 'Crawl Parameters' box. In the box below for 'Safe Query String Parameters', you can add in parameters which you do want Sitebulb to crawl, such as pagination parameters (e.g. 'page' or 'p').
Using Internal URL Paths Inclusion settings is a method for restricting the crawler to only the URLs or directories specified.
As an example, if I were crawling the Sitebulb website and only wanted to crawl the 'Product' pages, I would simply add the line:
/product/
It is worth noting a couple of things:
For any URLs included in the seed list, Sitebulb will also parse the HTML on these pages and extract links - in addition to the Start URL and any other pages crawled.
In conjunction with the inclusion paths you have listed, this feature can be useful when using inclusion rules in cases where the start URL doesn't contain links to all the paths you wish to crawl.
You can also use this feature, in conjunction with Subdomain Options, to ensure all your subdomains are found and crawled.
URL Rewriting is a method for instructing Sitebulb to modify URLs it discovers on the fly. It is most useful when you have a site that appends parameters to URLs in order to track things like the click path. Typically these URLs are canonicalized to the 'non-parameterized' version, which really just completely mess up your audit...unless you use URL rewriting.
You use URL Rewriting to strip parameters, so for example:
Can become:
And you end up with 'clean' URLs in your audit.
To set up the example above, you would enter the parameter 'ut_source' in the box. If you also wish to add other parameters, add one per line.
Then, you can test your settings at the bottom by entering example URLs.
Alternatively, the top tickboxes at the top allow you to automatically rewrite all upper case characters into lower case, or remove ALL parameters, respectively. The latter option means you do not need to bother writing parameters into the box, it will just strip everything.
The Include & Exclude URLs settings also allow you to set exclusions for external domains at the audit level. On the External Domain Exclusions tab you can specify domains that should be excluded in the External Page analysis or from Page Resources.
If you want to exclude both links and page resources from a certain domain, you will need to include it in both lists.
Third party URLs, like ads and tracking scripts can cause audits to get bloated with URLs that generally don't need to be audited. Blocking them will result in a much cleaner audit, so Sitebulb does this by default - untick this option to include them in your audit.
Hubspot platform, in particular, can spawn tons of tracking scripts, so again this is blocked by default.
There is an instance where you may wish to unblock tracking scripts in order to get better audit data, and this is when the website is set up to dynamically insert or change on-page data via Google Tag Manager (read about how to set this up here).