URL Exclusions

Sometimes you may find that you want (or need) to restrict the crawler in order for it to crawl the website as you want it to be crawled.

For example:

  • Certain folders on the website account for thousands of extra pages, and offer very little insight, so you would prefer not to crawl them.
  • You are working on a specific section of the website, so you only want to crawl certain areas of it.
  • When the crawler encounters faceted navigation, it spawns thousands (or millions!) of variations of effectively the same URL, which all get added to the crawl queue and take forever to crawl.
  • The website adds click path parameters to every URL, which also have a canonical to the original URL, so your website audit becomes much less usable.

In all of these cases, you will need to configure the crawler to exclude certain URLs so that they do not end up being added to the crawl queue, which you can do via the URL Exclusions option from the left hand menu of the audit setup.

URL Exclusions

As you scroll down the right hand side, you will see that there are 5 different ways to exclude URLs, each of which is covered below.

URL to Exclude

Using Excluded URLs is a method for restricting the crawler, and this method allows you to specify URLs or entire directories to avoid.

Any URL that matches the excluded list will not be crawled at all. This also means that any URL only reachable via an excluded URL will also not be crawled, even if it does not match the excluded list.

The list is pre-filled with some common patterns, which you can either over-write or add to using the lines underneath. As an example, if I were crawling the Sitebulb website and wanted to avoid all the 'Product' pages, I would simply add the line:
/product/

URL to Exclude

URLs to Include

Using Included URLs is a method for restricting the crawler, and this method allows you to restrict the crawl to only the URLs or directories specified.

As an example, if I were crawling the Sitebulb website and only wanted to crawl the 'Product' pages, I would simply add the line:
/product/

URLs to include

It is worth noting a couple of things:

  • Excluded URLs over-ride included URLs, so ensure your rules do not clash.
  • Your Start URL must contain at least one link to an included URL, otherwise the crawler will simply crawl 1 URL and then stop.

Query String Parameters

By default Sitebulb will crawl all internal URLs with query string parameters. However, on some sites you may wish to avoid this, such as on sites with a large, crawlable, faceted search system.

To stop Sitebulb crawling all URLs with (any) parameters at all, untick the 'Crawl Parameters' box. In the box below for 'Safe Query String Parameters', you can add in parameters which you do want Sitebulb to crawl, such as pagination parameters (e.g. 'page' or 'p').

Exclude query string parameters

URL Rewriting

URL Rewriting is a method for instructing Sitebulb to modify URLs it discovers on the fly. It is most useful when you have a site that appends parameters to URLs in order to track things like the click path. Typically these URLs are canonicalized to the 'non-parameterized' version, which really just completely mess up your audit...unless you use URL rewriting.

You use URL Rewriting to strip parameters, so for example:

  • https://sitebulb.com/category/?ut_source=megamenu

Can become:

  • https://sitebulb.com/category/

And you end up with 'clean' URLs in your audit.

To set up the example above, you would enter the parameter 'ut_source' in the box in the middle of the page. If you also wish to add other parameters, add one per line.

Alternatively, the top tickboxes at the top allow you to automatically rewrite all upper case characters into lower case, or remove ALL parameters, respectively. The latter option means you do not need to bother writing parameters into the box, it will just strip everything.

Exclude parameters

Then, you can test your settings at the bottom by entering example URLs.

Test Excluded URLs

Block Third Party URLs

Third party URLs, like ads and tracking scripts can cause audits to get bloated with URLs that generally don't need to be audited. Blocking them will result in a much cleaner audit, so Sitebulb does this by default - untick this option to include them in your audit.

Hubspot platform in particular can spawn tons of tracking scripts, so again this is blocked by default.

Block third party URLs

There is an instance where you may wish to unblock tracking scripts in order to get better audit data, and this is when the website is set up to dynamically insert or change on-page data via Google Tag Manager (read about how to set this up here).