How to stop Sitebulb from crawling specific URLs

There are numerous ways to stop Sitebulb from crawling specific URLs, paths or domains. This guide consolidates all of these methods, to help you understand which rules you will need in order to customise your crawl.

As a core premise, internal and external URLs are treated differently by the software, so they will each have their own section below.

Excluding Internal URLs

There are 2 ways in which you can exclude particular internal URLs from being crawled:

  1. Excluding parameters
  2. Excluding specific URLs or paths

Both of these are found in the advanced settings, which you set up before you start the audit.

Excluding parameters

To exclude parameters, you need to navigate to the URLs -> Parameters tab and UNTICK the box Crawl Parameters (which will be ticked by default).

Exclude Parameters

This will then exclude any parametrised URLs found during the audit (e.g. example.com/shoes/amazing-shoe?colour=red). You can further tweak these settings by adding specific parameters that you do want to crawl in the box underneath (e.g. entering 'colour' would mean the above URL would actually be crawled).
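
If it helps to see the logic spelled out, here is a minimal Python sketch of how these two settings interact. This is purely an illustration of the behaviour described above - the function and its arguments are my own, not Sitebulb's actual code:

    from urllib.parse import urlparse, parse_qs

    def should_crawl(url, crawl_parameters, allowed_params=()):
        # Illustrative sketch only - not Sitebulb's actual implementation
        query = urlparse(url).query
        if not query or crawl_parameters:
            return True  # no parameters, or 'Crawl Parameters' is ticked
        # Box unticked: only crawl if every parameter is whitelisted
        return all(p in allowed_params for p in parse_qs(query))

    url = 'https://example.com/shoes/amazing-shoe?colour=red'
    should_crawl(url, crawl_parameters=True)    # True (the default behaviour)
    should_crawl(url, crawl_parameters=False)   # False - excluded
    should_crawl(url, crawl_parameters=False, allowed_params=('colour',))  # True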

Excluding paths

To exclude specific paths or URLs, you would need to navigate to the URLs -> Excluded URLs tab, then enter any paths or URLs you want the crawler to avoid.

Excluded URLs

So if you wanted to exclude the entire 'shoes' folder, you could do this by simply adding the rule:

/shoes/
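
As a rough sketch of what that rule does - assuming simple 'contains' matching on the URL path, in line with the external path matching described later in this guide, rather than Sitebulb's exact matching code:

    from urllib.parse import urlparse

    def is_excluded(url, rules):
        # Illustrative sketch: treat each rule as a substring of the URL path
        path = urlparse(url).path
        return any(rule in path for rule in rules)

    is_excluded('https://example.com/shoes/amazing-shoe', ['/shoes/'])  # True
    is_excluded('https://example.com/hats/amazing-hat', ['/shoes/'])    # False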

Excluding External URLs

When it comes to external URLs, it is worth noting that Sitebulb does not actually 'crawl' them in the first place - it merely does an HTTP status check on them. This allows you to check for broken links and redirects, without extracting and following links from another website (and accidentally crawling the entire internet...).
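
Conceptually, a status check is just a lightweight request that records the response code without parsing the page for more links to follow. A rough Python equivalent of the idea, using the requests library (this mirrors the concept, not Sitebulb's internals):

    import requests

    def check_status(url):
        # Fetch only the response headers - enough to record the status code
        # (200, 301, 404...) without downloading and parsing the page body
        response = requests.head(url, allow_redirects=False, timeout=10)
        return response.status_code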

Excluding external URLs can be controlled in two different sections:

  • In the advanced settings, which only affects a specific audit
  • In the global settings, which affects every audit

Advanced Settings

In the advanced settings, in the Crawler -> Configuration tab, you can switch off either of the options 'Check Subdomain Link Status' or 'Check External Link Status'.

Exclude external links or subdomains

In this case, subdomains refers to any subdomain of the start URL's root domain (e.g. shoes.example.com would be a subdomain of example.com), and external links are any external URLs that are not subdomains.
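
Expressed as a rough sketch - the classification below is my own illustration of that distinction, and it glosses over the public suffix list that proper root-domain extraction would need:

    from urllib.parse import urlparse

    START_HOST = 'www.example.com'   # host of the audit's start URL
    ROOT_DOMAIN = 'example.com'      # its root domain

    def classify(url):
        # Illustrative only - naive root-domain handling
        host = urlparse(url).hostname or ''
        if host == START_HOST:
            return 'internal'
        if host == ROOT_DOMAIN or host.endswith('.' + ROOT_DOMAIN):
            return 'subdomain'   # controlled by 'Check Subdomain Link Status'
        return 'external'        # controlled by 'Check External Link Status'

    classify('https://shoes.example.com/trainers')  # 'subdomain'
    classify('https://another-site.com/page')       # 'external'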

I'll give a quick example of how this works in practice. Consider our sister site, URL Profiler: it contains a number of links to the subdomain support.urlprofiler.com, and a fair number of 'regular' external links.

By default, all of these will be checked, so External Links contains a mix of subdomain URLs and regular external URLs.

Mix of external links and subdomain links

If we now untick Check Subdomain Link Status, the subdomain URLs are no longer in the audit:

No subdomain links

Conversely, if we tick the subdomains again, but untick Check External Link Status, then we will only see the subdomain links:

Subdomains only

Finally, if we untick both of them, Sitebulb will not collect any external links at all.

No external links

These settings will only apply to this specific audit (and any subsequent re-audits). If you set up a brand new project, both of these boxes would be ticked by default.

Global settings

While the above options give you most of the flexibility you need, sometimes you may require a bit more control - for instance, if you DID want to check external links and get their status codes, but DID NOT want to do this for a specific domain.

The URL Profiler site, for instance, links out to t.co a bunch of times:

Links to t.co

In order to exclude only these t.co links, you need to go to the global settings, navigate to Excluded External URLs and add 't.co' to the Excluded Hosts.

Exclude global settings

The typical use case for this is when you do want to check external links in general, but you know you have tens or hundreds of thousands of links to a specific domain, and you don't want them included in your audit as they make it more difficult to navigate - for instance, social sharing links on every single product page of an ecommerce store.

Excluding external subdomains

A quick note on external subdomains, as they are treated differently to 'internal' subdomains (i.e. subdomains of the start URL).

Consider these external links to Majestic's site from URL Profiler:

Majestic URLs

If I only wanted to exclude the link to the blog subdomain, I would need to add this rule to the Excluded Hosts:

  • blog.majestic.com

But if I wanted to exclude all of the links in the table above, including those on the blog subdomain, I would instead add the root domain to the Excluded Hosts (illustrated in the sketch below):

  • majestic.com
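
In other words, a bare root domain matches itself and every subdomain beneath it, while a fully-qualified host only matches that host. Here is a rough sketch of that matching behaviour - my reading of the examples above, not Sitebulb's published algorithm:

    from urllib.parse import urlparse

    def host_excluded(url, excluded_hosts):
        # Illustrative: a rule matches its exact host and any subdomain of it
        host = urlparse(url).hostname or ''
        return any(host == rule or host.endswith('.' + rule)
                   for rule in excluded_hosts)

    host_excluded('https://blog.majestic.com/post', ['blog.majestic.com'])  # True
    host_excluded('https://majestic.com/tools', ['blog.majestic.com'])      # False
    host_excluded('https://blog.majestic.com/post', ['majestic.com'])       # True
    host_excluded('https://majestic.com/tools', ['majestic.com'])           # True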

Excluding external paths

By adding paths to the list of Excluded Paths, you will stop any external URL that includes these paths from being scheduled and checked by the Sitebulb crawler.

Adding in 'tweet' would exclude:

  • Any URL that has tweet as a folder name (e.g. https://example.com/tweet/abc)
  • Any URL that has tweet in its filename (e.g. https://example.com/abc/tweet.php)

You can make this more specific; for instance, adding 'tweet.php' will only match URLs that contain that exact string.
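
Sketched out in the same illustrative style as before - assuming straightforward 'contains' matching on the path, which is what the examples above suggest:

    from urllib.parse import urlparse

    def path_excluded(url, excluded_paths):
        # Illustrative: a rule matches anywhere within the URL's path
        path = urlparse(url).path
        return any(rule in path for rule in excluded_paths)

    path_excluded('https://example.com/tweet/abc', ['tweet'])          # True
    path_excluded('https://example.com/abc/tweet.php', ['tweet'])      # True
    path_excluded('https://example.com/abc/tweet.php', ['tweet.php'])  # True
    path_excluded('https://example.com/tweet/abc', ['tweet.php'])      # False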