There are numerous ways in which you can stop Sitebulb from crawling specific URLs, paths or domains. This guide consolidates all of these methods, to help you understand which rules you will need to customise your crawl.
As a core premise, internal and external URLs are treated differently by the software, so they will each have their own section below.
Excluding Internal URLs
There are 2 ways in which you can exclude particular internal URLs from being crawled:
- Excluding parameters
- Excluding specific URLs or paths
Both of these are found in the advanced settings, which you set up before you start the audit.
To exclude parameters, you need to navigate to the URLs -> Parameters tab and UNTICK the box Crawl Parameters (which will be ticked by default).
This will then exclude any parametrised URLs found during the audit (e.g. example.com/shoes/amazing-shoe?colour=red). You can further tweak these settings by adding specific parameters that you do want to crawl in the box underneath (e.g. entering 'colour' would mean the above URL would actually be crawled).
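To make the parameter logic concrete, here is a minimal Python sketch of the behaviour described above. It is not Sitebulb's own code: the function name `should_crawl` and its arguments are hypothetical, and it assumes a parametrised URL is let through when at least one of its parameter names appears in the allowed list.

```python
from urllib.parse import urlsplit, parse_qsl


def should_crawl(url, crawl_parameters=True, allowed_parameters=()):
    """Simplified model of the 'Crawl Parameters' setting (hypothetical helper)."""
    query = urlsplit(url).query
    if not query:
        # URLs without a query string are unaffected by this setting.
        return True
    if crawl_parameters:
        # Box ticked (the default): parametrised URLs are crawled.
        return True
    # Box unticked: only crawl if a listed parameter is present.
    # Assumption: one matching parameter name is enough.
    names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
    return any(name in allowed_parameters for name in names)


url = "https://example.com/shoes/amazing-shoe?colour=red"
print(should_crawl(url))                                        # box ticked
print(should_crawl(url, crawl_parameters=False))                # box unticked
print(should_crawl(url, crawl_parameters=False,
                   allowed_parameters=["colour"]))              # 'colour' listed
```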
To exclude specific paths or URLs, you would need to navigate to the URLs -> Excluded URLs tab, then enter any paths or URLs you want the crawler to avoid.
So if you wanted to exclude the entire 'shoes' folder, you could do this by simply adding the rule:
Excluding External URLs
When it comes to external URLs, it is worth noting that Sitebulb does not actually 'crawl' them in the first place - it merely does an HTTP status check on them. This allows you to check for broken links and redirects, without extracting and following links from another website (and accidentally crawling the entire internet...).
Excluding external URLs can be controlled in two different sections:
- In the advanced settings, which only affects a specific audit
- In the global settings, which affects every audit
In the advanced settings, in the Crawler -> Configuration tab you have the option to switch off either of the options 'Check Subdomain Link Status' or 'Check External Link Status'.
In this case, 'subdomains' refers to any subdomain of the start URL's root domain (e.g. shoes.example.com is a subdomain of example.com), and 'external links' are any external URLs that are not subdomains.
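The distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `classify_link` is a hypothetical helper, the root domain is passed in by hand rather than derived from the start URL, and the start URL is assumed to sit on the root domain itself.

```python
from urllib.parse import urlsplit


def classify_link(url, root_domain):
    """Label a link relative to the start URL's root domain (hypothetical helper)."""
    host = (urlsplit(url).hostname or "").lower()
    root = root_domain.lower()
    if host == root:
        return "internal"
    if host.endswith("." + root):
        # e.g. shoes.example.com when the root domain is example.com
        return "subdomain"
    return "external"


print(classify_link("https://shoes.example.com/boots", "example.com"))
print(classify_link("https://twitter.com/example", "example.com"))
```

Checking for the leading dot matters here: without it, a host such as notexample.com would wrongly count as a subdomain of example.com.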
I'll give a quick example of how this works in practice. Consider our sister site, URL Profiler, which contains a number of links to the subdomain support.urlprofiler.com and a fair amount of 'regular' external links.
By default, all of these will be checked, so External Links contains a mix of subdomains and regular external URLs.
If we now untick Check Subdomain Link Status, the subdomain URLs are no longer in the audit:
Inversely, if we tick the subdomains again, but untick Check External Link Status then we will only see the subdomain links:
Finally, if we untick both of them, Sitebulb will not collect any external links at all.
These settings will only apply to this specific audit (or any subsequent re-audits). If you set up a brand new project, these two boxes would be ticked by default.
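The four combinations of the two tickboxes can be modelled roughly as below. This is a sketch, not Sitebulb's implementation: `links_to_check` is a hypothetical function, the input is assumed to contain only non-internal links, and the URL paths are made up for illustration.

```python
from urllib.parse import urlsplit


def links_to_check(urls, root_domain, check_subdomains=True, check_external=True):
    """Return the links that would get a status check (hypothetical helper)."""
    kept = []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        is_subdomain = host.endswith("." + root_domain.lower())
        if is_subdomain and check_subdomains:
            kept.append(url)       # 'Check Subdomain Link Status' is ticked
        elif not is_subdomain and check_external:
            kept.append(url)       # 'Check External Link Status' is ticked
    return kept


links = [
    "https://support.urlprofiler.com/help",  # subdomain of the root domain
    "https://t.co/abc",                      # regular external link
]
print(links_to_check(links, "urlprofiler.com"))                          # both kept
print(links_to_check(links, "urlprofiler.com", check_subdomains=False))  # external only
print(links_to_check(links, "urlprofiler.com", check_external=False))    # subdomain only
print(links_to_check(links, "urlprofiler.com",
                     check_subdomains=False, check_external=False))      # nothing kept
```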
While the above options give you most of the flexibility you need, sometimes you may require a bit more control. For instance, you might want to check external links and get their status codes in general, but NOT for one specific domain.
The URL Profiler site, for instance, links out to t.co a bunch of times:
In order to exclude only these t.co links, you need to go to the global settings, navigate to Excluded External URLs and add 't.co' to the Excluded Hosts.
The typical use case for this is if you do want to check external links in general, but you know that you have tens or hundreds of thousands of links to a specific domain and you don't want them included in your audit as they make it more difficult to navigate. For instance, social sharing links on every single product page of an ecommerce store.
Excluding external subdomains
A quick note on external subdomains, as they are treated differently to 'internal' subdomains (i.e. subdomains of the start URL).
Consider these external links to Majestic's site from URL Profiler:
If I only wanted to exclude the link to the blog subdomain, I would need to add this rule to the Excluded Hosts:
But if I wanted to exclude all of the links in the table above, I would need to add this rule to the Excluded Hosts:
Excluding external paths
By adding paths to the list of Excluded Paths you will stop any external URLs that include these paths from being scheduled and checked by the Sitebulb crawler.
Adding in 'tweet' would exclude:
- Any URLs that had /tweet/ in the folder name (e.g. https://example.com/tweet/abc)
- Any URLs that had tweet in the filename (e.g. https://example.com/abc/tweet.php)
You can limit this to make it more specific, for instance adding 'tweet.php' will only match URLs with that specific string.
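As a rough mental model, the 'tweet' examples above behave like a substring match against the URL path. The Python sketch below assumes exactly that; `is_excluded_by_path` is a hypothetical helper, and Sitebulb's real matching rules (case sensitivity, whether the query string is considered, and so on) may differ.

```python
from urllib.parse import urlsplit


def is_excluded_by_path(url, excluded_paths):
    """True if any excluded string appears in the URL path (assumed substring match)."""
    path = urlsplit(url).path
    return any(rule in path for rule in excluded_paths)


for rule in ("tweet", "tweet.php"):
    for url in ("https://example.com/tweet/abc", "https://example.com/abc/tweet.php"):
        print(rule, url, is_excluded_by_path(url, [rule]))
```

Under this assumption a short rule like 'tweet' would also catch paths such as /retweets/, which is one more reason to prefer the more specific 'tweet.php' when that is all you want to exclude.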