Crawl Maps - FAQs
This post is a collection of FAQs regarding Crawl Maps, to help you better understand what they are showing you and how to interpret them.
We have a separate article with a bunch of Crawl Map examples, which includes some common patterns you can spot using Crawl Maps, such as pagination chains.
What do the dots and lines indicate?
In graph theory parlance, a dot is referred to as a 'node' (or 'vertex'), and a line is referred to as an 'edge.'
In terms of what these represent in your Crawl Map:
- Node = URL
- Edge = Link between one URL and another
A Crawl Map is a type of 'directed graph', meaning that there is a direction associated with each edge. In particular, the direction is of an outgoing link from one URL to another.
You'll notice that any two nodes are only ever joined by one edge, even if a URL has many inlinks on the website. This is because the Crawl Map reflects the way in which Sitebulb crawled the website: for each URL, all you are seeing is the first link to it that the crawler found.
Since Sitebulb crawls using a breadth-first method, this means that URLs will appear to be linked from their 'highest' link, i.e. the link that sits at the shallowest crawl depth.
For example, if you have product pages which are generally linked from sub-category pages, these would be 3 levels deep, but if a few of these products are also linked from the homepage, those specific URLs would display as 1 level deep.
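As a rough illustration of the product page example, the sketch below walks a tiny made-up site breadth-first and records, for each URL, its depth and the page it was first found on. The `links` dictionary, the URLs and the function name `first_found_tree` are all hypothetical, and this is not Sitebulb's actual code.

```python
from collections import deque


def first_found_tree(links, start):
    """Breadth-first walk returning {url: (depth, first_found_on)}."""
    seen = {start: (0, None)}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        depth = seen[url][0]
        for target in links.get(url, []):
            # Only the first link to a URL is kept, so each node gets one edge.
            if target not in seen:
                seen[target] = (depth + 1, url)
                queue.append(target)
    return seen


# Hypothetical site: /product-a is linked from both the homepage and a sub-category.
links = {
    "/": ["/category", "/product-a"],
    "/category": ["/sub-category"],
    "/sub-category": ["/product-a", "/product-b"],
}
tree = first_found_tree(links, "/")
```

Here `/product-b` ends up 3 levels deep under `/sub-category`, while `/product-a` is displayed 1 level deep under the homepage, even though the sub-category page also links to it.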
Does Sitebulb show every URL crawled?
In early versions of Sitebulb we did this, but as soon as you crawl a site that is even remotely large, the Crawl Map becomes ridiculously unmanageable. This is an example of an early Crawl Map that we considered 'not that bad' at the time...
As a result, we implemented a number of rules to restrict which URLs can be shown in Crawl Maps:
- Only show internal, indexable, HTML URLs.
- Only show URLs with incoming links (i.e. no orphan URLs).
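Expressed as a filter, the two rules above might look something like the sketch below. The page record and its field names (`internal`, `indexable`, `content_type`, `inlinks`) are assumptions made for illustration and are not Sitebulb's data model.

```python
def eligible_for_crawl_map(page):
    """Return True if a page record passes both display rules."""
    return (
        page["internal"]                          # internal URL
        and page["indexable"]                     # indexable URL
        and page["content_type"] == "text/html"   # HTML URL
        and page["inlinks"] > 0                   # not an orphan
    )


# Hypothetical page records.
pages = [
    {"url": "/about", "internal": True, "indexable": True,
     "content_type": "text/html", "inlinks": 4},
    {"url": "/old-page", "internal": True, "indexable": True,
     "content_type": "text/html", "inlinks": 0},
    {"url": "/brochure.pdf", "internal": True, "indexable": True,
     "content_type": "application/pdf", "inlinks": 2},
]
shown = [p["url"] for p in pages if eligible_for_crawl_map(p)]
```

In this made-up data only `/about` would be shown: `/old-page` is an orphan and `/brochure.pdf` is not HTML.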
Even with these rules in place, we found that some Crawl Maps would just shoot off in all directions, with never-ending pagination chains causing crazy spiderweb graphs.
To combat this issue, we also added some rules to govern how many nodes are shown for each crawl depth (level).
For example, these are the rules used most of the time:
- Show all URLs at depth 0
- Show all children URLs at depth 1
- Show up to 50 children URLs at depths 2, 3, 4 and 5
- Show up to 10 children URLs at depths 6-10
- Show 1 child URL at depth 10+
There are slightly different rules used when the node count increases over 10,000 and 20,000, such that fewer nodes are shown.
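The default limits listed above can be sketched as a simple lookup. Several details here are assumptions rather than documented behaviour: that the cap applies per parent node, that depth 10 falls in the 6-10 band (so the single-child rule starts at depth 11), and that the first children in crawl order are the ones kept. The function names are hypothetical.

```python
def max_children_shown(depth):
    """Cap on children drawn at the given depth (None means no cap)."""
    if depth <= 1:
        return None   # depths 0 and 1: show everything
    if depth <= 5:
        return 50     # depths 2-5
    if depth <= 10:
        return 10     # depths 6-10 (assumed to include depth 10)
    return 1          # deeper than 10


def visible_children(children, depth):
    """Trim a parent's children to the cap, keeping the first ones (an assumption)."""
    children = list(children)
    cap = max_children_shown(depth)
    return children if cap is None else children[:cap]
```

For instance, a parent with 80 children at depth 3 would be drawn with 50 of them, and the same parent at depth 7 with only 10.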
What do the colours and sizes of the nodes mean?
Both the colour and the size of each node reflect the crawl depth (level) of the URL in the website crawl.
So the big green node is Depth 0 (the homepage), the lighter green nodes are Depth 1 (URLs linked from the homepage), and so on and so forth.
What does the distance between nodes mean?
The distance between nodes (i.e. the length of the lines) does not signify anything.
Some lines are longer than others simply as a means to arrange the content of the graph and fit it all in.
Why does my Crawl Map have very few nodes?
This can happen for a number of reasons, normally due to the technical setup of the website in question.
Crawl Maps only include indexable URLs, so if you have a canonicalized homepage you can end up with an empty Crawl Map. Similarly, if you have lots of nofollow or canonicalized URLs linked from the homepage, the Crawl Map may be stunted.
If you think your Crawl Map is missing major chunks of your website, then you can typically find the answer by looking at how canonicals are used across the site (see the Indexability report).
Is a Crawl Map a visualization of my sitemap?
No. It is a visualization of the crawl carried out by Sitebulb.
Sitebulb uses a breadth-first search (BFS) method in order to crawl your website. This means that when it crawls your homepage (depth 0), it finds all the links and adds these to the crawl scheduler. All of these pages will be depth 1, as they are linked directly from the homepage. Sitebulb will crawl each of these depth 1 pages, and extract all of their links (to depth 2 pages), completing all depth 1 pages first before moving on to the depth 2 pages.
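To make the 'finish one depth before starting the next' ordering concrete, here is a minimal level-by-level sketch. The `links` dictionary stands in for fetching a page and extracting its links, and the name `crawl_by_level` is hypothetical; a real crawl scheduler will be considerably more involved.

```python
def crawl_by_level(links, homepage):
    """Yield (depth, urls), completing each depth before moving to the next."""
    seen = {homepage}
    level = [homepage]
    depth = 0
    while level:
        yield depth, level
        next_level = []
        for url in level:
            # "Crawl" the page: look up its outgoing links.
            for target in links.get(url, []):
                if target not in seen:
                    seen.add(target)
                    next_level.append(target)
        level = next_level
        depth += 1


# Hypothetical site with a few links pointing back to already-seen pages.
links = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/about"],
}
levels = list(crawl_by_level(links, "/"))
```

With this sample data the homepage comes out alone at depth 0, `/about` and `/blog` at depth 1, and `/blog/post-1` at depth 2. Links back to pages that were already scheduled are ignored.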
It is this process that is mapped by a Crawl Map, so it can be considered a representation of your site architecture, but it is not a sitemap.
Do you have a glossary of terms for the overlay?
Sure! You'll find it below. First, here's a quick shot to remind you of the data we display:
- URL: The URL which the node represents.
- Title: The page title of the URL in question.
- URL Crawl Depth: The minimum number of clicks from the homepage that are required to reach the URL in question.
- First Found On: The 'source' URL, which was the first page Sitebulb encountered while crawling that linked to the URL in question.
- Link Equity Score: The Link Equity Score of the URL in question, which is a value out of 10 that is equivalent to an internal PageRank score. Pages with more links from other authority pages on the site will have a higher Link Equity Score.
- Unique Followed Links: The total number of unique internal URLs which contain at least one followed link to the URL in question.
- Followed Links: The total number of followed links to the URL in question, from other internal URLs on the website. In this instance, 3 different pages link to the page, a total of 5 times (so some pages contain more than one link to this page).
- Links from: The percentage of all internal pages which link to the URL in question. In this case there are 181 total pages, and only 3 link to this URL. (3/181)*100=1.66, which rounds up to 2%.
- Children URLs: The number of direct 'children' URLs which are 'owned' by the URL in question. A given URL is designated as a 'child' if the 'parent' page is the first one to link to it, as discovered by Sitebulb during the crawl. If a URL does have children, you should see nodes spurring off it in the Crawl Map.
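The link counts in the glossary can be reproduced with a few lines of arithmetic. The sketch below assumes the followed internal links are available as (source, target) pairs and that the percentage is rounded to the nearest whole number; the URLs and the helper name `inlink_stats` are made up for illustration.

```python
def inlink_stats(followed_links, target, total_pages):
    """Summarise followed internal links pointing at one URL."""
    sources = [src for src, dst in followed_links if dst == target]
    unique_sources = len(set(sources))
    return {
        "followed_links": len(sources),            # every followed link counts
        "unique_followed_links": unique_sources,   # each linking page counts once
        "links_from_pct": round(unique_sources / total_pages * 100),
    }


# Hypothetical data matching the example: 3 pages link to /target, 5 times in total.
followed_links = [
    ("/a", "/target"), ("/a", "/target"),
    ("/b", "/target"), ("/b", "/target"),
    ("/c", "/target"),
    ("/a", "/other"),
]
stats = inlink_stats(followed_links, "/target", total_pages=181)
```

With 181 total pages this gives 5 followed links, 3 unique followed links, and a 'Links from' value of 2%, in line with the figures quoted above.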