Traditionally, a crawler would work by extracting data from static HTML code, and up until recently, most websites you would encounter could be crawled in this manner.
However, if you try to crawl a website built with Angular in this manner, you won't get very far (literally). In order to 'see' the HTML of a web page (and the content and links within it), the crawler needs to process all the code on the page and actually render the content.
Google handles this in a 2-phase approach. Initially they crawl and index based on the static HTML (the 'first wave' of indexing). Then, once they have resources available in order to render the page, they perform a second wave of indexing based on the rendered HTML.
A friend of mine runs a website built in Backbone, and it provides a great example of what's going on.
Consider the product page for the site's most popular product. With Chrome -> Inspect, we can see the h1 on the page:
However, if we just view-source on this page, there is no h1 in sight:
Scrolling further down the view-source page will just show you a bunch of scripts and some fallback text. You simply can't see the meat and bones of the page - the product images, description, technical spec, video, and most importantly, links to other pages.
So if you tried to crawl this website in the traditional manner (using the 'HTML Crawler'), all of the data a crawler would usually extract is essentially invisible to it. And this is what you would get:
If Google came in for their first wave of indexing, all they would find is this one page. And one page with very little on it, at that.
In essence, the crawler needs to pretend to be a browser, let all the content load, and only then go and get the HTML to parse.
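To make the difference concrete, here is a minimal sketch in Python (standard library only) of what an 'HTML Crawler' extracts from the static HTML of a JavaScript-built page versus the rendered HTML. The markup, product name, and URLs here are hypothetical, invented purely for illustration:

```python
from html.parser import HTMLParser

# Static HTML as an 'HTML Crawler' would see it on a JavaScript-built site:
# just scripts and some fallback text -- no h1, no links.
STATIC_HTML = """
<html><head><script src="/app.bundle.js"></script></head>
<body><noscript>Please enable JavaScript.</noscript><div id="app"></div></body></html>
"""

# The same page after a browser has executed the JavaScript and the
# framework has injected the real content into the page.
RENDERED_HTML = """
<html><body><div id="app">
<h1>Product Name</h1>
<a href="/collections/necklaces">Necklaces</a>
<a href="/pages/about">About</a>
</div></body></html>
"""

class LinkAndHeadingCounter(HTMLParser):
    """Counts the two things a crawler cares most about: h1s and links."""
    def __init__(self):
        super().__init__()
        self.h1_count = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.h1_count += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl_view(html):
    """Return (h1 count, links found) -- what a crawler can 'see'."""
    parser = LinkAndHeadingCounter()
    parser.feed(html)
    return parser.h1_count, parser.links

print(crawl_view(STATIC_HTML))    # (0, []) -- nothing to extract, nothing to follow
print(crawl_view(RENDERED_HTML))  # (1, ['/collections/necklaces', '/pages/about'])
```

The first result is why an HTML-only crawl of a JavaScript site dead-ends after one page: with no links in the static HTML, there is nothing left to crawl.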
This is why you need a modern crawler, such as Sitebulb, set in Chrome Crawler mode, to crawl websites like this.
Every time you set up a new Project in Sitebulb, you need to choose the Analysis settings, such as checking for AMP or calculating page speed scores.
If you select the Chrome Crawler, a new settings option will appear: the 'Render Timeout.'
Most people probably won't know what the Render Timeout refers to or how to set it. If you'd prefer not to know, skip the section below and just leave it at the recommended 5 seconds. Otherwise, read on.
What is this Render Timeout?
The render timeout is essentially how long Sitebulb will wait for rendering to complete before taking an 'HTML snapshot' of each web page.
The 'Render Timeout' period used by Sitebulb starts just after #1, the Initial Request. So essentially, the render timeout is how long Sitebulb waits for everything on the page to load and render. Say you have the Render Timeout set to 4 seconds: this means that each page has 4 seconds for all the content to finish loading and for any final changes to take effect.
Anything that changes after these 4 seconds will not be captured and recorded by Sitebulb.
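The snapshot behaviour can be sketched as a simple cut-off: anything whose render completes after the timeout is missed. Here is a toy simulation in Python (the page, its content, and the load times are entirely hypothetical):

```python
RENDER_TIMEOUT = 4.0  # seconds, as in the example above

# When each piece of content finishes rendering on a hypothetical page,
# in seconds after the initial request.
render_schedule = {
    "header and navigation": 0.5,
    "product images": 2.0,
    "links to other pages": 3.5,
    "late-loading reviews widget": 6.0,
}

def snapshot(schedule, timeout):
    """Return the content present when the 'HTML snapshot' is taken."""
    return [item for item, ready_at in schedule.items() if ready_at <= timeout]

print(snapshot(render_schedule, RENDER_TIMEOUT))
# The reviews widget (ready at 6.0 s) falls outside the 4 s timeout,
# so it is never captured or recorded.
```

If the missed content happened to contain links to other pages, those pages would never be discovered at all, which is exactly what the example below demonstrates.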
Render Timeout Example
I'll demonstrate with an example, from our friends again at Bailey of Sheffield. If I crawl the site with no render timeout at all, I get a total of 30 URLs. If I use the 5 second timeout, I get 51 URLs, 21 more.
(The Audit with 1 URL crawled, if you recall, was from crawling with the HTML Crawler).
Digging into these two Chrome crawls in a little more detail, there were 14 more Internal HTML URLs found with the 5 second timeout. This means that, in the Audit with no render timeout, the content containing links to those URLs had not loaded when Sitebulb took the snapshot.
Clearly, this can have a profound impact upon your understanding of the website and its architecture, which can be highlighted by comparing the two crawl maps:
In this instance, it was very important to set the Render Timeout in order for Sitebulb to see all of the content.
Recommended Render Timeout
Understanding why the Render Timeout exists does not actually help us decide what to set it at. We have scoured the web for confirmation from Google about how long they wait for content to load, but we haven't found it anywhere.
What we did find, however, was that most people concur that 5 seconds is 'about right.' Until we see anything concrete from Google, or have a chance to perform some more tests of our own, we'll be recommending 5 seconds for the Render Timeout.
But all this will show you is an approximation of what Google may be seeing. If you want to crawl ALL the content on your site, then you'll need to develop a better understanding of how the content on your website actually renders.
To do this, we'll return to Chrome's DevTools Console. Right click on the page and hit 'Inspect', then select 'Network' from the tabs in the Console, and then reload the page. I've positioned the dock to the right of my screen to demonstrate:
Keep your eye on the waterfall graph that builds, and the timings that are recorded in the summary bar at the bottom:
So we have 3 times recorded here:
- DOMContentLoaded: 727 ms (= 0.727 s)
- Load: 2.42 s
- Finish: 4.24 s
You can find the definitions for 'DOMContentLoaded' and 'Load' in the image above, which I took from Justin Briggs' post. The 'Finish' time is exactly that: the point at which the content is fully rendered and any asynchronous scripts have completed.
Bear in mind that so far we've only looked at a single page. To develop a better picture of what's going on, you'd need to check a number of pages/page templates and check the timings for each one.
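One way to turn those measurements into a setting — this is a rule of thumb of our own, not a Sitebulb feature — is to take the slowest 'Finish' time across your sampled templates, round it up to the next whole second, and add a buffer. The figures below are hypothetical, apart from the 4.24 s from the example above:

```python
import math

# 'Finish' times (seconds) recorded in the DevTools Network tab for a
# handful of page templates.
finish_times = {
    "home": 3.1,
    "product": 4.24,
    "collection": 3.8,
    "blog post": 2.6,
}

# Round the slowest Finish time up to the next whole second, then add a
# one-second buffer for slower runs (our assumption, tune to taste).
recommended_timeout = math.ceil(max(finish_times.values())) + 1
print(recommended_timeout)  # 6
```

With these numbers, the default 5 second Render Timeout would cut things fine on the product template, and a 6 second setting would be the safer choice.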
If you are going to be crawling with the Chrome Crawler, we urge you to experiment further with the render timeout so you can set your Projects up to correctly crawl all your content every time.
There are, however, a couple of downsides to crawling with the Chrome Crawler, for example:
- Because every page must be fully rendered, crawling with the Chrome Crawler is slower than with the HTML Crawler, particularly if you have set a long Render Timeout. On some sites, and with some settings, it can end up taking 6-10x longer to complete.
The obvious first port of call: you can save time on discovery work by getting a thorough briefing from the client or their dev team.
However, whilst it is nice to think that every client briefing would give you this sort of information up front, I know from painful experience that they are not always forthcoming with seemingly obvious details...
Trying a Crawl
Ploughing head first into an Audit is actually not going to cost you too much time: if you crawl with the HTML Crawler and only a single URL comes back, you can be pretty confident the site is built with JavaScript, since even the most 'niche' websites have more than a single URL.
It is certainly worth bearing in mind though, in case you are a set-it-and-forget-it type, or you tend to leave Sitebulb on overnight with a queue of websites to Audit... by the morning you'd be bitterly disappointed.
You can also use Google's tools to help you understand how a website is put together. Using Google Chrome, right click anywhere on a web page and choose 'Inspect' to bring up Chrome's DevTools Console.
Then, leave the DevTools Console open and refresh the page. Does the content stay exactly the same, or does it all disappear?
This is what happens in my Bailey of Sheffield example:
Notice anything missing?