Updated 25 August 2021
Traditionally, a crawler would work by extracting data from static HTML code, and up until relatively recently, most websites you would encounter could be crawled in this manner.
However, if you try to crawl a website built with Angular like this, you won't get very far (literally). In order to 'see' the HTML of a web page (and the content and links within it), the crawler needs to process all the code on the page and actually render the content.
Rendering is a process carried out by the browser, taking the code (HTML, CSS, JS, etc...) and translating this into the visual representation of the web page you see on the screen.
Search engines (and crawling tools like Sitebulb) are able to do this en masse using a 'headless' browser, which is a browser that runs without the visual user interface. This works by building up the page content (i.e. 'rendering the page') then extracting the HTML after the page has rendered.
Table of contents:
- How Google handles rendering
- How Sitebulb handles rendering
- Further reading
How Google handles rendering
Since 2019, they have implemented an 'evergreen Googlebot', which means that Googlebot runs the latest Chromium rendering engine and keeps it constantly up-to-date (incidentally, we do exactly the same here at Sitebulb, so crawling with Sitebulb reflects exactly what Google sees).
Nowadays, rendering is built into Google's crawling and indexing process at a fundamental level;
The important thing to note from this diagram is that the index gets updated after rendering. Additionally, consider that Google claim they basically render every single page they encounter.
However, it is a deep, complex and interesting topic that absolutely deserves the attention of technical SEOs, so here is some more reading for you to enjoy:
How Sitebulb handles rendering
Sitebulb offers two different ways of crawling:
- HTML Crawler
- Chrome Crawler
Let's take a look...
Why only one page? Because the response HTML (the stuff you can see with 'View Source') only contains a bunch of scripts and some fallback text.
You simply can't see the meat and bones of the page - the product images, description, technical spec, video, and most importantly, links to other pages... everything a crawler needs in order to understand your page content.
On websites like this you absolutely need to use the Chrome Crawler to get back any meaningful crawl data.
Secondly, you will also need to consider the render timeout, as this affects how much of the page content Sitebulb is actually able to access.
You will find this in the Crawler Settings on the left hand side, and the Render Timeout dropdown is right underneath 'Crawler Type' on the right.
What is this render timeout?
The render timeout is essentially how long Sitebulb will wait for rendering to complete before taking an 'HTML snapshot' of each web page.
The 'Render Timeout' period used by Sitebulb starts just after #1, the Initial Request. So essentially, the render timeout is the time you need to wait for everything to load and render on the page. Say you have the Render Timeout set to 4 seconds, this means that the each page has 4 seconds for all the content to finish loading and any final changes to take effect.
Anything that changes after these 4 seconds will not be captured and recorded by Sitebulb.
Render timeout example
I'll demonstrate with an example, again using the Roku site we looked at earlier.
- In my first audit I used the HTML Crawler - 1 URL crawled
- In my second audit I used the Chrome Crawler with a 3 second render timeout - 139 URLs crawled
- In my third audit I used the Chrome Crawler was a 5 second render timeout - 144 URLs crawled
Digging into a little more detail about these two Chrome audits, there were 5 more internal HTML URLs found with the 5 second timeout. This means that, in the audit with a 3 second render timeout, the content which contains links to those URLs had not been loaded when Sitebulb took the snapshot.
I actually crawled it one my time after this with a 10 second render timeout, but there was no difference to the 5 second render timeout, which suggests that 5 seconds is sufficient to see all the content on this website.
On another example site, I experimented with not setting a render timeout at all, and crawling the site again with a 5 second timeout. Comparing the two Crawl Maps shows stark differences:
Clearly, this can have a profound impact upon your understanding of the website and its architecture, which underlines why it is very important to set the correct render timeout in order for Sitebulb to see all of the content.
Recommended render timeout
Understanding why the render timeout exists does not actually help us decide what to set it at.#
Although Google have never published anything official about how long they wait for a page to render, most industry experts tend to concur that 5 seconds is generally considered to be 'about right.'
Either way, all this will show you is an approximation of what Google may be seeing. If you want to crawl ALL the content on your site, then you'll need to develop a better understanding of how the content on your website actually renders.
To do this, head to Chrome's DevTools Console. Right click on the page and hit 'Inspect', then select 'Network' from the tabs in the Console, and then reload the page. I've positioned the dock to the right of my screen to demonstrate:
Keep your eye on the waterfall graph that builds, and the timings that are recorded in the summary bar at the bottom:
So we have 3 times recorded here:
- DOMContentLoaded: 727 ms (= 0.727 s)
- Load: 2.42 s
- Finish: 4.24 s
You can find the definitions for 'DOMContentLoaded' and 'Load' from the image above that I took from Justin Briggs' post. The 'Finish' time is exactly that, when the content is fully rendered and any changes or asynchronous scripts have completed.
Bear in mind that so far we've only looked at a single page. To develop a better picture of what's going on, you'd need to check a number of pages/page templates and check the timings for each one.
If you are going to be crawling with the Chrome Crawler, we urge you to experiment further with the render timeout so you can set your Projects up to correctly crawl all your content every time.
Rendering data from Google Tag Manager
Some SEOs utilise Google Tag Manager (GTM) in order to dynamically change on-page elements, either as a full-blown optimisation solution, or as a proof-of-concept to justify budget for 'proper' dev work.
If you are unfamiliar with this, check out Dave Ashworth's post for Organic Digital - How To: Do Dynamic Product Meta Data in Magento Using GTM - which describes how he used GTM to dynamically re-write and localise the titles and meta descriptions for thousands of pages, with impressive results;
Most other crawlers won't be able to pick up the data inserted by GTM, which means they don't allow you to actually audit this data. This is because by default they block tracking scripts, which can have the affect of bloating audit data.
Here at Sitebulb, we have accounted for that too, and actually give you the option to turn this off, so you CAN collect on-page data dynamically inserted or changed using Google Tag Manager.
To do this, when setting up your audit, head over to the 'URL Exclusions' tab on the left hand menu:
Then scroll alllllll the way down to the section entitled 'Block Third Party URLs', then you need to untick the option marked 'Block Ad and Tracking Scripts', which will always be ticked by default;
And then when you go ahead and crawl the site, Sitebulb will correctly extract the GTM-altered meta data. Note that you may need to tweak the render timeout.
Here is what Dave had to say about his experiences using Sitebulb in his auditing workflow:
And there are actually a couple of downsides to crawling with the Chrome Crawler, for example:
- As a direct result of #1 above, crawling with the Chrome Crawler is slower than with the HTML Crawler, particularly if you have set a long render timeout. On some sites, and with some settings, it can end up taking 6-10 X longer to complete.
Obviously the first port of call, you can save time doing discovery work with a thorough briefing with the client or their dev team.
However, whilst it is nice to think that every client briefing would give you this sort of information up front, I know from painful experience that they are not always forthcoming with seemingly obvious details...
Trying a crawl
Ploughing head first into an audit with the HTML Crawler is actually not going to cost you too much time, since even the most 'niche' websites have more than a single URL.
It is certainly worth bearing in mind though, in case you are a set-it-and-forget-it type, or you tend to leave Sitebulb on overnight with a queue of websites to audit... by the morning you'd be bitterly disappointed.
You can also use Google's tools to help you understand how a website is put together. Using Google Chrome, right click anywhere on a web page and choose 'Inspect' to bring up Chrome's DevTools Console.
Then, leave the DevTools Console open and refresh the page. Does the content stay exactly the same, or does it all disappear?
The Roku site, for instance, provides extremely short shrift:
Comparing response vs rendered HTML
This is where you can make use of Sitebulb's unique report: Response vs Render, which is generated automatically whenever you use the Chrome Crawler.
What this does is render the page like normal, then runs a comparison of the rendered HTML against the response HTML (i.e. the 'View Source' HTML). It will check for differences in terms of all the important SEO elements:
- Meta robots
- Page title
- Meta description
- Internal links
- External links
For the most comprehensive understanding of how this report works, check out our response vs render comparison guide.
When working with any new or unfamiliar website, part of your initial process involves discovery - what type of platform are they on, what kind of tracking/analytics are they using, how big is the website etc...
But also, knowing this could help you unpick issues with crawling or indexing, or affect how you tackle things like internal link optimisation.
A simple workflow could look like this:
- Run and exploratory Sitebulb audit using the Chrome Crawler
- Include the results of this in your audit, and make a decision for future audits as to whether the Chrome Crawler is needed or not.
If you need further convincing that this is a good idea, just ask yourself 'what would Aleyda do...?
- The SEO's Introduction to Rendering by Jamie Alberico
- Rendering on the Web – The SEO Version by Jan-Willem Bobbink
- "Rendering SEO" with Martin Splitt by Onely (Webinar)
- What We Do in the Shadow DOM by Jamie Alberico
Get the most important SEO updates
Sign up & we'll plant a tree
A curated roundup of only the most important and inspiring technical SEO updates in your inbox each month.