How to Crawl JavaScript Websites

Patrick Hathaway
Published 03 July 2017

Crawling websites is not quite as straightforward as it was a few years ago, and this is mainly due to the rise in usage of JavaScript frameworks, such as Angular, React and Meteor.

Traditionally, a crawler would work by extracting data from static HTML code, and up until recently, most websites you would encounter could be crawled in this manner.

However, if you try to crawl a website built with Angular in this manner, you won't get very far (literally). In order to 'see' the HTML of a web page (and the content and links within it), the crawler needs to process all the code on the page and actually render the content.

Google handles this in a 2-phase approach. Initially they crawl and index based on the static HTML (the 'first wave' of indexing). Then, once they have resources available in order to render the page, they perform a second wave of indexing based on the rendered HTML.

Two Phase Indexing

Trying to Crawl a JavaScript Website Without Rendering

We're first going to investigate what this first wave of indexing would look like for a website built in a JavaScript framework.

A friend of mine runs Bailey of Sheffield, a website built in Backbone, and it provides a great example of what's going on.

Consider the product page for its most popular product. Using Chrome's 'Inspect' option we can see the h1 on the page:

See h1 inspect element

However, if we just view-source on this page, there is no h1 in sight:

View Source no h1

Scrolling further down the view-source page will just show you a bunch of scripts and some fallback text. You simply can't see the meat and bones of the page - the product images, description, technical spec, video, and most importantly, links to other pages.

So if you tried to crawl this website in the traditional manner (using the 'HTML Crawler'), all of the data a crawler would usually extract is essentially invisible to it. And this is what you would get:

One page.

If Google came in for their first wave of indexing, all they would find is this one page. And one page with very little on it, at that.
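To make this concrete, here is a minimal sketch of what an HTML-only crawler has to work with. It uses Python's standard html.parser module, and both HTML snippets are hypothetical stand-ins (not the real source of any website): one is a JavaScript 'shell' like the view-source above, the other is the same kind of page with its content already in the HTML.

```python
from html.parser import HTMLParser


class StaticExtractor(HTMLParser):
    """Collects h1 text and link targets from raw HTML, without running scripts."""

    def __init__(self):
        super().__init__()
        self.h1_texts = []
        self.links = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.h1_texts.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.h1_texts[-1] += data


def extract_static(html):
    """Return the h1 headings and links visible in the unrendered HTML."""
    parser = StaticExtractor()
    parser.feed(html)
    parser.close()
    return {"h1": [t.strip() for t in parser.h1_texts], "links": parser.links}


# Hypothetical view-source of a JavaScript-rendered page:
# an empty mount point, some fallback text and a script.
JS_SHELL = """
<html><head><title>Shop</title></head>
<body>
  <div id="app"></div>
  <noscript>Please enable JavaScript to view this site.</noscript>
  <script src="/static/app.js"></script>
</body></html>
"""

# Hypothetical page whose content is already present in the HTML.
STATIC_PAGE = """
<html><body>
  <h1>Example Product</h1>
  <a href="/products/">All products</a>
  <a href="/about/">About</a>
</body></html>
"""

shell_result = extract_static(JS_SHELL)
static_result = extract_static(STATIC_PAGE)
print(shell_result)
print(static_result)
```

With no h1 and no links to follow, a crawler working from the first snippet has nowhere to go after the first page.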

So websites like this need to be handled differently. Instead of simply downloading and parsing an HTML file, the crawler essentially needs to build up the page as a browser would do for a normal user, letting all the content get rendered and firing all the JavaScript to bring in dynamic content.

In essence, the crawler needs to pretend to be a browser, let all the content load, and only then go and get the HTML to parse.

This is why you need a modern crawler, such as Sitebulb, set in Chrome Crawler mode, to crawl websites like this. In other words: a JavaScript crawler.

How to Crawl JavaScript Websites with Sitebulb

Every time you set up a new Project in Sitebulb, you need to choose the Analysis settings, such as checking for AMP or calculating page speed scores.

The default crawler setting is the HTML Crawler, so you need to use the dropdown to select the Chrome Crawler. In some cases, Sitebulb will detect that the site is using a JavaScript framework, and will warn you to use the Chrome Crawler (and it will pre-select it for you, like the image below).

Chrome Crawler

In these cases, a new settings option will appear, the 'Render Timeout.'

Most people probably won't know what the Render Timeout refers to or how to set it. If you'd prefer not to know, skip the section below and just leave it at the recommended 5 seconds. Otherwise, read on.

What is this Render Timeout?

The render timeout is essentially how long Sitebulb will wait for rendering to complete before taking an 'HTML snapshot' of each web page.

Justin Briggs published a post which is an excellent primer on handling JavaScript content for SEO, which will help us explain where the Render Timeout fits in.

I strongly advise you go and read the whole post, but at the very least, the screenshot below shows the sequence of events that occur when a browser requests a page that is dependent on JavaScript-rendered content:

Sequence of Events

The 'Render Timeout' period used by Sitebulb starts just after #1, the Initial Request. So essentially, the render timeout is the time you need to wait for everything to load and render on the page. Say you have the Render Timeout set to 4 seconds: this means that each page has 4 seconds for all the content to finish loading and any final changes to take effect.

Anything that changes after these 4 seconds will not be captured and recorded by Sitebulb.
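The effect can be illustrated with a toy simulation. This is only a sketch of the idea, not how Sitebulb is implemented: the function name, the timings and the URLs are all made up for illustration. Each event records how long after the initial request a link is added to the page, and the snapshot only contains what has arrived by the time the timeout expires.

```python
def snapshot_links(render_events, render_timeout_s):
    """Return the links present on the page when the HTML snapshot is taken.

    render_events is a list of (seconds_after_initial_request, link) pairs,
    each describing when JavaScript adds a link to the page.
    """
    return [link for at, link in render_events if at <= render_timeout_s]


# Hypothetical page: the timings and URLs are invented for illustration.
EVENTS = [
    (0.3, "/"),
    (0.7, "/products/"),
    (2.4, "/products/example-product/"),
    (4.2, "/blog/"),
]

for timeout in (1, 4, 5):
    print(timeout, snapshot_links(EVENTS, timeout))
```

In this made-up case, the link that appears at 4.2 seconds is missed with a 4 second timeout and captured with a 5 second one.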

Render Timeout Example

I'll demonstrate with an example, using our friends at Bailey of Sheffield again. If I crawl the site with no render timeout at all, I get a total of 30 URLs. If I use the 5 second timeout, I get 51 URLs, which is 70% more.

Different Crawl Render Timeouts

(The Audit with 1 URL crawled, if you recall, was from crawling with the HTML Crawler).

Digging into a little more detail about these two Chrome Crawls, there were 14 more Internal HTML URLs found with the 5 second timeout. This means that, in the Audit with no render timeout, the content which contains links to those URLs had not been loaded when Sitebulb took the snapshot.

Clearly, this can have a profound impact upon your understanding of the website and its architecture, which can be highlighted by comparing the two crawl maps:

Crawl Map Comparison

In this instance, it was very important to set the Render Timeout in order for Sitebulb to see all of the content.

Understanding why the Render Timeout exists does not actually help us decide what to set it at. We have scoured the web for confirmation from Google about how long they wait for content to load, but we haven't found it anywhere.

What we did find however was that most people seem to concur that 5 seconds is generally considered to be 'about right.' Until we see anything concrete from Google, or have a chance to perform some more tests of our own, we'll be recommending 5 seconds for the Render Timeout.

But all this will show you is an approximation of what Google may be seeing. If you want to crawl ALL the content on your site, then you'll need to develop a better understanding of how the content on your website actually renders.

To do this, we'll return to Chrome's DevTools. Right click on the page and hit 'Inspect', then select the 'Network' tab, and then reload the page. I've positioned the dock to the right of my screen to demonstrate:

Record Network Activity

Keep your eye on the waterfall graph that builds, and the timings that are recorded in the summary bar at the bottom:

Load Timing

So we have 3 times recorded here:

  • DOMContentLoaded: 727 ms (= 0.727 s)
  • Load: 2.42 s
  • Finish: 4.24 s

You can find the definitions for 'DOMContentLoaded' and 'Load' from the image above that I took from Justin Briggs' post. The 'Finish' time is exactly that, when the content is fully rendered and any changes or asynchronous scripts have completed.

If the website content depends on JavaScript changes, then you really need to wait for the 'Finish' time, so use this as a rule of thumb for determining the render timeout.

Bear in mind that so far we've only looked at a single page. To develop a better picture of what's going on, you'd need to check a number of pages/page templates and check the timings for each one.
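Once you have collected 'Finish' times for a handful of templates, turning them into a timeout is simple arithmetic. The sketch below takes the slowest time, adds a little headroom and rounds up to a whole second. The function name and the 0.5 second headroom are my own assumptions, and apart from the 4.24 s measured above, the timings are hypothetical.

```python
import math


def suggest_render_timeout(finish_times_s, headroom_s=0.5):
    """Slowest observed Finish time plus headroom, rounded up to whole seconds."""
    if not finish_times_s:
        raise ValueError("need at least one measured Finish time")
    return math.ceil(max(finish_times_s.values()) + headroom_s)


# 'Finish' times in seconds, keyed by page template.
# Only 4.24 comes from the measurement above; the others are hypothetical.
measured = {
    "page measured above": 4.24,
    "category page": 3.10,
    "homepage": 2.80,
}

print(suggest_render_timeout(measured))
```

Taking the maximum rather than the average is deliberate: the aim is for the slowest template to finish rendering before the snapshot is taken.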

If you are going to be crawling with the Chrome Crawler, we urge you to experiment further with the render timeout so you can set your Projects up to correctly crawl all your content every time.

Side Effects of Crawling with JavaScript

Almost every website you will ever see uses JavaScript to some degree - interactive elements, pop-ups, analytics codes, dynamic page elements... all controlled by JavaScript.

However, most websites do not employ JavaScript to dynamically alter the majority of the content on a given web page. For websites like this, there is no real benefit in crawling with JavaScript enabled. In fact, in terms of reporting, there is literally no difference at all:

Chrome HTML Crawler

And there are actually a couple of downsides to crawling with the Chrome Crawler, for example:

  1. Crawling with the Chrome Crawler means you need to fetch and render every single page resource (JavaScript, Images, CSS, etc...) - which is more resource intensive for both your local machine that runs Sitebulb, and the server that the website is hosted on.
  2. As a direct result of #1 above, crawling with the Chrome Crawler is slower than with the HTML Crawler, particularly if you have set a long Render Timeout. On some sites, and with some settings, it can end up taking 6-10x longer to complete.

So, unless you need to crawl with the Chrome Crawler because the website uses a JavaScript framework, or because you specifically want to see how the website responds to a JavaScript crawler, it makes sense to crawl with the HTML Crawler by default.

How to Detect JavaScript Websites

I've used the phrase 'JavaScript Websites' for brevity, where what I actually mean is 'websites that depend on JavaScript-rendered content.'

The websites of this type that you come across will most likely be using one of the increasingly popular JavaScript frameworks, such as:

  • Angular
  • React
  • Ember
  • Backbone
  • Vue
  • Meteor

If you are dealing with a website running one of these frameworks, it is important that you understand as soon as possible that you are dealing with a website that is fundamentally different from a non-JavaScript website.
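One rough way to get an early hint is to look for framework 'fingerprints' in the page source. The sketch below is only a heuristic under stated assumptions: the patterns are common markers I am aware of (for example the ng-app attribute used by AngularJS, or the __meteor_runtime_config__ variable that Meteor writes into the page), they are neither official nor exhaustive, and a site can use any of these frameworks without showing them. The function name and sample HTML are hypothetical.

```python
import re

# Heuristic markers only: not an official or complete list, and
# bundled or minified sites may not expose any of them.
FINGERPRINTS = {
    "Angular": [r"\bng-app\b", r"\bng-version="],
    "React": [r"\bdata-reactroot\b", r"\breact(\.min)?\.js"],
    "Ember": [r"\bember-application\b", r"\bember(\.min)?\.js"],
    "Backbone": [r"\bbackbone(-min|\.min)?\.js"],
    "Vue": [r"\bdata-v-[0-9a-f]+", r"\bvue(\.min)?\.js"],
    "Meteor": [r"__meteor_runtime_config__"],
}


def guess_frameworks(html):
    """Return the names of frameworks whose markers appear in the HTML source."""
    return sorted(
        name
        for name, patterns in FINGERPRINTS.items()
        if any(re.search(p, html, re.IGNORECASE) for p in patterns)
    )


# Hypothetical page sources for illustration.
print(guess_frameworks('<html ng-app="shop"><body></body></html>'))
print(guess_frameworks('<script src="/js/backbone-min.js"></script>'))
print(guess_frameworks("<html><body><h1>Hi</h1></body></html>"))
```

An empty result does not prove the site is free of JavaScript-rendered content, so treat this as a prompt for the checks below rather than a replacement for them.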

Client Briefing

This is obviously the first port of call: a thorough briefing with the client or their dev team can save you time on discovery work.

However, whilst it is nice to think that every client briefing would give you this sort of information up front, I know from painful experience that they are not always forthcoming with seemingly obvious details...

Trying a Crawl

Ploughing head first into an Audit with the HTML Crawler is actually not going to cost you too much time, since even the most 'niche' websites have more than a single URL. So if the crawl comes back with just one URL, something is up.

Crawled 1 URL

Whilst this would not mean that you're definitely dealing with a JavaScript website, it would be a pretty good indicator.

It is certainly worth bearing in mind though, in case you are a set-it-and-forget-it type, or you tend to leave Sitebulb on overnight with a queue of websites to Audit... by the morning you'd be bitterly disappointed.

Manual Inspection

You can also use Google's tools to help you understand how a website is put together. Using Google Chrome, right click anywhere on a web page and choose 'Inspect' to bring up Chrome's DevTools Console.

Then hit F1 to bring up the Settings. Scroll down to find the Debugger, and tick 'Disable JavaScript.'

Disable JavaScript in Chrome

Then, leave the DevTools Console open and refresh the page. Does the content stay exactly the same, or does it all disappear?

This is what happens in my Bailey of Sheffield example:

Bailey of Sheffield no JavaScript

Notice anything missing?

While this is a pretty obvious example of a website not working with JavaScript disabled, it is also worth bearing in mind that some websites load only a portion of the content in with JavaScript (e.g. an image gallery), so it is often worth checking a number of page templates like this.
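If you save the HTML of a page twice, once with JavaScript disabled and once after it has fully rendered, the same comparison can be made programmatically. The following is a minimal sketch using Python's standard html.parser module; the class and function names are my own, and the two HTML snippets are hypothetical examples rather than the source of a real page.

```python
from html.parser import HTMLParser


class LinkAndTextCollector(HTMLParser):
    """Collects link targets and counts visible words, ignoring script/style."""

    def __init__(self):
        super().__init__()
        self.links = set()
        self.words = 0
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words += len(data.split())


def summarise(html):
    collector = LinkAndTextCollector()
    collector.feed(html)
    collector.close()
    return collector.links, collector.words


def javascript_dependence(html_without_js, html_with_js):
    """Report which links and how much text only exist after rendering."""
    links_off, words_off = summarise(html_without_js)
    links_on, words_on = summarise(html_with_js)
    return {
        "links_only_with_js": sorted(links_on - links_off),
        "words_without_js": words_off,
        "words_with_js": words_on,
    }


# Hypothetical snapshots of one page template.
WITHOUT_JS = """
<body>
  <div id="app"></div>
  <script>boot()</script>
</body>
"""

WITH_JS = """
<body>
  <div id="app">
    <h1>Example Product</h1>
    <p>Stainless steel bracelet</p>
    <a href="/products/">All products</a>
  </div>
  <script>boot()</script>
</body>
"""

report = javascript_dependence(WITHOUT_JS, WITH_JS)
print(report)
```

Running this over one page per template gives a quick picture of which templates depend on JavaScript entirely and which only load a portion of their content that way.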

Further Reading

JavaScript SEO is still relatively new and undocumented. However, we have put together a list of all the best resources for learning about JavaScript SEO, including guides, experiments and videos. We'll keep the resource post up to date with new publications and developments.
