How to Crawl JavaScript Websites

Crawling websites in 2017 is not quite as straightforward as it was a few years ago, and this is mainly due to the rise in usage of JavaScript frameworks, such as Angular, React and Meteor.

Traditionally, a crawler would work by extracting data from static HTML code, and up until recently, most websites you would encounter could be crawled in this manner.

However, if you try to crawl a website built with Angular in this manner, you won't get very far (literally). In order to 'see' the HTML of a web page (and the content and links within it), the crawler needs to process all the code on the page and actually render the content.

Trying to Crawl a JavaScript Website Without Rendering

A friend of mine runs a website built in Backbone, and it provides a great example of what's going on.

Consider the product page for their most popular product. With Chrome -> Inspect we can see the h1 on the page:

See h1 inspect element

However, if we just view-source on this page, there is no h1 in sight:

View Source no h1

Scrolling further down the view-source page will just show you a bunch of scripts and some fallback text. You simply can't see the meat and bones of the page - the product images, description, technical spec, video, and most importantly, links to other pages.

So if you tried to crawl this website in the traditional manner, all of the data a crawler would usually extract is essentially invisible to it. And this is what you would get:

One page. And one page with very little on it, at that.
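To make this concrete, here is a minimal sketch of what a 'traditional' crawl does, in Node.js. The URL is just a placeholder, and I'm assuming the 'cheerio' package for parsing (plus Node 18+ for the global fetch); the point is simply that the raw HTML is fetched and parsed without any JavaScript being executed:

```javascript
// A minimal sketch of a 'traditional' crawl: fetch the raw HTML and parse
// it without executing any JavaScript. The URL is a placeholder, and the
// 'cheerio' package (plus Node 18+ for the global fetch) is assumed.
import * as cheerio from 'cheerio';

const response = await fetch('https://example.com/product-page');
const html = await response.text(); // raw HTML, nothing rendered
const $ = cheerio.load(html);

// On a JavaScript-dependent page, both of these come back (nearly) empty,
// so the crawler has no content to index and no links to follow.
console.log('h1 elements found:', $('h1').length);
console.log('links found:', $('a[href]').length);
```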

So websites like this need to be handled differently. Instead of simply downloading and parsing an HTML file, the crawler essentially needs to build up the page as a browser would for a normal user, letting all the content render and firing all the JavaScript to bring in dynamic content.

In essence, the crawler needs to pretend to be a browser, let all the content load, and only then go and get the HTML to parse.
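For illustration, here is roughly what that looks like using Puppeteer as a stand-in headless browser (Sitebulb uses its own rendering engine; this sketch just shows the principle, and the URL is again a placeholder):

```javascript
// A minimal sketch of the rendered approach, using the 'puppeteer' package
// purely as an illustration (Sitebulb uses its own renderer). The
// 'networkidle0' option waits until the network goes quiet, giving the
// JavaScript a chance to fetch and render the content.
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/product-page', { waitUntil: 'networkidle0' });

// Only now take the 'HTML snapshot' - the h1, product content and links
// are present because the page was actually rendered.
const html = await page.content();
const links = await page.$$eval('a[href]', (anchors) => anchors.map((a) => a.href));
console.log('h1 present:', html.includes('<h1'), '| links found:', links.length);

await browser.close();
```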

This is why you need a modern crawler, such as Sitebulb, set in JavaScript crawling mode, to crawl websites like this.

How to Crawl JavaScript Websites with Sitebulb

Every time you set up a new Project in Sitebulb, you need to choose the Analysis settings, such as checking for AMP or calculating page speed scores.

The default crawler setting is the Non-JavaScript Crawler, so you need to use the dropdown to select the JavaScript Crawler.

This will change the crawler type to 'JavaScript Crawler', and a new settings option will appear, the 'Render Timeout.'

Setting Render Timeout

Most people probably won't know what the Render Timeout refers to or how to set it. If you'd prefer not to know, skip the section below and just leave it at the recommended 5 seconds. Otherwise, read on.

What is this Render Timeout?

The render timeout is essentially how long Sitebulb will wait for rendering to complete before taking an 'HTML snapshot' of each web page.

Justin Briggs published a post last year that is an excellent primer on handling JavaScript content for SEO, and it will help us explain where the Render Timeout fits in.

I strongly advise you go and read the whole post, but at the very least, the screenshot below shows the sequence of events that occur when a browser requests a page that is dependent upon JavaScript-rendered content:

Sequence of Events

The 'Render Timeout' period used by Sitebulb starts just after #1, the Initial Request. So essentially, the render timeout is the time you allow for everything to load and render on the page. Say you have the Render Timeout set to 4 seconds: this means that each page has 4 seconds for all the content to finish loading and any final changes to take effect.

Anything that changes after these 4 seconds will not be captured and recorded by Sitebulb.
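In code terms, a fixed Render Timeout amounts to something like the sketch below (Puppeteer again, purely for illustration, not Sitebulb's actual implementation):

```javascript
// A sketch of what a fixed Render Timeout amounts to: navigate, wait a
// fixed number of seconds, then take the snapshot. Anything that renders
// after the wait simply isn't captured.
import puppeteer from 'puppeteer';

const RENDER_TIMEOUT_MS = 5000; // the recommended 5 seconds

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/product-page'); // #1: the initial request
await new Promise((resolve) => setTimeout(resolve, RENDER_TIMEOUT_MS)); // render window
const snapshot = await page.content(); // the 'HTML snapshot' that gets parsed
await browser.close();
```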

Render Timeout Example

I'll demonstrate with an example, from our friends again at Bailey of Sheffield. If I crawl the site with no render timeout at all, I get a total of 30 URLs. If I use the 5 second timeout, I get 51 URLs, 70% more.

Different Crawl Render Timeouts

(The Audit with 1 URL crawled, if you recall, was from crawling with the non-JavaScript crawler).

Digging into a little more detail about these two JavaScript crawls, there were 14 more Internal HTML URLs found with the 5 second timeout. This means that, in the Audit with no render timeout, the content which contains links to those URLs had not been loaded when Sitebulb took the snapshot.

Clearly, this can have a profound impact upon your understanding of the website and its architecture, which can be highlighted by comparing the two crawl maps:

Crawl Map Comparison

In this instance, it was very important to set the Render Timeout in order for Sitebulb to see all of the content.

Recommended Render Timeout

Understanding what the Render Timeout is there for does not actually help us decide what to set it to. We have scoured the web for confirmation from Google about how long they wait for content to load, but we haven't found it anywhere.

What we did find, however, is that most people seem to concur that 5 seconds is 'about right.' Until we see anything concrete from Google, or have a chance to perform some more tests of our own, we'll be recommending 5 seconds for the Render Timeout.

But all this will show you is an approximation of what Google may be seeing. If you want to crawl ALL the content on your site, then you'll need to develop a better understanding of how the content on your website actually renders.

To do this, we'll return to Chrome's DevTools Console. Right click on the page and hit 'Inspect', then select 'Network' from the tabs in the Console, and then reload the page. I've repositioned the dock to the right of my screen to demonstrate:

Record Network Activity

Keep your eye on the waterfall graph that builds, and the timings that are recorded in the summary bar at the bottom:

Load Timing

So we have 3 times recorded here:

  • DOMContentLoaded: 727 ms (= 0.727 s)
  • Load: 2.42 s
  • Finish: 4.24 s

You can find the definitions for 'DOMContentLoaded' and 'Load' in the image above, which I took from Justin Briggs' post. The 'Finish' time is exactly that: when the content is fully rendered and any changes or asynchronous scripts have completed.

If the website content depends on JavaScript changes, then you really need to wait for the 'Finish' time, so use this as a rule of thumb for determining the render timeout. 
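If you prefer numbers to eyeballing the waterfall, you can read the first two milestones straight from the browser's Performance API. 'Finish' has no formal equivalent in that API, so the snippet below approximates it as the moment the last network resource finished downloading:

```javascript
// Paste into the DevTools console after the page has loaded. 'Finish' is
// not a formal web performance metric, so it is approximated here from the
// resource timing entries.
const nav = performance.getEntriesByType('navigation')[0];
console.log('DOMContentLoaded:', Math.round(nav.domContentLoadedEventEnd), 'ms');
console.log('Load:', Math.round(nav.loadEventEnd), 'ms');

const finish = Math.max(
  0,
  ...performance.getEntriesByType('resource').map((r) => r.responseEnd)
);
console.log('Approx. Finish:', Math.round(finish), 'ms');
```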

Bear in mind that so far we've only looked at a single page. To develop a better picture of what's going on, you'd need to check a number of pages and check the timings for each one. 
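Here is a rough way to automate that across several templates. The URLs are placeholders, and Puppeteer is once again just an illustrative headless browser:

```javascript
// A sketch that repeats the timing check across several page templates,
// so the Render Timeout is based on more than one page.
import puppeteer from 'puppeteer';

const urls = [
  'https://example.com/',
  'https://example.com/product-page',
  'https://example.com/blog',
];

const browser = await puppeteer.launch();
const page = await browser.newPage();

for (const url of urls) {
  await page.goto(url, { waitUntil: 'networkidle0' });
  const timings = await page.evaluate(() => {
    const nav = performance.getEntriesByType('navigation')[0];
    const resources = performance.getEntriesByType('resource');
    return {
      domContentLoaded: Math.round(nav.domContentLoadedEventEnd),
      load: Math.round(nav.loadEventEnd),
      approxFinish: Math.round(Math.max(0, ...resources.map((r) => r.responseEnd))),
    };
  });
  console.log(url, timings);
}

await browser.close();
```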

If you are going to be crawling with JavaScript, we urge you to experiment further with the render timeout so you can set your Projects up to correctly crawl all your content every time.

Side Effects of Crawling with JavaScript

Almost every website you will ever see uses JavaScript to some degree - interactive elements, pop-ups, analytics codes, dynamic page elements... all controlled by JavaScript.

However, most websites do not employ JavaScript to dynamically alter the majority of the content on a given web page. For websites like this, there is no real benefit in crawling with JavaScript enabled. In fact, in terms of reporting, there is literally no difference at all:

Crawling JavaScript vs Non-JavaScript

And there are actually a couple of downsides to crawling with JavaScript, for example:

  1. Crawling with JavaScript is typically slower than without, particularly if you have set a long Render Timeout. On some sites, and with some settings, it can end up taking 6-10X longer to complete.
  2. Crawling with JavaScript means you need to fetch and render every single page resource (JavaScript, images, CSS, etc.), which is more resource intensive both for your local machine that runs Sitebulb and for the server the website is hosted on.

So, unless you need to crawl with JavaScript because the website uses a JavaScript framework, or because you specifically want to see how the website responds to a JavaScript crawler, it makes sense to crawl without JavaScript by default.

How to Detect JavaScript Websites

I've used the phrase 'JavaScript Websites' for brevity, where what I actually mean is 'websites that depend on JavaScript-rendered content.'

It is most likely that the type of websites you come across will be using one of the increasingly popular JavaScript frameworks, such as:

  • Angular
  • React
  • Ember
  • Backbone
  • Vue
  • Meteor

If you are dealing with a website running one of these frameworks, it is important that you understand as soon as possible that you are dealing with a website that is fundamentally different from a non-JavaScript website.
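Before we get to the more reliable methods below, one quick heuristic: many of these frameworks leave behind a recognisable global variable or DOM marker. This is only indicative (production builds can rename or omit these entirely), but it takes seconds to check:

```javascript
// A rough heuristic, pasted into the DevTools console. Treat the results
// as indicative only - builds can strip or rename these markers.
console.table({
  Angular: !!(window.angular || document.querySelector('[ng-version]')),
  React: !!(window.React || document.querySelector('[data-reactroot]')),
  Ember: !!window.Ember,
  Backbone: !!window.Backbone,
  Vue: !!window.Vue,
  Meteor: !!window.Meteor,
});
```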

Client Briefing

Obviously the first port of call: a thorough briefing with the client or their dev team can save you a lot of discovery work.

However, whilst it is nice to think that every client briefing would give you this sort of information up front, I know from painful experience that they are not always forthcoming with seemingly obvious details...

Trying a Crawl

Ploughing head first into an Audit with the Non-JavaScript Crawler is actually not going to cost you too much time: if the crawl comes back with just a single URL, you'll know almost immediately that something is up, since even the most 'niche' websites have more than a single URL.

Crawled 1 URL

Whilst this would not mean that you're definitely dealing with a JavaScript website, it would be a pretty good indicator.

It is certainly worth bearing in mind though, in case you are a set-it-and-forget-it type, or you tend to leave Sitebulb on overnight with a queue of websites to Audit... by the morning you'd be bitterly disappointed.

Manual Inspection

You can also use Google's tools to help you understand how a website is put together. Using Google Chrome, right click anywhere on a web page and choose 'Inspect' to bring up Chrome's DevTools Console.

Then hit F1 to bring up the Settings. Scroll down to find the Debugger, and tick 'Disable JavaScript.'

Disable JavaScript in Chrome

Then, leave the DevTools Console open and refresh the page. Does the content stay exactly the same, or does it all disappear?

This is what happens in my Bailey of Sheffield example:

Bailey of Sheffield no JavaScript

Notice anything missing?

While this is a pretty obvious example of a website not working with JavaScript disabled, it is also worth bearing in mind that some websites load only a portion of the content in with JavaScript (e.g. an image gallery), so it is often worth checking a number of page templates like this.
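If you want to repeat this check across a number of templates without clicking through DevTools each time, here is a sketch that automates it. Puppeteer's setJavaScriptEnabled() mirrors the DevTools setting described above; the URL is a placeholder:

```javascript
// A sketch that automates the manual check: load the page with JavaScript
// disabled, then enabled, and compare how much visible text survives.
// A large gap suggests the page depends on JavaScript-rendered content.
import puppeteer from 'puppeteer';

const url = 'https://example.com/product-page'; // placeholder

const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.setJavaScriptEnabled(false);
await page.goto(url);
const withoutJs = (await page.$eval('body', (b) => b.innerText)).length;

await page.setJavaScriptEnabled(true);
await page.goto(url, { waitUntil: 'networkidle0' });
const withJs = (await page.$eval('body', (b) => b.innerText)).length;

console.log({ url, withoutJs, withJs });
await browser.close();
```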

What is Google (Probably) Doing?

As we have already noted, it is more resource-intensive to crawl with JavaScript enabled all the time, and in most cases, is not even necessary.

Similarly, crawling with JavaScript is a lot slower than without, particularly when you factor in a Render Timeout.

If we know anything about Google, it's that they like to optimize the shit out of whatever they are doing. It seems inconceivable that they would be crawling with JavaScript switched on by default all of the time.

I'd guess it is much more likely that they are performing test crawls (to compare against HTML results) and using various other indicators to determine if there is a need to crawl with JavaScript, and weighing this with factors such as page importance and freshness. To help develop your own opinions on this, I'd start by reading Will Critchlow's excellent Moz post 'Evidence of the Surprising State of JavaScript Indexing.'

Regardless of what Google are actually doing, it is very likely that they are putting effort into determining if it is worthwhile to crawl a given website/webpage with JavaScript enabled.

Our recommendation is that you do the same.

Further Reading

JavaScript SEO is still relatively new and undocumented, however we have put together a list of all the best resources for learning about JavaScript SEO, including guides, experiments and videos. We'll keep the resource post up to date with new publications and developments.

