Sitebulb Release Notes

Transparent Release Notes for every Sitebulb update. Critically acclaimed by some people on Twitter. May contain NUTS.

Sitebulb Latest Version: 4.4

Version 4.4

Released on 23rd October 2020

Updates

#1 Sitebulb now detects and reports on JavaScript links

This is what happens when two powerhouses of the international SEO scene get together and basically challenge Gareth to figure something out. Thanks Cindy for the awesome idea, and Arnout for the assist. 

Arnout Cindy

Sidenote: we should all reluctantly credit Gareth for actually doing the work.

So what the bloody hell is Cindy banging on about, and what fix have you built, I hear no one asking. Well, this is all to do with webpages that alter the HTML during rendering, and specifically ones that change links.

I know what you're thinking - don't mess with my links.

Mercifully, most sites don't do this, but when you find a site that does do this, there's literally no way to find links that have been altered by JavaScript, at scale. Until now.

When you crawl with the Chrome Crawler, Sitebulb will render the DOM and parse the rendered HTML, as normal. Now, it will also go and grab the source HTML (as in, before JavaScript might have changed them) and compare all the links between the two.

This allows Sitebulb to identify all links that have been affected by JavaScript, and surface these via a new column in the Link Explorer:

JavaScript Links in link explorer

The options we have for this column are as follows:

  • Created - the link was not found in the response HTML, so it appears that JavaScript created it.
  • Modified - the link was found in the response HTML, however JavaScript has modified either the anchor text or the href URL.
  • No - not added or altered by JavaScript at all, move along.  

So the Link Explorer allows you to interrogate the data en masse, and if you want to zone in on particular URLs or particular links, you can also see the data listed on the URL Details page. For example:

URL Details View

Independent of any audit, if you want to evaluate a single page, this data is also available in the Single Page Analyser:

Single Page Analyser - JavaScript links

#2 Two new redirect Hints added 

As a result of more gentle nudging (no, that's not a euphemism. Wash your mind out!) from Cindy Krum, we added a couple more redirect hints:

You know the drill. That twat developer builds a new footer, and includes a link without a trailing slash, or with a random capital letter. The server automatically redirects, so suddenly you have redirected links on every page on the site.

These 2 new Hints just make it easier to isolate these specific redirects issue, which, if widespread (i.e. template based), can be an easy fix that makes a big difference.

Bugs

  • A few things looked a bit shit in James Bond mode. In one case, the titles on a report were completely invisible! That's what happening when you put black text on a black background, it turns out.
  • We added support for crawling subdomains back in v4.0, but did not think through how self-referencing canonicals would be handled on said subdomain URLs. Sitebulb was claiming they were canonicalized to an external URL! Doh!
  • Users were seeing an error "SitemapIndex or UrlSET Nodes Missing" which was due to the sitemap URL taking a long bloody time. We increased the default timeout for XML sitemaps. Page speed is a thing, people.
  • When you deleted an audit from a project, the project size was not actually updating in the tool. Sehr frustrating.
  • The 'Images missing alt text' Hint was itself going missing, when you re-audited a project from pre-4.3. We looked into it, and it turns out the Hint was just misbehaving. It has been rumoured among the other Hints that it's been bragging about being 'so meta'. What a punk.
  • Sitebulb's clever 'entity merging' improvement from v4.1 was a little bit too clever at times, and ended up over-reporting errors. We made it a bit dumber again, lest it get ideas above its station.
  • Cleaned up a few more issues being caused by Sitebulb failing to handle x-default properly on sites with hreflang. It was screwing up a few data points, and the fix was boring and complicated (users who reported the error will be contacted individually).
  • If you changed your default save location (a la 'Managing file space for your Projects') and then imported an audit, it would not save to the new location, but still the old one. It works properly now.

Version 4.3

Released on 23rd September 2020

Ladies and gentlemen, I give you Gareth 'I'm just going to do a couple of bugs' Brown. 25 tickets later...

Fortunately, most of them related to single site issues only, so you won't find them reported here - but folks that took the time to report the issues, expect an email in your inbox shortly.

Updates

  • We have rebuilt a number of the images tables, including fans favourite 'Images missing alt text'.
  • This one was my idea. In the Structured Data report, we made the Warnings button orange instead of red, because orange universally means 'not so bad' whereas 'red' is understood to mean 'really really bad.' Some pundits have commented that this makes the tool 'exponentially better' (yes, this was me). 
  • We actually have Sir Dom of Disney to thank for this next improvement: Scroll to top of the list when clicking between paginated pages in a URL List. Yes, I think I agree with you dear reader, marginal at best. It's no orange button!
  • We removed the CSS linter, as it was doing more harm than good in terms of performance. As a result we had to remove all CSS based Hints from the Front-end checks. No bugger looks at those things anyway!
  • Updated the Schema.org validation model to version 10, in line with their recent update.
  • We added better error handling to the standalone structured data tool, so if a page 404s, it will tell you that it 404s rather than 'we didn't find any structured data'.

Bugs

  • Fixed a bug which meant the Code Coverage 'View Wastage' button was literally not working at all (just spinny spinner of death). Only 1 person reported this problem though... do you guys even optimize?
  • Last update (v4.2) we added file size to all Projects and Audits, to make data management easier on your machine. Some folks were however seeing this displayed as 'N/A', which is about as useful as see-through wrapping paper.
  • I wrote a delightful documentation article entitled 'Managing file space for your Projects' (it's a doozy), in which I heartily encouraged y'all to delete Projects or Audits you no longer needed. Post-publish, we *ahem* realised that deleting in this manner actually did NOT clear up the space on your hard drive. So if you followed my sage instructions... you might wanna do it again.
  • This Hint was misfiring: 'Style sheets are loaded after JavaScript resources in the <head>'. If I don't elaborate on this any further, I'm sure you won't just assume that we were searching for the style sheets before and not after the JS. So I shan't elaborate further.
  • Fixed a bug with one of the Structured Data pie charts, which, when you clicked through to the URL List, did not show the blue Structured Data button to jump straight into the single URL details. Inexcusable.
  • Fixed an annoying UI bug which meant you could only add 50 columns at a time to URL Lists, with the Add/Remove columns function. Now you can add ALL THE COLUMNS! as intended.
  • Improved parsing of microdata for deep nested properties, which was causing Sitebulb to mis-report some structured data (don't worry, edge case stuff).

Version 4.2

Released on 16th September 2020

This is the sort of update that newer Sitebulb users will be like 'meh' but older Sitebulb users will be like 'ohmygodthisisgreat.'

New users, take a walk. Old users, enjoy...

Ways and means to manage all the data that Sitebulb creates

Website audits create a LOT of data. Especially big ones.

And especially when you've kept every single audit, for every single client, for 3 years - without ever even thinking about going back and deleting some old audits. Sound familiar? Are you me?

This update is focused on solutions to the 'ain't got no hard drive space left' conundrum. I have actually written a thrilling chronicle on this self-same topic in our documentation area, entitled Managing file space for your Projects.

Since I'm on strict orders NOT to swear in the docs, we'll recount the changes here as well, with a more relaxed vibe ('I'm basically a chilled out entertainer.').

#1 Project list now shows file size of the data

Your Project list now shows the data size of the Project, so you can easily scan down the list of Projects to find the biggest ones:

Project data size

Mind = blown, right?

So this is the easiest solution to the 'size' issue. Just go and delete the biggest fucking Projects and be done with this whole charade!

But ok, we get it, that client who fired you in 2018 is definitely going to come back from your competitor, they're bound to be receiving terrible service. It is written.

So it's super important you keep all your Projects. BUT do you need to keep all your Audits?

#2 Project page now shows file size for Audits

We've only gone and added the file size there too! Now you can see where all the data storage is coming from, and delete all the worst offenders. When it comes to <CLIENT WHO IS DEFINITELY COMING BACK SOON>, why not just keep the most recent audit, and delete all the rest?

Delete audits

Why not indeed? Because some people dislike making the right choice, and would rather just keep everything forever (See also: Trump supporters).

Ok, well for you lot, we've got something even more exciting...

#3 Change the default save location

If you think I've buried the lede, well, I apologise - I just like to create a bit of suspense.

People actually asked us to do this one, and everything.

Custom Folder to save audits

PREFACE: Sitebulb is a desktop tool, which means data is saved to your local machine (some people automatically assume Sitebulb saves data to the cloud, likely because it is so gosh-darn beautiful).

Here's the skinny - Sitebulb is set up to save all Audit data to the same relative folder, on every machine. This default save location uses this directory:

  • Windows: C:\Users\<USERNAME>\AppData\Local\Sitebulb
  • Mac: /Users/<USERNAME>/.local/share

If your main drive is becoming full and you are looking to save space, one option may be to save data to a different partitioned drive, or an external hard drive.

Also, you may wish to just do this anyway - nothing to do with saving space. You'd just rather have all your stuff in a particular place. This works for that philosophy also.

In order to change the default folder, you need to adjust the 'global settings' that govern how Sitebulb runs by default for every Project.

From the top navigation menu, choose the 'Settings' option.

Access Global Settings

From here, the 'Project Data' tab allows you to control where Sitebulb stores your audit data, on your computer.

Project Data

In order to change the default file location, simply click the blue button, choose the appropriate directory, then hit Save. Note that this does not move pre-existing Project/Audit data, but will become the new default location for all new Projects you add.

Change Directory

I expect there are literally handfuls of people getting all hot and bothered as they read this, anxious to try it out. Before you do, take a moment to consider the intentions of this setting and also the devices we have tested it on (with resounding success):

  • Different folders on the same hard drive
  • Shiny new SDD external hard drive
  • Shitty old external hard drive
  • USB flashdrive(!)
  • Dropbox

Is has not been tested on a network drive - and we do not recommend you use this on a network drive. Sitebulb has to perform many thousands of read/write actions in short order - if you subject it to your ponderous shared network drive, I expect you'll be sorry.

#4 Move existing Projects into a different directory

The option above allows you to change where Audit data is stored for new Projects. However, you may have a particularly large Project that is already stored in the original default save location, which, to reasons unknown to you, your client does not wish you to delete (let us never question their enduring wisdom).

In this situation, the best option might be to move the Project to a completely new location - such as an external hard drive. This way you can get it out of off your hard drive until such a point that your client realises they no longer need it (or realises they no longer need you...).

To make the magic happen, go to the Projects page and click the Move Project Data button:

Move Project Data

This will open an overlay window that allows you to choose a new directory for the Project data to be stored. Again, this can be anywhere that your computer has access to at the time - a partitioned drive, an external drive, or a 'smart workspace' like Dropbox.

Move Project choose directory

This may take a few minutes, so do not move away from the page. Just let Sitebulb do it's thing.

Now, if you are moving a particularly large audit, and/or moving it onto a particularly shitty piece of hardware, perhaps don't do this right before you need to use Sitebulb to produce another client audit that was actually due yesterday. If I see support requests of this nature, don't think I won't send a passive aggressive reply and post subtweets that are only meaningful to me.

Once your Project has been moved, you will see a notice on the Project page for the new save location of this audit.

Project new location

A byproduct of this process that is not particularly intuitive is that any new Audits you run within this Project will also save to the new location.

Unfortunately, I need to add a small addendum for the Mac users among you.

If you move a Project on the Mac, before you are able to open any of the Audits within that Project, you will need to restart Sitebulb, and then your Mac will ask permission when you try to open one of the Audits:

Macs are shit

And just so we're clear, this is only a minor inconvenience. I don't think it's fair to conclude that Macs are shit and should be immediately replaced with a (far superior) Windows machine.

To recap;

  • You can now easily see the data size for Projects and Audits - delete some damn data already!
  • You can now change the save location for all Audit data, moving forwards.
  • You can now move existing Projects into other locations, such as an external hard drive.
  • You should throw your Mac in the bin and buy a Windows machine instead.

Version 4.1

Released on 27th August 2020

Improved the Structured Data analysis

Since we launched v4 we've had tons of great feedback from you beautiful people, in particular regarding the structured data feature.

Nice tweet

But we've also had plenty of ideas and suggestions for how we can make it better.

Unlike normal, most of these were not completely shit ideas from people who want us to build a fix for their one specific use case that only ever affects one website, and is not even really for a client site it's actually just their mum's friend's site that they agreed to help out on and now massively regret it.

In particular, we received excellent advice from Dave Ojeda and Tony McCreath, who both took significant time to let us know how they think we can make the tool more complete. We sincerely thank you both.

The BIG thing that we've done revolves around unique entity identification - Sitebulb will now identify unique entities. And it will merge entities that use the same @id, into a single entity.

These are non-trivial concept, so I'll give you a quick bullet point for each, then explain in more detail below;

  1. Unique entity identification - Sitebulb now understands that 500 instances where an entity is referenced does not necessarily mean there are 500 unique entities. There could be 500 references to a single entity, for example. Sitebulb can now accurately pick this apart.
  2. Merge entities - Sitebulb will now stitch together references to the same unique entity, even if they use separate bits of markup on the same page.

#1 Unique entity identification
This will be easier to explain with an example. We'll use une pomme you know and love who's unafraid to step in... apple.com.

On a sample audit, this is what the old version would show you:

Organization on old version

~1,900 different 'organization' entities? That sounds like a lot of organizations... for a single organization like Apple.

This is what you see on the new version:

2 organization entities

Now we're getting somewhere. Just 2 organization entities, referenced on ~1900 instances.

Clicking the blue '2' lets us see these:

2 Apple Organizations

This opens up a filtered view of the Entity Explorer, and we can easily see the two different organization entities, identified by the two different @ids.

Aside: if you are unfamiliar with the @id property, this is what's known as a 'node identifier', and it can be used to reference specific schemas without needlessly repeating data. 

@id example

Note that if the @id property is NOT present, Sitebulb will still try to identify unique entities, it is just slightly less good at it.

Finally, note that for existing audits with Structured Data, the entities will all be listed as '0' - you'll need to re-audit the site in order to see the unique entity data.

#2 Merging entities
In terms of merging entities, this comes into play when you have data regarding a single entity loaded onto the page in more than one place (e.g. from two different sources).

Previously, Sitebulb would see this as two separate, disconnected entities. Now, if it can determine that the markup is referencing the same entity, it will merge the data into a single item.

On lots of sites this won't be the case, so you won't have experienced this as a deficiency in Sitebulb's reporting. Tony McCreath alerted us to this happening on his client site Looma's (online cake delivery!!).

Let's use their macarons page as an example:

Macarons NOT macaroons

Aside: for many years I thought these used to be called macaroons and someone just changed the name one day. Turns out macarons were always macarons, and macaroons are just a completely different biscuity-thing.

This is how Tony described the review markup on this page:

"This is an interesting page as Yotpo dynamically add reviews and then we rewrite what they add so that the data links up correctly. Sitebulb shows two separate products. In reality these are separate bits of markup representing the same Product. This is done via the use of a common @id."

Previously, Sitebulb was treating this as two separate products, and flagging a bunch of errors on the product reviews portion as it looked like an incomplete data set.

The product portion looked like this (note the highlighted @id):

Product portion

And the reviews portion looked like this (note the same @id and a pile of errors):

Reviews bit

To solve this, we now merge the data on the @id, which forms the complete product. This means that Sitebulb no longer thinks that required properties are missing, which in turn means that the errors are not flagged.

So in 4.1 this same page shows as one complete product, with no errors:

One product

We even now pull out the @id (where present) to show the unique identifier more clearly. As well as merging two or more entities, Sitebulb will also build up any nested entities by @id too.

Improved 'single page view' for structured data

Either using the standalone structured data checker tool (which checks a single URL) or drilling down into the URL details of an audit will give you a single page structured data analysis.

Much like Google's old Structured Data Testing Tool. Except neither discontinued nor completely shit.

Previously, this provided one single view, but now we have split it apart into 3 tabs to show you what is really going on with the structured data.

The first tab shows all the Google Search Features found:

Search Features Found

The second tab shows the root Schema entities:

Schema Entities

The third tab shows all the individual entities:

Individual entities

These views give you the flexibility to dig into the data as you please.

Other small improvements to structured data

A few other small things we changed regarding the structured data feature:

  • Updated how the tool parses certain nested objects.
  • Relaxed the rules around PostalAddress as we had set them to be too harsh, and errors were being flagged when they should have been warnings.
  • The expected type checking has been relaxed to fall in line with Schema.org's guidelines. An expected type can now also be text or a URL.

Added 'bulk upload' to content search feature

In our v4 update, we added 'Content Search', which allows you to specify a word or phrase for Sitebulb to search for on every page of the site.

Of course, lazy bastard SEOs came back to us with 'but what if I want to search for lots of words at once, surely I don't need to add them in one by one? SURELY?!'

Well, actually, you did need to, but now you don't. Just for you, Shirley, we added this button:

Add Multiple Rules

And it works exactly like anyone with a brain would expect it to work. Write your words/phrases, one per line, or just copy/paste the living daylights out of it.

Copy paste multiple searches phrases

When the report is complete, each rule will display as if you had entered them one by one:

Bulk Content Search

And yes, before you ask, you can just dump thousands of words in at once. Note that if you do this, the best way to access the data is to use the green Export All Search Data button you see in the image above. You CAN access the data via the URLs tab, but it will only load 50 columns in at a time, so you would need to do a lot of add/removing to see what you want. Just use the export already.

XML Sitemap build from URL List

Every so often, a customer sends through a question like, 'hey how do I do this one thing please?' and it turns out they are asking for something that Sitebulb doesn't do, and it is something that we would never ever in 1000 years have thought to add.

This little update is one of those things.

Here is what this customer wanted to do:

  1. I want to crawl my entire site.
  2. Using the URL Explorer, I want to define a very specific list of pages (e.g. 'product pages').
  3. With a couple of clicks, I want Sitebulb to generate me an XML Sitemap including only those pages.

So this is exactly what we built:

Generate XML Sitemaps

The XML Sitemap builder allows you to include priority and changefreq attributes.

using Cloudflare and SRCSET tags results in a bunch of erroneous/broken crawling.

Bugs

  • As reported by Jono in what might just possibly be the most thorough support ticket ever, in which he told us both what the problem was and why it was happening, Sitebulb had a bug such that 'using Cloudflare and SRCSET tags results in a bunch of erroneous/broken crawling.'
  • URL Details view would occasionally lock up, on websites with particularly shitty HTML. This was because in the background it was trying to load up the 'Live View', and it was struggling to process the HTML diff (due to the aforementioned dogshit HTML).
  • 'Mark as fixed' would not persist if you closed down Sitebulb then opened it back up again. Whilst this sounds utterly horrendous, in our defense it would only happen if you just ticked 'fixed' and did not enter a description in the description box (so only ~95% of all user instances).
  • On some webpages the magnificent 'point and click' selector for content extraction would work sub-magnificently. That is to say, it would not work at all.
  • If a webpage contained SVGs with titles, Sitebulb would trigger the Hint 'Multiple title tags.' Stupid Sitebulb!

Version 4.0

Released on 3rd August 2020

I'd like to start the v4.0 release notes with an apology. Below you will find *ahem* 4000 words of incomprehensible drivel and mediocre jokes, and it's too many damn pages for any man to understand.

In case you value your time too much for this malarkey, I made you a launch video as well:

Schema + Rich Results Extraction & Validation

So here we are. We FINALLY got the message, y'all wanted us to build this thing (over 300 upvotes on our feature request page).

Feature Upvote

The reason it took us so long is that we are not particularly good at doing things by half. We knew that if we were going to do this, we owed it to ourselves to do it properly.

So we spent AGES on it.

However, being British, it is impossible for me to say that something I was even tangentially involved in is anything more than 'ok'.

I'll just leave this here instead...

Aleyda tweet

Since we're too bashful for self-promotion, we'd love it if you could also share on the socialz to help us out.

This is what Sitebulb can now do for you:

  • Extract structured data marked up with JSON-LD or Microdata format.
  • Validate against Google's Guidelines for the 29 Search Results Features.
  • Validate against Schema.org data model for all other structured data.
  • Aggregate errors so you can pick out template-based problems.

Of course, you'll get some pretty charts to make your client reports look sweet af, and the historical trendlines show changes over time, allowing you to demonstrate improvements as recommendations get implemented (*ahem* or not).

Structured Data Overview

Validation using Google Guidelines

Ok, now for a word on validation. This is how validation has worked for as long as anyone cares to remember:

SEO: "Mr Client sir, I'm afraid your structured data is broken, it fails Google's Rich Results test."
Client: "Hmm, are you sure?"
SEO: "I made you a screenshot, here."
Client: "I just tried it and it passed. What are you talking about?"
SEO (panicky): "Wait, what?? What tool did you use?"
Client (reading): "Google Structured Data Testing Tool, it says here."
SEO: "Oh! Ignore that. It's old, they're getting rid of it anyway."
Client: "But why does it pass one tool and not the other?"
SEO: "That's not important. The important thing is that I'm right."
Client: "I'm patching the developer in. He wrote the code."
Developer (sulking): "Oh, it's him again. What does he want me to do now?"
Client: "He says your structured data is wrong."
Developer: "It is not! I followed Google's guidelines to the letter!"
SEO (smug): "Look, it fails Google's Rich Results test. I have a screenshot and everything."
Developer: "I just checked, it's still showing in the search results. Have we received any warnings in GSC?"
Client: "GSC?"
SEO & Developer: "Webmaster Tools!"
Client: "Oh. No. Not got any warnings."
Developer (satisfied): "See. Nothing to worry about."
SEO: "But. But my screenshot..."

The question consistently boils down to: 'which Google tool is telling us the truth?'

Google themselves deem their documentation as the ultimate source of truth:

John Mueller Tweet

So in Sitebulb, this is what we use. We painstakingly built our own data model, which validates against Google's Guidelines. And when I say we, I mean Gareth.

Schema.org and the changing world of Rich Results - SEOEdinburgh Meetup from Geoff Kennedy

Here's what it looks like:

Search Features Overview

So what this means is that inevitably, Sitebulb will sometimes disagree with Google's testing tools. But that is only because their testing tools disagree with their own guidelines.

Most often the thing we have seen is that Sitebulb will point out that a required property is missing (because Google's docs say it is required), but Google will still give it a 'pass' on the Rich Results test. With that sort of thing, our feelings tend to err on the side of:

  1. Better safe than sorry.
  2. Google have a track record of changing the goalposts without warning. So see #1.

Aggregated Errors

So here's the other problem with structured data validation - issues typically occur on page templates, yet all the reliable testing tools will only take one page at a time.

Identifying and fixing problems at the template level can enable you to resolve problems with a lot of pages all at once.

For this reason, Sitebulb aggregates issues across multiple pages, so you can pick out the specific template error and list all the pages affected.

In the example below, there are some errors with 'Breadcrumb'. Clicking the red See Errors button allows you to...see the errors. In this case they all have the same error - they are missing the required property 'item.'

Issue aggregation

Whilst you probably want to report issues as a template level, you will also probably want to investigate further and pick out some specific examples to send to your clients.

Everything can be drilled down in Sitebulb, so you can check the data in a list:

Or you can explore at a URL level and dig into the markup, with all the issues and errors highlighted for all entities found on the page.

Structured data drilldown

Schema.org Validation

Fixing structured data issues in order to satisfy Google's search feature requirement is typically the number 1 thing that SEOs wish to do, but there is a whole ecosystem outside of Google that can make use of structured data.

For instance, new technology like the Semantic (*mumble*) Connectorising (*mumble*) Graphalizer.

Ok, ok - it's mostly just Google. BUT Google don't let us know when they are adding stuff so better to be ahead of them by sticking to the standard. And some people just like to mark up everything in sight.

This is why Sitebulb also collects and validates all of your Schema.org markup, validating against the published Schema.org documentation.

Schema validation

This also means you can start to use the data in more interesting ways. For example, finding your products with the most reviews or best review ratings. Sitebulb has a built-in entity explorer and property explorer, so you can interrogate the data to unearth interesting opportunities.

Entity Property explorers

Write and fix structured data markup on the fly

As is their wont, Google managed to piss a lot of SEOs off recently, by announcing that their Structured Data testing tool is being shut down.

Barry Adams tweet

You may already be looking for a new method for building and testing your structured data markup in anticipation.

Sitebulb has this covered too.

From the tools menu at the top, choose the Structured Data Checker tool, which will check and validate structured data on any URL or code snippet you enter, allowing you to quickly iterate and re-check as you go.

Structured data checker

That brings us to the end of our structured data feature tour. Oh I nearly forgot to tell you how to switch on the structured data extraction: you tick a single box during the audit setup. Done.

And finally, let me direct you to our documentation area, which will tell you all you need to know about structured data with Sitebulb, and probably more.

Point-and-click Content Extraction

It may also seem like we’re a little late to the party when it comes to content extraction, since practically all the other major crawling tools have it already.

In fact, we were early, developing a 'custom scraper' feature back in 2014, for our first product, URL Profiler. 

And so we know what 5 years of support requests look like for this feature…

  • How do I figure out the CCS selector please?
  • What is the XPath to find the price on these pages?
  • Why doesn’t it work on <INSERT RANDOM AngularJS> website?
  • Who, exactly, is Reg Ex?

And it’s not just the support requests, it’s also all the documentation around these features. Go check out any how-to guide for custom extraction, you’ll see a very familiar intro… 'before we can show you any of the good stuff, let’s have a quick 1000 word primer on the fundamentals of XPath.'

We just figured that THERE HAS TO BE A BETTER BLOODY WAY OF DOING IT THAN THIS.

And there is.

Without further ado, I present to you, Sitebulb’s Content Extraction (AKA 'Custom Extraction in the 21st Century').

Point and click content extraction

Point and click baby.

Our workflow addresses all of the major pain-points that usually accompany this feature:

  1. Load in a test URL into the visual selector window (also works with JavaScript frameworks).
  2. Point and click the element you want to scrape.
  3. Give the datapoint a meaningful name.
  4. Check the 'Test' tab to ensure it is working as you expect.

This means it works on any website you throw at it. It means you don’t need a degree in advanced Regex to figure out the extractor. And it means you don’t need to crawl the website 37 times in order to test your selectors.

The results are presented very clearly for you in the new report 'Content Extraction', with both an overview of the data found:

Content Extraction Overview
And the raw data itself in a URL List:

Content Extraction URL List

As always, you can filter the data or add/remove columns, allowing you to mix and match crawl data with your content extractions.

Advanced Query Data

Advanced usage (Regex and whatnot)

The point-and-click functionality means that anyone can get going with content extraction, but in general we consider content extraction a relatively advanced feature. We are a caring bunch here at Sitebulb, and we were conscious that this feature also needs to cater for the 7 SEOs worldwide who are competent in Regex, so we have some more advanced functions.

We have a complete Advanced Guide in our Documentation, so I won’t cover all the ins and outs here, but I do want to whet your appetite with some tidbits:

  • Regex scraping IS available, but we offer it within the CSS selector path. This means you don’t need to write regex to search the entire document, making it a lot more user friendly.
  • You can also pull out a specific Regex group from your pattern.
  • Use a range of in-built 'operations', like matching the first/last/all items, performing a count, or checking that an element exists.
  • Match on different data types, to automatically extract URLs, phone numbers, email addresses etc... - WITHOUT writing complex Regex patterns - let Sitebulb do the dirty work.
  • Apply specific 'URL Matching' rules, to only perform the content extraction on specific pages. This means, for instance, you can instruct Sitebulb to crawl the entire site, but only grab the pricing info from the /product/ pages. This saves on collecting spurious data you won’t ever use, saves on processing during the crawl itself, and saves on data storage on your hard disk.

At this point I feel a gif to show off some of this stuff is the least you deserve:

Operations example

We'll be publishing a bunch of real-life examples of using this data for good in the not-too-distant future. In the meantime, our docs contain a bunch of examples and screenshots which explain the ins and outs of this feature.

Content Search

The logical bedfellow to content extraction is of course content search. For the uninitiated, this basically means 'search every page on a website for a specific word or phrase.'

Such unsuspecting cherubs may also fail to appreciate the value of such a feature, so let me illuminate. The results of the search allow you to then filter pages based on whether or not they contain certain words.

For example:

  • Check if ecommerce product pages contain 'out of stock' messaging.
  • Check which pages reference a particular brand name or company name.
  • Understand which pages mention certain target keywords (for building internal links).

For a basic search, all you need to do is enter the text and hit 'Add Rule', and that's all there is to it.

Enter text to search

Once you've added your rule, you can stop there, or just keep adding more rules. You will see all your rules in the audit setup page, ready for you to start the audit.

For example, if we wanted to crawl our site and understand how often we reference Sitebulb as a 'crawler' vs a 'website auditor', we could set it up like this:

Content Search Options

There is no limit to the number of rules you can add, so collect all the data you need.

Once your audit is complete, you can access the data report using the left hand menu.

The Overview will show you details of the data totals for each different search phrase:

Content search overview

The two data columns tell you slightly different things:

  • Total Found = the total number of instances that Sitebulb found the phrase, even if some of them were on the same page.
  • Found on URLs = the number of unique URLs that Sitebulb found the phrase on.

Without even analysing the data in detail we can already see that 'crawler' is dominant. 

To see the detail of specific URLs, we need to switch to the URLs tab, which shows the URLs alongside columns headed by the text/phrase filters. The numbers in each cell relate to how many instances of the phrase were found on each page.

Sort URL List Data

As always with URL Lists, you can add or remove columns so that you can easily combine technical crawl data with your extracted data. You can also create filters on the data to gain additional insights.

Grouping phrases

Those with a little more experience might be looking for something a little more risqué, and this is where our Advanced Content Search comes into play.

The concept is relatively straightforward, we are replacing 'word/phrase' with a combination of words to search for.

Instead of restricting yourself to only one word, why not try two or three?

Let's work through an example. Imagine we are auditing a travel website. We want to identify pages that talk about specific winter sports, so we could set it up like this:

Winter sports

Once this rule is applied, Sitebulb would search for any pages that contain either 'skiing', 'snowboarding' or 'ice skating' (or any combination of the three). Verifiably polyamorous.

BTW, the requirement to provide a 'Rule Name' is simply to make it easier to view the results in the report:

Advanced Results

In this case, the numbers returned in the 'Winter Sports' column reflect the total number of matches. So a result of '6' might mean that 'skiing' is mentioned 4 times, 'snowboarding' 2 times and 'ice skating' not at all.

Now, imagine we wanted to identify pages that talk about specific winter sports, but only for certain countries. We could rule out specific countries by adding them in the right hand 'does not contain' box, e.g.

Winter sports not europe

Once this rule is applied, Sitebulb would search for any pages that contain either 'skiing', 'snowboarding' or 'ice skating' (or any combination of the three) AND ALSO contain none of 'france', 'spain', 'italy' and 'austria.'

What this does is surface the pages about USA/Canada instead of Europe, as we wanted:

Canada USA winter sports

Using this combination approach allows you to do things like categorise pages based on topic, or group them based on a set of target keywords - which could then be used for content audits or internal linking strategies.

Updated Chromium Crawler

As per our commitment to maintaining our Chrome Crawler to be evergreen, like the Googlebot crawler, we have updated it to use the latest stable version of Chromium, version 85 (85.0.4182.0, to be precise).

Moved Code Coverage/Accessibility

With all of this new stuff we've added, we were running out of real estate on our audit setup page, it was as crowded as a crappy British beach on the only sunny day in lockdown. So, we moved some stuff around, which you might not notice if you don't read the next couple of paragraphs. Or, most likely, even if you do.

The 2 audit options we have moved are:

  • Code Coverage
  • Accessibility

They can now be found in Advanced Settings:

Additional Audit Options

IMPORTANT NOTE: You will only find them in there when you have the Chrome Crawler selected. Otherwise you'll just be presented with a greyed out panel.

Crawl Subdomains

When we first built Sitebulb, we set out our stall to be a 'one website at a time' type of SEO tool. Other crawler tools take a different approach, but we were happy to be different. We would proudly declare that you can only ever crawl one subdomain at a time with Sitebulb, as each different subdomain should be considered a website in and of itself.

We stubbornly clung to this philosophy, even when it was becoming abundantly clear that we were wrong.

Sometimes you genuinely do encounter a site that has been set up with subdomains so tangled and deeply intertwined in the architecture that it becomes impossible to know where one ends and the other begins. Like institutional racism.

But wrong we were, and so fix it we have. You can now navigate to advanced settings and set up Sitebulb to crawl subdomains.

In Advanced Settings, the Subdomains section is under Crawler -> Subdomains

Advanced Settings - Subdomains

What you need to do is tick the second box here: 'Treat Subdomains as Internal' - this means that any subdomain URLs that Sitebulb encounters will be treated as internal URLs. On other words, ticking this box is 'how to crawl all subdomain URLs on a website'. Sitebulb will run all the normal 'internal checks' on these URLs, including internal link calculations.

When you do tick the box, you will be now faced with the option of explicitly adding the subdomains, one per line in the box underneath.

If you tick the box and don't add any subdomains into the box, ALL subdomain URLs will be crawled. However if you do add some subdomains to the box, ONLY those subdomains will be crawled.

Simply enter the subdomains, one per line. In the example below, Sitebulb would crawl https://blog.example.com and https://support.example.com and treat those subdomains as internal (but NOT https://community.example.com or any other subdomain).

Enter subdomains

When you come to look at the completed audit results, you will see that URLs from subdomains will now appear in the 'Internal URLs' lists, and they will have URL Rank (UR) scores.

Subdomains in internal

Similarity % added to Similar Content check

We know we do a lot of updates. Some users even say it is annoying. But we are (normally) giving you lots of new stuff to play with. Anyway, because of this, sometimes we fear that stuff gets missed.

Take 'Similar Content' (AKA 'Near Duplicate Content') - we added this way back in October last year, but it feels like nobody noticed it.

But I'm not mad about it.

Instead, we'll just focus your attention by improving the feature, giving me another opportunity to get out my drum and bang on about it again.

You'll find Similar Content in the Duplicate Content graph (where it has always been), and also in the Duplicate Content Hints.

Similar Content graph

Now, when you click through from this chart to view the data, you'll see a new column 'Closest Similarity.' This metric allows you to understand where you have the biggest 'problems' in terms of similar content.

Closest SImilarly added

Figuring out which specific URLs are similar is also a piece of cake, just click the URL Details and view the Duplicate Content tab.

Similar Content tab

As always, you can just export the bulk data to CSV or to Google Sheets with a press of a button, if you'd rather analyse it in a spreadsheet.

Since I finally have your attention in this regard (I'm not mad about it), I thought I might share some technical details about how Sitebulb calculates similar content.

Firstly, similar content is flagged as standard, it's not a configuration option you need to turn on. Secondly, you don't need to worry about specifying content areas - Sitebulb already 'dechromes' each page of its navigational elements, then minhashes the remaining content. This means you don't need to worry about having different page templates and specifying the content area for each (which we feel is likely similar to how a search engine would handle the content).

Finally, the new similarity figure is detected using a Jaccard similarity algorithm. Here's some further reading on the topic if you would be so inclined.

Although I'm sure that you won't be. I expect I lost you at 'updates.'

But I'm not that mad about it.

Mark Hints as Fixed

Lots of users like to treat Sitebulb's Hints as a checklist, before diving deeper into other issues.

To aid such a process, we have added the option to mark a given Hint as 'Fixed.' Now, it need not literally be fixed, you could check it off simply because you have already investigated it and deemed it not a problem.

Frankly, you can do with it as you damn well like.

Now, every Hint has a 'Mark as Fixed' button on the right hand side:

Mark as fixed

If you wish to mark a Hint as fixed, simply click that button, optionally add a comment, and then 'Tick':

Mark Hint as fixed

You will see a green 'Fixed' box appear in the top right hand corner of the Hint. Hover over this green box in order to see the comment.

Moving forwards, this Hint will remain 'Fixed' within this Audit. So if you navigate into a different Audit or Project, and return to this Audit, the Hint will still be 'Fixed'.

Similarly, in other areas of the Audit, such as the 'All Hints' page, the 'Fixed' designation will carry through, for any Hints you have marked as fixed.

All Hints - Hint fixed

There are two important things to note regarding this functionality:

  1. Marking a Hint as 'Fixed' does not alter any of the audit scores. It is designed simply to be a mechanism for marking items as done.
  2. The 'Fixed' designation only applies to this single Audit. If you run future Audits on the same website, none of the Hints will be marked as fixed.

As a follow-up to #2 above, you may instead wish to 'Ignore' certain Hints, which is handled by slightly different functionality - read about ignoring Hints below. It might be prudent to prepare yourself in advance for my scathing tone.

Ignore Hints

In general, we receive a lot of plaudits for our Hints. Our customers seem to think they are useful, and help them do their jobs more efficiently. We spent hundreds of hours lovingly crafting them for your consumption: coding them, scoring them, and writing up their 'How to fix' pages.

Yet.

Every once in a while, a customer will approach us and say things like 'I hate this Hint, can't I just ignore it?'

Up to now, our reaction has always been, 'Fuck off, you ungrateful bastard.'

And whilst this sentiment still remains, now we actually let you do it. And no, before you ask, I'm not going to demo it here; because, frankly, I'm still a bit pissed off that you asked for it in the first place. You can make do with an unordered list.

Ignoring Hints can be controlled in two places:

  • Advanced Settings - this allows you to ignore Hints for a specific Audit/Project.
  • Global Settings - this allows you to ignore Hints for ALL future Audits/Projects.

Follow the links if you really want to know more.

Now get out.

Fixes

  • On sites with hreflang that utilised x-default, the x-default value was causing a conflict, so Sitebulb would report the page as having no hreflang at all.
  • The Security checks were not detecting when TLS 1.3 was enabled on a server. Which is somewhat counter-intuitive.
  • The mixed content export was all coming out on a single row.
  • Scaled images Hint was not firing on certain pages that had scaled images. The code wasn't checking the data-src attribute and bypassing the check.
  • <picture> image references in HTML were being reported as 404s.
  • A specific HTML formatting issue was causing <h1> tags to be classed as 'missing' by Sitebulb when they were actually there.

Archives

Access the archives of Sitebulb's Release Notes, to explore the development of this precocious young upstart:

Free 14 day trial.
Full, unrestricted access.
No credit card required.

Try Sitebulb for Free