Scrape, search and extract specific content on a website while Sitebulb crawls.
I love Sitebulb; you set up your crawl and all the insights are there. It really sets you up for success. And I love the Sitebulb content extraction tool; you can grab stuff on the fly and it really makes your life easier. Over the years I've been using it, it's saved me a good amount of tears and hours.
Content extraction allows you to customize the output you get from Sitebulb's website crawler, and include specific datapoints that you need to help you make better decisions. You can use it to help determine popular content by scraping comment counts, assign author names to URLs from a blog, or count the number of products in each category on an ecommerce store.
Content search enables you to identify pages which contain specific words or phrases, which is a crucial feature for rebranding/migration exercises, and extremely useful for topic modelling and finding internal linking opportunities.
Sitebulb's extraction setup is a cinch for beginners to get the hang of, whilst also offering extremely advanced configuration options for more experienced users.
If you're familiar with custom extraction on other crawler tools, you'll know how frustrating it can be to set it up correctly. Sitebulb is designed to remove these pain points, with an intuitive system that does the heavy lifting for you.
This means it works on any website you throw at it. It means you don’t need a degree in advanced Regex to figure out what selector to choose. And it means you don’t need to crawl the website 37 times in order to test your selectors.
See a piece of data you'd like to extract, then just point-and-click, and Sitebulb will choose the correct CSS selector for you.
On other crawler tools, divining the correct CSS selector can be a bit of a mission, involving digging around in Google Chrome DevTools and hoping you have got the right one.
The handy 'Test' tab means you can check the result of your extraction setup without running a crawl, which is a significant time saving.
The test feature allows you to tweak different extraction options to ensure you have picked the right selector and data type, or load in different URLs to ensure that the extraction works across different pages.
Extraction is not restricted to simple scraping; you can also perform operations such as counting elements or checking whether an element exists. You can also apply URL pattern matching to each extraction rule individually, so that computer resources are not wasted on pages where the rule does not apply.
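To make these operation types concrete, here is a minimal Python sketch. It is not Sitebulb's implementation: it uses the standard library's ElementTree, with its limited XPath support standing in for CSS selectors, and the rule format, the `run_rule` helper, the sample markup and the example.com URL are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree needs valid XML).
PAGE = """
<html><body>
  <h1>Garden Chairs</h1>
  <div class="product"><span class="name">Oak chair</span></div>
  <div class="product"><span class="name">Teak chair</span></div>
  <div class="product"><span class="name">Pine chair</span></div>
</body></html>
"""

def run_rule(url, html, rule):
    """Apply one extraction rule; return None when the URL pattern does not match."""
    if not re.search(rule["url_pattern"], url):
        return None  # skip the page before doing any parsing work
    root = ET.fromstring(html.strip())
    matches = root.findall(rule["path"])
    if rule["operation"] == "text":
        return matches[0].text if matches else ""
    if rule["operation"] == "count":
        return len(matches)
    if rule["operation"] == "exists":
        return bool(matches)
    raise ValueError(f"unknown operation: {rule['operation']}")

# Hypothetical rules: each one carries its own URL pattern.
rules = [
    {"name": "heading", "path": ".//h1",
     "operation": "text", "url_pattern": r"/category/"},
    {"name": "product_count", "path": ".//div[@class='product']",
     "operation": "count", "url_pattern": r"/category/"},
    {"name": "has_reviews", "path": ".//div[@class='reviews']",
     "operation": "exists", "url_pattern": r"/category/"},
]

url = "https://example.com/category/garden-chairs"
results = {rule["name"]: run_rule(url, PAGE, rule) for rule in rules}
print(results)
```

Because the URL check happens first, a rule aimed at category pages never parses a blog post, which is the resource saving that per-rule pattern matching is meant to deliver.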
Scraping data with regex is typically an advanced use-case, and Sitebulb includes advanced functions to suit. You can specify a CSS selector in which to perform the regex pattern matching - rather than simply across the entire HTML document - and additionally pull from specific regex groups.
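As a rough illustration of why scoping a regex to one element matters, the Python sketch below runs the same pattern against a single element and against the whole document. It is an assumption-laden example rather than Sitebulb's own logic: ElementTree paths stand in for CSS selectors, and the `regex_in_element` helper and the sample markup are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """
<html><body>
  <p class="intro">Over 5000 customers served.</p>
  <div class="stock">SKU: AB-1234 | In stock: 17 units</div>
</body></html>
"""

def regex_in_element(html, path, pattern, group=0):
    """Run `pattern` only against the text of the first element matching `path`."""
    root = ET.fromstring(html.strip())
    element = root.find(path)
    if element is None:
        return None
    text = "".join(element.itertext())
    match = re.search(pattern, text)
    return match.group(group) if match else None

STOCK = ".//div[@class='stock']"
pattern = r"SKU: ([A-Z]+-\d+) \| In stock: (\d+)"

sku = regex_in_element(PAGE, STOCK, pattern, group=1)    # first capture group
units = regex_in_element(PAGE, STOCK, pattern, group=2)  # second capture group

# The same simple pattern gives different answers depending on scope.
scoped = regex_in_element(PAGE, STOCK, r"\d+")
unscoped = re.search(r"\d+", PAGE).group(0)

print(sku, units, scoped, unscoped)
```

Searching the entire document picks up the first number anywhere on the page, whereas the scoped search only sees numbers inside the stock element. Selecting a group then returns just the part of the match you need.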
Content search is a simple concept: provide a word or phrase for the crawler to check for in the HTML of every page. The setup in Sitebulb does not overcomplicate this basic process.
As always, however, Sitebulb offers that little bit more for those that want it, allowing you to combine words and phrases and even set exclusion patterns.
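Combining phrases with exclusion patterns can be sketched in a few lines of Python. This is only an illustration of the idea, applied to a rebranding check; the `page_matches` helper, its parameter names, the brand names and the sample pages are all hypothetical and do not reflect Sitebulb's configuration format.

```python
import re

def page_matches(html, all_of=(), any_of=(), none_of=()):
    """Return True when the page satisfies the include and exclude phrase lists."""
    text = html.lower()

    def has(phrase):
        # Whole-word, case-insensitive match; assumes the phrase starts
        # and ends with a letter or digit so that \b behaves as expected.
        return re.search(r"\b" + re.escape(phrase.lower()) + r"\b", text) is not None

    if any(has(p) for p in none_of):
        return False  # an exclusion phrase was found
    if not all(has(p) for p in all_of):
        return False  # a required phrase is missing
    if any_of and not any(has(p) for p in any_of):
        return False  # none of the optional phrases is present
    return True

# Hypothetical crawl results: URL path mapped to page HTML.
pages = {
    "/about": "<p>Acme Ltd was founded in 1999.</p>",
    "/news": "<p>Acme Ltd is now Zenith Group.</p>",
    "/contact": "<p>Contact Zenith Group today.</p>",
}

# Pages that still mention the old brand and never mention the new one.
old_brand_only = [
    url for url, html in pages.items()
    if page_matches(html, all_of=["Acme Ltd"], none_of=["Zenith Group"])
]
print(old_brand_only)  # ['/about']
```

The exclusion list is what keeps the announcement page out of the results: it mentions the old brand legitimately, so only pages that have not been updated at all are flagged.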
Make yourself and your team indispensable to your clients by providing meaningful business data that satisfies numerous different departments.
Explore some of the things you can do with content extraction and search:
Make the merchandising team love you by providing a regular list of out of stock products.
Collect vital competitor data like prices, product codes and stock availability.
Search for multiple keywords at once to find content overlap and cannibalisation, or opportunities for internal linking.