Thread management

To get the most out of Sitebulb Server, you may need to familiarise yourself with 'thread management.' This document is intended to help you understand how your server resources are divided up between the various tasks required.

What are threads?

A thread is a basic unit of processor usage on a computer; threads are what allow processes and programs to run. Some programs (such as Sitebulb) can do lots of things at once by using multiple threads.

In general, the more threads you have available, the more processes you can run at once. 

The number of threads you have available is determined by the number of cores on your machine.

As a rule of thumb, on a dedicated machine each core gives you 2 threads:

  • 4 core server = 8 threads available
  • 8 core server = 16 threads available
  • 12 core server = 24 threads available

Note: If you are using Sitebulb Server on AWS/Google Compute, you will find that each vCPU only gives you access to 1 thread - this is because they are virtualized machines rather than dedicated machines ('vCPU' is a virtual CPU, also known as a virtual processor).
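The rule of thumb above can be written as a small calculation. This is only an illustrative sketch: the function name `available_threads` and the `virtualized` flag are made up for this example and are not part of Sitebulb.

```python
def available_threads(cores: int, virtualized: bool = False) -> int:
    """Estimate available threads from the rule of thumb in this article.

    Assumption: 2 threads per core on a dedicated machine,
    1 thread per vCPU on a virtualized machine (e.g. AWS/Google Compute).
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cores if virtualized else cores * 2


# Reproduce the examples from the list above.
for cores in (4, 8, 12):
    print(f"{cores} core server = {available_threads(cores)} threads available")
```

For a cloud instance, `available_threads(8, virtualized=True)` would give 8 rather than 16, which is worth bearing in mind when comparing a cloud server with a dedicated one.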

Why are threads important with Sitebulb?

Threads are the core resource that Sitebulb needs in order to run properly. You may be familiar with threads from the Crawler Settings:

Number of threads

The basic equation is pretty straightforward - the more threads you allocate to an audit, the faster Sitebulb will be able to crawl the website. It will have more threads available to download the page content, parse the HTML, perform analysis and so on.

But when it comes to the server version, it gets more complex:

  • When a user logs onto the server, this will take up a thread (per user)
  • When audits are set to run concurrently, they will all take from the same pool of threads

So when you are using Sitebulb Server, the number of logged-on users and the number of concurrent audits are the things that will limit how much you can get out of the software.

What does this look like in practice?

Let's say you have a server with 8 cores, which translates to 16 available threads.

If 3 users want to regularly access the server, this will leave 13 threads available for crawling.

3/16 threads used

Now, if we set two audits running to use 5 threads each, we'd only have 3 threads left.

13 threads used

If you then try to start another audit, also at 5 threads, Sitebulb won't let you do it. This would require 18 threads in total (3 + 5 + 5 + 5), more than the 16-thread capacity of the machine. If it were allowed to run at this speed, Sitebulb would suffer 'thread starvation' and eventually crash - as it would not have enough resources available to run.

Sitebulb would allow you to run an audit at 3 threads, but that would literally be using all the threads on the machine, which gives the server no headroom at all, so it's far from ideal.
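The arithmetic in this example can be sketched in a few lines. This is a simplified model of the numbers above, not Sitebulb's actual internal logic; the function names `threads_remaining` and `can_start_audit` are hypothetical.

```python
from typing import Sequence


def threads_remaining(total_threads: int, logged_in_users: int,
                      running_audits: Sequence[int]) -> int:
    """Threads left over, assuming 1 thread per logged-in user and
    one shared pool for all concurrently running audits."""
    return total_threads - logged_in_users - sum(running_audits)


def can_start_audit(total_threads: int, logged_in_users: int,
                    running_audits: Sequence[int], requested: int) -> bool:
    """True if the requested audit fits within the remaining threads."""
    return requested <= threads_remaining(
        total_threads, logged_in_users, running_audits)


# 8 core server = 16 threads, 3 users, two audits at 5 threads each.
print(threads_remaining(16, 3, []))          # threads left for crawling
print(threads_remaining(16, 3, [5, 5]))      # threads left after two audits
print(can_start_audit(16, 3, [5, 5], 5))     # a third audit at 5 threads
print(can_start_audit(16, 3, [5, 5], 3))     # a third audit at 3 threads
```

Note that the last case fits, but only by using every thread on the machine, which is the 'no headroom' situation described above.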

And more generally, when you think about utilizing your server, it's not a great idea to be hammering most of the processing power for long periods of time, which is why it is important you choose a server that's suitable for your needs (please check our pricing wizard if you need recommendations).

How Sitebulb helps you manage your threads

As we have seen, there are three factors which determine how threads are used:

  1. Users logging onto the machine
  2. Whether you crawl concurrently or not
  3. How fast you crawl websites

And you can have complete control over each of these elements when configuring Sitebulb. 

#1 Users logging onto the machine

You can control how Sitebulb Server is set up by connecting to your server via the Admin secret key.

Once connected, click the purple Server Settings button at the top of the projects list (if you do not see this button, you are not connected using the Admin secret key).

Click Server Settings

This will bring you to a settings page that looks like this:

Sitebulb Server Management

The bottom option here allows you to adjust a value for 'Reserved Threads.' These are basically threads that you do not wish to be used for crawling, which helps ensure that there are sufficient threads saved for users to log onto the server.

Reserved Threads

You set the number of reserved threads, and this number of threads will not be available for crawling. So if you know that 2 users will regularly need to be logging on to the server, you will want to set this value to be at least 2. You may wish to set it a bit higher if you plan to do concurrent crawling and you want to ensure your server has a bit more headroom.

Here are some rule-of-thumb suggestions:

  • 1 reserved thread if you are the only user (minimum)
  • 2 or 3 reserved threads for 2-4 users
  • 3 or 4 reserved threads for 5-8 users

If you (or your team mates) start to see Sitebulb become a bit laggy when accessing the reports, it likely means that the server is struggling with the combined crawl activity and user activity. If this happens, increase the number of reserved threads.
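The rule-of-thumb suggestions above can be expressed as a simple lookup. This is a minimal sketch for planning purposes only; `suggested_reserved_threads` is a hypothetical helper, and because the suggestions stop at 8 users, the sketch deliberately gives no answer beyond that.

```python
from typing import Tuple


def suggested_reserved_threads(users: int) -> Tuple[int, int]:
    """Return the (low, high) suggested range of reserved threads.

    Mirrors the rule-of-thumb list in this article:
      1 user    -> 1 (minimum)
      2-4 users -> 2 or 3
      5-8 users -> 3 or 4
    """
    if users < 1:
        raise ValueError("users must be at least 1")
    if users == 1:
        return (1, 1)
    if users <= 4:
        return (2, 3)
    if users <= 8:
        return (3, 4)
    # The article gives no suggestion for larger teams.
    raise ValueError("no rule of thumb given for more than 8 users")


low, high = suggested_reserved_threads(3)
print(f"Reserve between {low} and {high} threads for 3 users")
```

If the app becomes laggy, treat the upper end of the range as the starting point and go up from there.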

#2 Whether you crawl concurrently or not

One of the things that Sitebulb Server allows you to do is run audits concurrently, which means that they can run simultaneously alongside one another. When you access the server settings with the admin key (above), you will be able to check a tickbox to enable concurrent crawling:

Concurrent audits 

However, just because you can do it does not always mean that it is a good idea. On smaller servers, concurrent crawling can max out the resources so much that the app becomes unusable and, in the worst case scenario, the server itself crashes.

So it is important you determine whether concurrent crawling is actually necessary, and set your server up appropriately.

#3 How fast you crawl websites

On bigger, more powerful machines, you will be able to assign more threads to crawling a particular website - which means it will crawl faster and you can access your audit sooner.

You can adjust the Crawler Settings to increase the number of threads, or Chrome instances (1 Chrome instance = 1 thread) when using the Chrome Crawler.

Instances of Chrome

As always, however, we advise caution with this, since crawling a website too fast can cause the website server to crash - instead we recommend crawling responsibly, which Sitebulb will try to do by default.

If you are working with a site that can handle much faster crawling - and have the necessary permissions to do so - you can theoretically ratchet Sitebulb up to use all the available threads for crawling.

Again, it is not really recommended to do this for extended periods of time. If you think the server is starting to struggle (e.g. app becomes laggy to use), it is probably wise to pause the audit, then update the settings to reduce the number of threads/instances that are being used.

Take responsibility for your server

Sitebulb Server is designed to be run on powerful machines that allow you to do more with the software. However, this does not mean that you can necessarily do 'all the things at once', and particularly not if the server you are using doesn't have tons of resources.

Taking responsibility for your server and managing it properly will help ensure that you and your team mates are able to continue using Sitebulb without interruption.

There are a number of ways in which you can take responsibility:

#1 Get the right server for your needs

If you know you have 10 team members who will want to log on to the server concurrently, buying/renting a single small server (e.g. a 6 core server) is asking for trouble. We have put together a set of recommendations, along with some guidance about picking the right server for your situation.

#2 Don't schedule everything all at once

Sitebulb Server is perfectly suited to regular recurring audits. But scheduling all your monthly client audits for the first day of the month means that your server will end up getting hammered on the 1st every single month.

Instead, it would be reasonably straightforward to schedule a handful for the 1st, a handful for the 2nd, a handful for the 3rd and so on, while still having your data ready when you need it.
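As a planning aid, the idea of spreading audits over consecutive days can be sketched as follows. This does not talk to Sitebulb in any way; the function `stagger_audits`, the client names and the dates are all hypothetical, and the resulting plan would still need to be entered into Sitebulb's own scheduling settings by hand.

```python
from datetime import date, timedelta
from typing import Dict, List


def stagger_audits(audit_names: List[str], start: date,
                   per_day: int) -> Dict[date, List[str]]:
    """Assign audits to consecutive days, at most `per_day` on each day."""
    if per_day < 1:
        raise ValueError("per_day must be at least 1")
    schedule: Dict[date, List[str]] = {}
    for index, name in enumerate(audit_names):
        # The first `per_day` audits go on the start date, the next batch
        # on the following day, and so on.
        day = start + timedelta(days=index // per_day)
        schedule.setdefault(day, []).append(name)
    return schedule


clients = ["client-a", "client-b", "client-c", "client-d", "client-e"]
for day, audits in stagger_audits(clients, date(2024, 1, 1), 2).items():
    print(day.isoformat(), audits)
```

A lower `per_day` value spreads the load more thinly, at the cost of some audits finishing later in the month.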

It also makes sense to schedule your weekly/monthly audits to run outside of working hours, as this means that more of your machine's resources are available for team members to log onto the machine and view reports.

#3 Pay attention to the dashboard

The 'Registered Servers' list enables you to see remaining resources at a glance:

Server Resources

Use this to help stay on top of the thread management situation, as well as the amount of RAM and disk space available.
