The Black Art of Background Processing

For the last few weeks I've been trying to think of a blog post idea that would be interesting to read and write, and that hasn't already been discussed in great depth, and it wasn't until the other day that something came to me. There we were, working away, when someone came into the room and asked: “What do you think is the best way to process a long-running task in the background? Do you think it's good practice to fork a process from PHP?” The question got me thinking: is it bad practice to fork a process in PHP? My initial, gut reaction was to say yes. We have so many different options and methods designed for processing background tasks, but when you really start to think about it, forking might not be such bad practice, so long as you don't expect a response from the forked process and you don't affect the performance of Apache (or whatever server you use). I still stand by my initial answer of “yes, it is bad practice” for situations where you may be handling more than ten or so long-running tasks, but the following code should suffice where a small number of requests is concerned, and it removes all the additional overhead associated with the other options discussed in this post.

// Start the task in the background, then continue outputting your
// response to the user. Redirecting both stdout and stderr means
// exec() returns immediately instead of waiting for output.
$cmd = "php location/to/some/long/running/task.php";
exec($cmd . " > /dev/null 2>&1 &");
echo(...);

So what are our options?

Well as it turns out, there is a wealth of options to choose from when it comes to background processing including:

  • Cron jobs – everyone knows what a cron job is – simply put, it's a scheduled task

  • Gearman – A highly scalable, language independent, open source background task manager

  • Event listening/triggering – not really practical with PHP

  • A polling script – A bit rudimentary but it works for a lot of things

  • Database triggers – only really an option if the task can be written in a stored procedure

There are more, I'm sure, but these are the options that are most widely used and therefore the ones I will discuss.

Cron jobs

Cron jobs have existed in Unix-based operating systems since 1979 (Version 7 Unix). They have served us well and will continue to do so for the foreseeable future, but there are some drawbacks to using them to process background tasks in a web environment. In the fast-paced, dynamic world of the modern web, any delay is considered unacceptable. If you, and more importantly your clients, have to wait for a cron job to be scheduled and pick up your task, you could very well lose your competitive edge. To fix the problem of long delays we could set the cron to run every minute, but in a server environment where every instruction execution counts this is not an ideal solution; after all, we're paying for the CPU time. We need to come to a compromise, and this tends to be around the 5 minute mark. That still means your clients could be waiting up to 5 minutes before they see the effects of their actions. For a database import that is not a problem, but what if the action was to divert a plane away from a potential crash? Cron jobs are clearly unacceptable for time-critical tasks, or where there may be hundreds or thousands of tasks that need running every minute.
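As a sketch of the 5-minute compromise above, a crontab entry for a task-runner script might look like the following. The script and log paths here are hypothetical examples, not from any real project.

```
# min  hour  dom  mon  dow  command
*/5    *     *    *    *    php /var/www/scripts/task-runner.php >> /var/log/task-runner.log 2>&1
```

Redirecting output to a log file keeps a record of each run without cron emailing you the output every five minutes.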

Gearman

Gearman is a highly scalable, powerful and potentially load-balanced open source solution to the background task problem. It is a scheduling manager that can handle many workers across multiple nodes, and it is very good at handling bulk tasks such as mailing list distribution, user uploads, image manipulation and large dataset processing. It is also very flexible in what you can do with it: it simply runs your background code when a new task is added. One major disadvantage of Gearman is its overhead. We have used Gearman for a few of our projects here at Clock, and it's certainly not the easiest tool to set up. For large projects with thousands of tasks it's the ideal solution, but for small websites, in my personal opinion, its overhead outweighs its advantages.
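To give a feel for the model, here is a minimal sketch using the PECL gearman extension. The function name, server address and workload format are all invented for illustration; in practice the client and worker would live in separate scripts.

```php
// Client side: queue a background job and return to the user immediately.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('resize_image', json_encode(['path' => '/tmp/upload.jpg']));

// Worker side: a separate long-running process that picks jobs up.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('resize_image', function (GearmanJob $job) {
    $args = json_decode($job->workload(), true);
    // ... do the heavy lifting here ...
});
while ($worker->work());
```

Because `doBackground` does not wait for a result, the web request finishes as soon as the job is queued.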

Event listening/triggering

There's no real event-handling API built into PHP, and that makes listening for events a near impossibility. Although event listening seems to be ruled out in PHP, we can still trigger events by using PHP's “exec” function to call the system's “kill” command with the signal number and the process we want to signal. The listening process obviously has to be built in a language that supports listening for signals. I'm aware of a PECL library named “libevent”, but it is still in beta and not really usable in production environments just yet. There are some obvious disadvantages to this approach: you have to load the kill process into memory every time you trigger an event, your listening process then has to fork the hard part off so that it can continue listening, and it all seems a little hacky. Nevertheless, signals can be very useful and lightweight, so maybe they are the ideal solution for the smaller environment. Signals can be used to process a request immediately and prevent your clients from waiting for some task that never seems to end.
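The triggering side can be sketched in a few lines of PHP. Everything here is an assumption for illustration: the pid file path and the choice of SIGUSR1 are hypothetical, and the daemon that wrote the pid file would be built in a language with proper signal handling.

```php
// Hypothetical: a listening daemon has written its process ID to a pid file.
$pid = (int) trim(file_get_contents('/var/run/task-daemon.pid'));

// Send SIGUSR1 to tell the daemon a new task is ready. The daemon's
// signal handler forks the real work off so it can keep listening.
exec('kill -USR1 ' . escapeshellarg((string) $pid));
```

The cost per trigger is one short-lived `kill` process, which is the overhead mentioned above.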

A polling script

A polling script is another way to go, and it's a very simple, lightweight way to process tasks in the background. The script will need to be set up in such a way that it runs regularly enough to process the tasks in a timely manner, but not so often that it takes up too much processor time. A basic setup will look similar to the following:

$taskManager = new TaskManager();

while (!$taskManager->hasClosed()) {
    while ($task = $taskManager->getTask()) {
        // On a high-load server, or for a really long-running task, we should
        // pass this work onto another thread, possibly on a different node.
        someTask($task);
    }
    sleep(10);
}

The obvious problem with this approach is that as soon as we get a bunch of tasks pushed onto the task queue, it will take time to process the queue. A solution to this problem would be to fork, or pass the task onto a different node, but if you're thinking about that then this solution is probably not the one for you. If we take a closer look at the task manager, we may find we have a FIFO-type queue; the problem with this is that it probably isn't thread safe.

I'm going to briefly discuss a few techniques for achieving thread safety, because it ties in nicely with shared access to a resource, which all of the methods discussed so far involve.

Shared Memory and Mutual Exclusivity

While shared memory is one type of resource that needs to be protected in a multithreaded environment, any shared resource for which you cannot guarantee fully atomic read/write operations should be protected using some sort of mutual exclusion scheme. The use of a Mutex or a Semaphore can ensure that only one process (or, in the case of a Semaphore, only a set number of processes) can access the shared resource at a time. I'll give a brief overview of both, but I will not go into implementation specifics because they are OS dependent.

Mutex

A Mutex, simply put, is like a token that is passed between processes trying to gain access to a shared resource. A process must first gain control over the token before it can proceed to use the resource. While one process has control over the Mutex, no other process can use the resource.
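In PHP, this token-passing behaviour can be sketched with an advisory file lock: `flock` blocks until the lock is granted, which is effectively waiting for the token. The lock file path is a hypothetical example.

```php
// Minimal Mutex sketch using an advisory file lock.
// Only one process at a time can hold LOCK_EX on the same file.
$fp = fopen('/tmp/shared-resource.lock', 'c');

if (flock($fp, LOCK_EX)) {   // block until we hold the "token"
    // ... exclusive access to the shared resource goes here ...
    flock($fp, LOCK_UN);     // release the Mutex for the next process
}

fclose($fp);
```

Because the lock is advisory, every process that touches the resource must go through the same lock file for this to work.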

Semaphore

A Semaphore is very similar to a Mutex except for one major difference: a Semaphore has a counter built into it, so it can allow multiple processes to gain access to a shared resource. Once the counter has been exhausted it will not allow any more processes to gain access to the resource. The counter limit is specified at setup time.
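PHP exposes System V semaphores through the sysvsem extension; assuming that extension is available, the counter behaviour described above can be sketched like this (the limit of 3 is an arbitrary example):

```php
// Hypothetical sketch: allow at most 3 processes into the critical section.
$key = ftok(__FILE__, 's');   // derive a System V IPC key from this file
$sem = sem_get($key, 3);      // counter limit of 3, specified at setup time

if (sem_acquire($sem)) {      // blocks while all 3 slots are taken
    // ... up to three processes can be in here at once ...
    sem_release($sem);        // give the slot back
}
```

With a limit of 1 this behaves exactly like the Mutex above.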

In both cases, each process trying to gain access to the shared resource must block until the Mutex has been released, or until another process has released the Semaphore and incremented its counter, thus allowing one more process to gain access to the resource. The first process to start must set up the Mutex or Semaphore before any other process tries to gain access to the resource. When using any sort of mutual exclusion you should be aware of the potential for race conditions and deadlocks, where a process is locked out indefinitely.

Shared Memory

Shared memory is simply a block of memory that is shared by two or more processes, allowing them all to read from and write to the same block. I use it here because it is an obvious example of a shared resource that cannot safely be written to by two processes at the same time.
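PHP's shmop extension gives a feel for what this looks like in practice. In this sketch (key and size are arbitrary), the write and the read would normally happen in different processes, guarded by one of the schemes above.

```php
// Hypothetical sketch: a 1 KB shared memory block.
$key = ftok(__FILE__, 'm');
$shm = shmop_open($key, 'c', 0644, 1024);   // create (or attach to) the block

shmop_write($shm, 'written by process A', 0);
$data = shmop_read($shm, 0, 1024);          // another process could read this

shmop_delete($shm);                          // mark the block for removal
```

Nothing in shmop itself prevents concurrent writes, which is exactly why the Mutex or Semaphore is needed.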

Fortunately, thinking about mutual exclusion is not something we have to do very often, because many of the resources we use, such as databases, have already solved this problem and will not allow two processes to write to the same data at the same time, leaving us developers to concentrate on the more important task of creating great products.

Database Triggers

I won't go into much detail on database triggers; this is more of a side note. If all your background task does is run a simple SQL query, then why not create a database trigger that fires every time you insert a row, and leave the hard work up to the database?
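As a sketch, a MySQL trigger that keeps a summary table up to date on every insert might look like this. The table and column names are invented purely for illustration.

```sql
-- Hypothetical: maintain a per-user upload count without any PHP involvement.
CREATE TRIGGER after_upload_insert
AFTER INSERT ON uploads
FOR EACH ROW
  UPDATE user_stats
     SET upload_count = upload_count + 1
   WHERE user_id = NEW.user_id;
```

The work happens inside the database transaction, so there is no queue, no polling and no delay.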

So, what's with the title? I hear you ask. It's true, it's not really a black art, but it caught your attention, didn't it? Please note that the methods and techniques I've suggested in this post are my opinion and not necessarily the best or most efficient way to achieve background processing.

Please use the comments section below, it would be nice to have an active discussion on the topic.
