Lamorna Engine features a hand-rolled thread pool with a work queue that supports job dependencies.
A job consists of the following elements:
- a pointer to the function to be executed
- a pointer to a struct containing parameters to that function
- a dependency counter, indicating how many jobs it must wait on before running
- a permission list, indicating which jobs are dependent on this job’s completion
All jobs with a dependency count greater than zero are placed in a ‘pending’ queue. As jobs finish, they decrement the dependency counter of each job in their permission list. When a job’s dependency counter reaches zero, it can be added to the ‘ready’ queue.
The worker threads are created at start-up and promptly put to sleep until signalled by the master thread that jobs have been added to the ‘ready’ queue. A worker thread wakes inside its critical section, takes a job from the ‘ready’ queue, exits the critical section and executes the job. When done, it re-enters the critical section, adds the job to the ‘finished’ queue and selects another job, or sleeps if there are none.
It is the job of the master thread to keep the ‘ready’ queue full. It wakes when that queue is not full and processes the ‘finished’ queue, adding any jobs that now have zero dependencies to the ‘ready’ queue. When all jobs are complete, control returns to the master thread.
Naturally, a batch of jobs submitted to the thread pool must contain at least one job with zero dependencies to kick off the process.
The above is based on an approach described by Charles Bloom.
Using the above, I can characterise an entire game frame as a single large submission to the thread pool, expressed as a network of interdependent jobs. This certainly isn’t to say those dependencies are highly optimised; in fact, large sections of game logic still run serially, given how difficult such work is to parallelise, but it is all still expressed using the same scheme. Rendering, dynamic lightmap generation and collision processes are all highly parallelised.
Lamorna engine makes use of what I style a command buffer to aid parallelisation. At the end of each frame, the details necessary to render each visible model (position, scale, rotation matrices and so on) are written to a command buffer. This data is double buffered, so the renderer can work on the previous frame’s command data while the current frame’s is being populated. In addition to rendering data, sound and lighting data are also written down, so audio and lightmap generation can run concurrently as well.
The queues are implemented as ring buffers with virtualised indices as described in Fabien’s excellent article.
The number of worker threads is set to match the number of processors on the machine.
Improvements: I recently became aware of lock-free synchronisation after reading about it in the text ‘Concurrent Programming on Windows’, and following that up by watching Casey Muratori’s thread work on Handmade Hero. I’d like to explore it as an alternative to critical-section entry, but in truth I’m not sure how much difference it would make. From what I understand, a thread attempting to enter a critical section will spin-wait for a while anyway before being switched out. The time each worker thread spends in the critical section should be minimal, since it just writes down its completed job and picks up another.
Till next time!