GenesisEngine: The Task Parallel Library Is Great But Threading Can Still Bite You

(I originally posted this on my MSDN blog.)

The new Task Parallel Library in the .Net Framework 4.0 is intended to simplify the process of adding concurrency to your application and it does a great job of it.  I’m really impressed by how well it hides the mechanics of threading and lets you reason declaratively about what you want to do, not how it needs to be done.

My real-time engine, um . . . wasn’t

In GenesisEngine, I had a serious need to do some background processing.  As you move around in the world the engine dynamically generates appropriately-detailed quad nodes based on where you are, and generating the terrain meshes for those nodes takes a lot of CPU work.  Previously I was splitting nodes into more detailed ones on the main game loop thread during the Update pass and it caused a lot of stuttering in the frame rate.  In fact, if you were close the surface you basically couldn’t move in real time; you had to turn updates off, move the camera, turn updates back on, and wait for the detail nodes to be generated for your new location.  Yeah, it pretty much sucked.

Clearly it would be desirable to have all of that expensive node splitting and mesh generation happen on a separate thread.  First, my laptop has two CPU cores and one of them was just sitting there idle.  My desktop machine at work has four cores so three of them were twiddling their thumbs.  The future is multi-core so let’s get on that train; ok, we need concurrency.

Second, I’d like the the game engine to continue to draw smoothly with whatever quad nodes it has at the moment and then later pick up more detailed nodes as they become available.  I don’t want the draw loop to be delayed at all by waiting for nodes to be split.  That means that not only does the node splitting need to be concurrent but it also needs to be asynchronous.

The Task Parallel Library to the rescue

Ok, so that didn’t sound like fun.  I’ve written a decent amount of threading code from scratch in my time and while the principles are fairly straightforward the details can be an absolute killer.  I wasn’t looking forward to doing it again in GenesisEngine so, being a properly lazy developer, I decided to wait until I could use the Task Parallel Library to do it for me.  Once Visual Studio 2010 was out and the XNA 4.0 CTP was released, I had all the pieces in place to fix my problem.  Many excellent articles have already been written on how to use the TPL so I won’t bother with a general tutorial; I’ll just present a walkthrough of how I used it to solve a specific problem.

First, let’s look at the code I was trying to make concurrent and asynchronous:

private void Split()
{
    var subextents = _extents.Split();
    foreach (var subextent in subextents)
    {
        var node = _quadNodeFactory.Create();
        node.Initialize(_planetRadius, _planeNormalVector, _uVector, _vVector,
                        subextent, Level + 1);
        _subnodes.Add(node);
    }
    _hasSubnodes = true;
}

The Split() method is pretty straightforward:

  1. Divide the extents of the node into four smaller extents.
  2. Creates new child nodes for each sub-extent.
  3. Add the new nodes to the list of children.
  4. Set a flag indicating that it now has children.

The strategy for the new implementation using the Task Parallel Library is pretty similar:

  1. Divide the extents of the node into four smaller extents.
  2. Create four tasks to run asynchronously and in parallel.  Each task will create one child node.
  3. Create another task that will run when all of the first four tasks have completed.  This task will add the new nodes to the list of children and set flags indicating that that split is finished and the node now has children.
private void Split(DoubleVector3 cameraLocation, DoubleVector3 planetLocation)
{
    var tasks = CreateBackgroundSplitTasks(cameraLocation, planetLocation);
    CreateSplitCompletionTask(tasks);
}

List<Task> CreateBackgroundSplitTasks(DoubleVector3 cameraLocation, DoubleVector3 planetLocation)
{
    _splitInProgress = true;
    var subextents = _extents.Split();
    var tasks = new List<Task>();
    foreach (var subextent in subextents)
    {
        var capturedExtent = subextent;
        var task = Task.Factory.StartNew(() =>
        {
            var node = _quadNodeFactory.Create();
            node.Initialize(_planetRadius, _planeNormalVector, _uVector, _vVector,
                            capturedExtent, Level + 1);
            node.Update(cameraLocation, planetLocation);
            return node;
        });
        tasks.Add(task);
    }
    return tasks;
}

void CreateSplitCompletionTask(List<Task> tasks)
{
    _splitCompletionTask = Task.Factory.ContinueWhenAll(tasks.ToArray(),
    finishedTasks =>
    {
        foreach (var task in finishedTasks)
        {
            _subnodes.Add(task.Result);
        }
        _hasSubnodes = true;
        _splitInProgress = false;
    });
}

The second implementation clearly has more code but considering what’s being accomplished here, it’s remarkably compact and elegant.  In CreateBackgroundSplitTasks(), we first set a flag indicating that a split is in progress.  This is so that when the main game thread does the next Update pass it won’t try to split the node again.  Next, we use the StartNew() method of the TPL task factory to create and run tasks that create new nodes.  The work that the task performs is defined in a lambda expression.  Each task starts running immediately and independently in the background and will return a reference to the newly-created node when it’s done.  Finally, CreateBackgroundSplitTasks() returns a list of all of the tasks it started.

How does the node know when all of its children are created and it’s safe to draw them?  For that we create a fifth task that is dependent on all of the previous four.  In CreateSplitCompletionTask() we use the very handy ContinueWhenAll() method of the TPL task factory to create another task that will only run after all of our node-creation tasks have completed.  This task puts all of the created nodes into the child list and sets some flags indicating that the split is finished and that the node now has children.

The great thing about the TPL is that nowhere in this code did I need to think about which threads are going to run which blocks of code, or how many threads I should start up, or even the mechanics of how the different threads should signal to each other.  How many cores do I have on this machine?  Don’t know, don’t care.  It’s all very declarative in nature.  It’s just, “Go away and do these four things in parallel as fast as you can, and when they’re all done, do this fifth thing.”  That’s it.

But!

You still have to think about the consequences

Using the TPL does NOT give you a get-out-of-jail-free card.  The TPL lets you elegantly express your good concurrency intentions but, as they say, the road to hell is paved with just such.

If you read both versions of Split() above you’ll notice that the TPL version does something that the single-threaded, synchronous version did not: after a node is created, we immediately call Update() on the node.  It took me quite a while to infer the need for that.

When I first wrote the concurrent implementation it seemed to run and generate child nodes ok but there was occasionally a strange artifact where a recently split node wouldn’t be drawn correctly at first, leaving a black hole where it was supposed to be.  The artifact didn’t happen for all new nodes, only a few, and it only lasted for one frame so it was difficult to see exactly what was going on.  I eventually deduced the problem by carefully reasoning about the possible interleaving of operations in the parallel implementation.

Due to the way the synchronous implementation was written, newly-created nodes were guaranteed to be updated before they were drawn.  This didn’t happen in Split() but happened later in the update pass so it was kind of an implicit contract.  In the parallel implementation, on the other hand, there was no particular guarantee that a new node would be updated before being drawn the first time.  It turned out that due to timing the background split operation would usually complete between the draw and update passes so it would get updated in the next pass and then drawn, and everything was fine.  Sometimes, however, the background split would complete after an update pass but before a draw pass, and thus would be drawn without ever having been updated.  The code didn’t crash in this case (it would have been easier to debug if it had), but it didn’t draw as expected either.

The fix was to ensure that each node gets both initialized and updated in the background task before declaring the split operation complete.  The new nodes are updated with information that is slightly out of date but that doesn’t seem to matter in practice.

This was one of those tricky concurrency bugs that seems pretty trivial when I describe it in hindsight but it was pretty baffling when I first saw it.  The lesson: the TPL lets you easily implement concurrency but it doesn’t (and can’t) free you from the responsibility of understanding all the myriad of potential race conditions, deadlocks, and other assorted multi-threading nastiness.  Sorry, even if you have a nifty library to do the heavy lifting, you still need to understand what it’s doing under the covers and how it will impact your code.

Concurrency changes contracts

The asynchronous implementation created another problem I had to solve.  Some of my unit tests started failing.  It turned out this was because the tests were causing a node to split and then testing that the correct things happened (child nodes created, etc).  The tests were originally written with the assumption that once the call to Split() returned, the post-conditions could be immediately asserted, but with the async implementation that was no longer true.

The fix here was to capture a reference to the split completion task and make it available to the unit tests so that they could initiate a Split then explicitly wait for the completion task to finish.  This necessitated a bit of code in the QuadNode class that serves no other purpose than to aid unit testing but I don’t see any way around it.  The contract of Split() changed when I changed its implementation.  This aspect of the contract wasn’t expressed in the method signature but it changed all the same and required a corresponding change in the tests.  This just emphasizes the point that you have to think about the consequences every time you pull out the TPL.

The TPL is goodness

The TPL is a very fine piece of work and it makes implementing concurrency and parallelism dramatically easier.  Sure, I still ran into a few hiccups but that’s just the nature of the problem space.  I was very pleased with the result after I was finished; I can now move around in GenesisEngine and the frame rate stays pretty smooth at all times.  The splitting and merging of nodes happens in the background and will eventually catch up with my movements.  The engine scales very well with additional cores; on a quad-core processor it’s about three times as fast at generating new terrain as it is on a dual-core.  (The main game thread runs as fast as possible right now and takes up most of one core, leaving one core available for terrain generation on a dual-core processor and three cores available on a quad-core.)  The right thing just happens magically.

There’s still a lot of work to do to make my implementation less naive.  For instance, if a node split is scheduled in the background but the camera moves out of range before it happens, the split request should be discarded without actually doing it.  Also, priority should be given to nodes that are currently in the viewing frustum so the user sees the maximum amount of detail at all times.  The TPL isn’t quite magical enough to do all that for me but I expect it’ll give me the tools to do it in a compact and elegant way.