Indiana Jones And The Temple Of Software Doom

(I originally posted this on my MSDN blog.)

I want to expand a bit more on something I mentioned in my last post about not piling new code on top of old code.

At various times in my career I’ve had the dubious pleasure of inheriting applications that have gone through several different owners over several years.  It’s fascinating how you can, with a lot of effort, reconstruct sequences of events and long-ago circumstances that have been altogether forgotten.  After awhile you can tell who wrote a piece of code, and when, and sometimes why, at a glance.  It’s like peeling back layer upon layer of detritus to reveal mysterious clues and ancient traps.  Seriously, sometimes I feel like I should put on an Indiana Jones hat and start exclaiming, “Now, here’s an excellent specimen from the early ‘Mike’ dynasty!  Watch out, it’ll kill you if you touch it!”

I like to call this experience ‘software archeology” or “forensic development”.  (Aside: I came up with both of those terms on my own, thank you very much, but a quick web search shows that I’m not the first for either of them.  Sigh.)  On the whole, it’s an activity I’d rather not engage in if I can avoid it, though it does have a certain charm when you finally figure out the keys to the puzzles.

How’d those code bases get to be in that state, anyway?  Mostly because the previous developers didn’t take the time to understand the code they inherited, I suppose, and just bolted on new code on top of the old code with an absolute minimum of changes.  That technique seems to make sense in the short term.  After all, no one wants to break working code, right, and if we don’t understand it, then it’s best to avoid touching it as much as possible.  However, in the long term that idea usually ends up creating a byzantine edifice of code that’s very difficult to work with.  You haven’t avoided any work or risk; you’ve merely deferred it to the next guy.

I once inherited an application that uploaded crash dumps from game consoles to a server for processing.  The code that actually sent the files from the console and received them on the server was ridiculously complex and I had a really hard time understanding it.  I peeled back layer after layer of seemingly useless code over several days until I finally managed to reconstruct the entire sequence of events that led to the current state of things; then just shook my head and laughed.  I’m recalling this from memory so I might get a few details wrong, but here’s what I discovered.

You see, originally the application just uploaded a single crash dump file to the server.  When the server noticed that a file had been uploaded, it would process it.  Very straightforward.

Later, a new requirement was added for uploading additional supporting files (like logs) in addition to the crash dump.  Hmm, this added quite a bit of complexity because now uploading isn’t an atomic action any more.  As we upload one file at a time, the server needs to know whether it has all the files and can begin processing or whether it needs to wait for more files to be uploaded.  Someone added a bunch of code that on the client generated a manifest file listing all of the files in the upload, and on the server tracking newly-initiated uploads, partially-finished uploads, and completed uploads.  On the server side this involved spinning up a new thread for each upload, writing status information to the database for each state, and all kinds of things.  There was logic to detect uploads that were abandoned halfway through as well as to deal with unexpected server restarts.

Everything was fine for awhile until a new requirement came along that the application should be able to upload crash dumps from external sites across the internet.  Game companies are usually quite secretive and paranoid about their unreleased projects so the uploaded files needed to be encrypted in transit to avoid leaking confidential information.  Someone whipped up a homebrew encryption algorithm (which later proved to be horribly broken, as virtually all homebrew encryption algorithms are) and added code to encrypt each file before transit and decrypt it on the server.

Fast-forward again and we find that people got impatient waiting for all of these files to be uploaded.  Crash dumps can be quite large, of course, so wouldn’t it be great if we could compress them before uploading them?  Sounds good!  And so code was added to take all of the upload files on the client (including the manifest), generate a zip file containing them, and unpack the zip file (using custom-written code) on the server.

At each point in the story, as far as I can tell, the developer took pains to not change any existing code but rather just add new code as a layer around the old code with minimal, or sometimes even no changes to the existing system.  That sounds good, right?  It’s a time-honored engineering technique.  Heck, it’s even got its own pattern in the GoF book: Decorator.  So what’s wrong?

Well, think about the total behavior of the system that I inherited.  It took a bunch of loose files, cataloged them and generated a manifest listing them all, then encrypted each individual file.  It then took all the encrypted files and put them in a zip file, which it uploaded to the server.  On the server side, it first unpacked the zip file into a directory, wrote some info to the database, spun up a thread, and pointed the thread to the folder.  The thread unencrypted each file, opened the manifest, and started verifying the existence of each file listed in the manifest, writing information to the database as it did so.  Once it figured out that all the files were there, it handed the folder name off to the processing subsystem.

Once my team and I had reverse-engineered all the different layers and figured out what each one did, we realized that we could chop out most of the code.  The manifest system that was built to deal with multi-file uploads was completely unnecessary because we weren’t uploading multiple files any more; we were just uploading a single zip file as an atomic action just like in the original design.  We ripped the whole thing out which drastically simplified the code.  No more tracking state in the database, no more multithreading.

We then ripped out the homebrewed encryption system and used standard AES encryption as supported by standard zip libraries.  This further simplified the code plus allowed us to easily inspect and create encrypted packages with a zip utility for testing and debugging purposes.  Plus the encryption actually, you know, worked.

The final design was that the client took all the files to be uploaded and put them into a standard encrypted zip file which it uploaded to the server.  When the server received the zip file, it unpacked it using a third-party zip library and handed the files off to the processing sub-system.  End of story.

Ok, so all of that is a very long-winded way to make my point: when modifying an existing application, don’t just bolt on new code on top of the old code in order to avoid changes.  If you do, you’ll end up with a Rube Goldberg system that makes absolutely no sense.  Instead, think about the old requirements, the new requirements, and actually change the existing code to reflect the new requirements.  Sure, it’s more work up front, and more risk if you don’t have a unit test safety net, but your code base will wind up much smaller and much more maintainable in the long run.

That old code that you’re trying so hard to preserve untouched?  It’s not an asset, it’s a liability.  Get rid of it.

One thought on “Indiana Jones And The Temple Of Software Doom”

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: