I was recently reminded of something I learned many years ago before coming to Ticketmaster from people much smarter and more experienced than myself. Back then I was pushing to introduce a set of third-party libraries to help lay the groundwork for a replacement for our flagship product, a mainframe based mail and groupware system. The logic, I thought, was flawless: The libraries would give us cross-platform support for a number of key functional areas including network communication, database access for many different database systems, file system, threading, you name it. Writing cross-platform code is pretty straight-forward until you have to touch the metal, and then it can be…challenging. Why re-invent the wheel, I thought, when somebody else had already invented some very nice wheels?
The company selling the libraries – yes, there was a time before Github and the explosion of open source libraries – was successful, well respected, produced quality libraries and offered great support. I did my research, readied my arguments and presented it all to management and senior developers. They were, in a word, underwhelmed. When I asked why they didn’t think it was a good idea I got the simple answer, “We’ve had nothing but bad experiences with these types of things”.
I was disappointed but there was a lot of work to do so I just let it go. But it did stick with me. I mean, why would seemingly smart and experienced developers turn their noses up at re-usable components solving common problems? Over the years however I started to understand their reluctance. Nothing truly catastrophic, mind you, just a lot of time spent wrestling with the devil in the details. And that is what I was reminded of the other day at Ticketmaster.
A Simple Job
The job seemed simple enough: Upgrade several open source components we use, all from the same group, from version 2.5 to 2.6. Certainly there couldn’t be any major changes, and the previous upgrade went smoothly enough. What could possibly go wrong? So we upgraded the components, ran the tests and BLAM, the first sign of trouble: a bunch of our tests were broken. Well, not just the tests. Our app was broken. In the end, it took a couple of people a couple of days to work through all of the issues discovered. And while QA always intended to perform a smoke test after the upgrade, testing was much more extensive than planned because of the issues during the upgrade.
This story would have ended happily enough except our app, a web-based e-commerce site, came out in production and BLAM, two showstopper bugs that required a rollback and immediate fixes. And both could be tied directly to changes in the third-party components we had just upgraded. This is not to say that it was bugs in the components that caused the problem. Rather, changes in the component code combined with our existing or new code lead to unintended, and more importantly, undetected side effects.
The Devil IS the Details
In the one case, the behavior of one component method had changed. Combined with some new, and seemingly unrelated changes in our code, the side effect showed itself in a very specific scenario with the result that a large group of site visitors would be unable to buy things on the site without first encountering an error. In the second case, a deprecated method for initializing a widely used component had to be updated to use newer and less clear methods. In this case, we simply implemented the new method wrong with a small, but very important side effect: we were passing the proxy server ip address to backend systems instead of the client’s ip address where the actual client ip address is an important part of the anti-fraud system.
So what’s the lesson of all this? Well some would say more tests are the answer. And they’d be wrong. In the first case, the error appeared in only one very specific scenario with a very specific set of pre-conditions. It was triggered by changes in our code, which we knew about, that interacted with changes in the behavior of the third party component, which we didn’t know about. Couple this with the previously unknown set of preconditions to trigger the error and you see that nobody could have foreseen the potential error and written a test to cover it.
In the second case, where we implemented the new method incorrectly, we had a test covering it. The problem was that the test was wrong. And this was owing to a tiny detail in the implementation of one of the component’s internal methods. And for the test-first proponents out there, yes, the test failed, was implemented and then passed. Problem is it was a bug in the test that made it pass.
To me the lesson is pretty simple: Think long and hard before pulling third-party stuff into your code-base. Don’t be blinded by “how easy things are to integrate” or “look at all the cool stuff we get” or even “everybody else is using it”. You really need to understand what you are getting yourself into and have a solid plan for how to maintain what has now become part of your code base. Because in the end, this is technical debt that you will be living with for quite a while.