Third-Party Components – Hidden Technical Debt

I was recently reminded of something I learned many years ago, before coming to Ticketmaster, from people much smarter and more experienced than myself. Back then I was pushing to introduce a set of third-party libraries to help lay the groundwork for a replacement for our flagship product, a mainframe-based mail and groupware system. The logic, I thought, was flawless: the libraries would give us cross-platform support for a number of key functional areas including network communication, database access for many different database systems, file system access, threading, you name it. Writing cross-platform code is pretty straightforward until you have to touch the metal, and then it can be…challenging. Why re-invent the wheel, I thought, when somebody else had already invented some very nice wheels?

The company selling the libraries – yes, there was a time before Github and the explosion of open source libraries – was successful, well respected, produced quality libraries and offered great support. I did my research, readied my arguments and presented it all to management and senior developers. They were, in a word, underwhelmed. When I asked why they didn’t think it was a good idea I got the simple answer, “We’ve had nothing but bad experiences with these types of things”.

I was disappointed but there was a lot of work to do so I just let it go. But it did stick with me. I mean, why would seemingly smart and experienced developers turn their noses up at re-usable components solving common problems? Over the years, however, as I accumulated my own experiences with third-party components, I started to understand their reluctance. Nothing truly catastrophic, mind you, just a lot of time spent wrestling with the devil in the details. And that is what I was reminded of the other day at Ticketmaster.

A Simple Job

The job seemed simple enough: upgrade several open source components we use, all from the same group, from version 2.5 to 2.6. Surely there couldn’t be any major changes, and the previous upgrade had gone smoothly enough. What could possibly go wrong? So we upgraded the components, ran the tests and BLAM, the first sign of trouble: a bunch of our tests were broken. Well, not just the tests. Our app was broken. In the end, it took a couple of people a couple of days to work through all of the issues discovered. And while QA had always intended to perform a smoke test after the upgrade, testing ended up being much more extensive than planned because of the issues found during the upgrade.

This story would have ended happily enough except our app, a web-based e-commerce site, went out to production and BLAM, two showstopper bugs that required a rollback and immediate fixes. And both could be tied directly to changes in the third-party components we had just upgraded. This is not to say that it was bugs in the components that caused the problem. Rather, changes in the component code combined with our existing or new code led to unintended, and more importantly, undetected side effects.

The Devil IS the Details

In the first case, the behavior of one component method had changed. Combined with some new, and seemingly unrelated, changes in our code, the side effect showed itself in a very specific scenario, with the result that a large group of site visitors would be unable to buy things on the site without first encountering an error. In the second case, a deprecated method for initializing a widely used component had to be replaced with newer and less clear methods. Here we simply implemented the new method wrong, with a small but very important side effect: we were passing the proxy server’s IP address to backend systems instead of the client’s IP address, and the actual client IP address is an important part of the anti-fraud system.
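
To make the second problem concrete, here is a minimal sketch of the kind of logic involved. It assumes a Java servlet stack behind a single trusted reverse proxy; the class and method names are hypothetical and are not the component API we actually upgraded.

    import javax.servlet.http.HttpServletRequest;

    // Illustrative only: one common way to recover the real client IP when the
    // application sits behind a reverse proxy. Hypothetical names, not the
    // actual component we use.
    public final class ClientIpResolver {

        private ClientIpResolver() { }

        public static String resolve(HttpServletRequest request) {
            // A reverse proxy typically appends the original client address to
            // X-Forwarded-For, e.g. "203.0.113.7, 10.0.0.5"; the left-most
            // entry is the client.
            String forwarded = request.getHeader("X-Forwarded-For");
            if (forwarded != null && !forwarded.isEmpty()) {
                return forwarded.split(",")[0].trim();
            }
            // No proxy header present: fall back to the socket peer address,
            // which is the proxy itself when one is in the path.
            return request.getRemoteAddr();
        }
    }

Get this wrong, as we did during the upgrade, and every backend system sees the proxy’s address instead of the visitor’s.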

So what’s the lesson of all this? Well, some would say more tests are the answer. And they’d be wrong. In the first case, the error appeared in only one very specific scenario with a very specific set of pre-conditions. It was triggered by changes in our code, which we knew about, interacting with changes in the behavior of the third-party component, which we didn’t know about. Couple this with the previously unknown set of preconditions needed to trigger the error and you see that nobody could have foreseen the potential error and written a test to cover it.

In the second case, where we implemented the new method incorrectly, we had a test covering it. The problem was that the test was wrong. And this was owing to a tiny detail in the implementation of one of the component’s internal methods. And for the test-first proponents out there, yes, the test was written first and failed, the code was implemented and then the test passed. The problem is that it was a bug in the test that made it pass.
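
As a hypothetical illustration (this is not our actual test, and the addresses are made up), here is how a test can encode the same wrong assumption as the code it is supposed to check, and therefore pass:

    import static org.junit.Assert.assertEquals;
    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.when;

    import javax.servlet.http.HttpServletRequest;
    import org.junit.Test;

    public class ClientIpTest {

        // Simplified stand-in for the buggy initialization: behind a proxy,
        // getRemoteAddr() returns the proxy's address, not the client's.
        private static String buggyClientIp(HttpServletRequest request) {
            return request.getRemoteAddr();
        }

        @Test
        public void sendsClientIpToBackend() {
            HttpServletRequest request = mock(HttpServletRequest.class);
            when(request.getRemoteAddr()).thenReturn("10.0.0.5");                 // the proxy
            when(request.getHeader("X-Forwarded-For")).thenReturn("203.0.113.7"); // the client

            // The expected value was also taken from getRemoteAddr(), so the
            // test is green even though the behavior is wrong. A correct test
            // would expect "203.0.113.7".
            assertEquals("10.0.0.5", buggyClientIp(request));
        }
    }

The test and the code agree with each other, just not with reality, and that is exactly the kind of detail that only shows up once real traffic flows through a real proxy.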

To me the lesson is pretty simple: think long and hard before pulling third-party stuff into your codebase. Don’t be blinded by “how easy things are to integrate” or “look at all the cool stuff we get” or even “everybody else is using it”. You really need to understand what you are getting yourself into and have a solid plan for how to maintain what has now become part of your codebase. Because in the end, this is technical debt that you will be living with for quite a while.

Implementing a DevOps Strategy across multiple locations & product teams

Over the last 18 months, a change has begun within the Ticketmaster International Team. Barriers are being broken down between the engineering and operational teams, our different product delivery teams are being aligned, and knowledge sharing across teams is happening more and more. What’s changed? We developed a strategy based around DevOps to create a leaner, higher-performing organisation, and our journey is underway.

As with many large, mature international companies, our organisation is probably not unique: our product delivery and TechOps teams are distributed across 5 geographical locations – Belgrade (Serbia), Gothenburg (Sweden), London (UK), Quebec (Canada) and Stoke (UK). Across these teams we manage about 15 different platforms. Our challenge was to create a DevOps strategy and implement change in a flexible manner across all delivery teams.

As with any distributed organisation, we have suffered from communication barriers, although tools such as Skype, Slack and Zoom are all helping to break them down. However, more fundamental issues existed, such as differing terminology, multiple tools being used for the same job, differences in skills and abilities between locations, and silos. A further example of a silo was our TechOps team being a separate, centralised group with different goals to the engineering teams. When different groups that need to work together are not aligned and have different goals, this can cause friction. In this case, because of the way we were organised, the multiple concurrent requests coming into TechOps from the various engineering teams made it difficult for them to service all teams at the same time, which caused delays.

The differences in tooling and processes have created a barrier that slows us all down. We needed a new approach and developing a DevOps strategy has been one of the answers for us.

Our DevOps Strategy

In developing our DevOps strategy we wanted all teams to speak the same language and have a shared understanding and shared skills. We wanted to break down the silos that had been built over time, bringing teams closer together and aligning resources to delivering products, so that we can be more agile and nimble, developing and releasing high-quality products quickly, efficiently and reliably. Echoing the Agile manifesto principles:

Our highest priority is to satisfy the customer through early and continuous delivery of valuable software – Principle #1

Deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale – Principle #3

Coalescing our ambitions and desires, and mindful of the Agile manifesto principles, we defined four main objectives for our DevOps strategy:

  • Maximise for delivering business value.
  • Maximise the efficiency and quality of the development process.
  • Maximise the reliability of applications and environments.
  • Maximise for service delivery.

With these objectives we started to define requirements to achieve them. Quickly we ran into a mountain of requirements and with that a prioritisation nightmare: how to prioritise the requirements across 5 global locations and 15+ delivery teams, each with different needs.

The Maturity Models

After several rounds of attempting to prioritise in a sensible way, we began to arrange the requirements into themes, and with that a Maturity Model evolved: one maturity model for each objective.

Maximise for delivering business value. This goal is centred on continuous delivery, creating fast runways down which we can launch our applications.

[Image: devops-strategy-image00 – continuous delivery maturity model]

Maximise the efficiency and quality of the development process. This goal is centred on continuous integration, creating the environment to launch a battery of automated tests and gain fast feedback to be able to evolve code.

[Image: devops-strategy-image01 – continuous integration maturity model]

Maximise the reliability of applications and environments. This goal is centred on instrumentation, creating the visibility into the inner workings of our applications for root cause analysis and fault tolerance.

[Image: devops-strategy-image02 – instrumentation and reliability maturity model]

Maximise for service delivery. This goal is centred on organisational change, creating alignment of cross-functional teams responsible for delivering software.

[Image: devops-strategy-image03 – service delivery maturity model]

The Maturity Models are great; they provide a vision of what our strategy is. By defining the steps required to progress to advanced levels of DevOps, we can set long-term and short-term targets for the different themes or levels to be reached. They’re also modular, so we can change the strategy if improved technology or processes become apparent, and fill in gaps where they exist.

Flexible Planning

The nice thing about the maturity models is the flexibility they provide. They are maps that can guide you from a low maturity to a high maturity level of DevOps. If you imagine using a map to plan a route from A to B, the route you choose depends on various conditions, such as the day of the week, the time of day, likely traffic density, road speeds, road works and so on; the route chosen will be the most appropriate for the circumstances.

[Image: devops-strategy-image04 – route planning analogy: alternative routes from A to B]

The DevOps maturity models are true roadmaps, as opposed to a linear list of requirements, allowing each individual delivery team to navigate their own path depending on their context: what is most important to them, or what concerns they have, at any point in time. Furthering this flexibility, the Maturity Models allow teams to change their routes and reprioritise their plans in concert with business changes and needs.

When individual teams select and complete a portion of the maturity model that no other team has yet reached, there is an additional benefit: the problems solved by those teams can be shared with all other teams, allowing them to complete the same work faster and avoid the pitfalls already discovered by the early-adopting team.

Even though all product delivery teams have the flexibility to select their own routes to achieving our DevOps objectives, ultimately everyone ends up at the same location. So the maturity models enable various programs of work to be planned across different teams with very different needs and abilities.

Standardisation

As good as our maturity models are, they weren’t able to solve a couple of issues that still existed: we were using multiple tools to do the same jobs, and we spoke different languages because we used different terminology for the same things. To solve this, prior to kicking off our strategy, we set up focused working groups to define and agree a set of standards for tooling, definitions of terms (e.g. naming conventions), best practices (e.g. code reviews) and core specifications (e.g. Logging, Heartbeats & Health checks).

Our Core Tooling

  • Git – Source Control
  • GitLab – Git Management & Code Reviews
  • Jenkins – Application Builds
  • SonarQube – Code Quality Reporting
  • Sonatype Nexus – Package Management
  • Rundeck – Operational Support
  • Octopus Deploy – Deployment (Windows only)
  • Chef – Configuration Management

Standardising our tooling and our specifications for implementing instrumentation meant we could reduce support overheads, share knowledge and solve problems once. Guidelines and best practices meant we were working in the same ways and all had a shared understanding. Definitions of terms meant we could all speak the same language and avoid confusion.
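
As an illustration of what a core instrumentation specification can look like in practice, here is a minimal heartbeat/health-check sketch in Java using the JDK’s built-in HTTP server. The path, port and payload are hypothetical; our actual specifications define their own conventions.

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class HealthCheckServer {

        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

            // A lightweight heartbeat: proves the process is up and answering HTTP.
            server.createContext("/heartbeat", exchange -> {
                byte[] body = "OK".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });

            // A deeper health check would also probe dependencies (database,
            // downstream services) and return a non-200 status on failure.
            server.start();
        }
    }

The value is less in the code than in every team exposing the same endpoints in the same shape, so monitoring, deployment checks and support runbooks work identically across all of our platforms.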

With the maturity models and standards we have created a shared vision and enabled flexibility for each product delivery team to plan what they want to work on. We have created a framework that enables all product delivery teams to start working towards the DevOps objectives in parallel, while focusing on what’s important to their needs at any given point in time.