Third-Party Components – Hidden Technical Debt

I was recently reminded of something I learned many years ago, before coming to Ticketmaster, from people much smarter and more experienced than myself. Back then I was pushing to introduce a set of third-party libraries to help lay the groundwork for a replacement for our flagship product, a mainframe-based mail and groupware system. The logic, I thought, was flawless: the libraries would give us cross-platform support for a number of key functional areas including network communication, database access for many different database systems, file system access, threading, you name it. Writing cross-platform code is pretty straightforward until you have to touch the metal, and then it can be…challenging. Why re-invent the wheel, I thought, when somebody else had already invented some very nice wheels?

The company selling the libraries – yes, there was a time before GitHub and the explosion of open source libraries – was successful, well respected, produced quality libraries and offered great support. I did my research, readied my arguments and presented it all to management and senior developers. They were, in a word, underwhelmed. When I asked why they didn’t think it was a good idea, I got the simple answer, “We’ve had nothing but bad experiences with these types of things”.

I was disappointed, but there was a lot of work to do so I just let it go. But it did stick with me. I mean, why would seemingly smart and experienced developers turn their noses up at re-usable components solving common problems? Over the years, however, as I accumulated my own experiences with third-party components, I started to understand their reluctance. Nothing truly catastrophic, mind you, just a lot of time spent wrestling with the devil in the details. And that is what I was reminded of the other day at Ticketmaster.

A Simple Job

The job seemed simple enough: Upgrade several open source components we use, all from the same group, from version 2.5 to 2.6. Certainly there couldn’t be any major changes, and the previous upgrade went smoothly enough. What could possibly go wrong? So we upgraded the components, ran the tests and BLAM, the first sign of trouble: a bunch of our tests were broken. Well, not just the tests. Our app was broken. In the end, it took a couple of people a couple of days to work through all of the issues discovered. And while QA always intended to perform a smoke test after the upgrade, testing was much more extensive than planned because of the issues during the upgrade.

This story would have ended happily enough except that our app, a web-based e-commerce site, went out to production and BLAM, two showstopper bugs that required a rollback and immediate fixes. And both could be tied directly to changes in the third-party components we had just upgraded. This is not to say that it was bugs in the components that caused the problem. Rather, changes in the component code combined with our existing or new code led to unintended, and more importantly, undetected side effects.

The Devil IS the Details

In the first case, the behavior of one component method had changed. Combined with some new, and seemingly unrelated, changes in our code, the side effect showed itself in a very specific scenario, with the result that a large group of site visitors would be unable to buy things on the site without first encountering an error. In the second case, a deprecated method for initializing a widely used component had to be updated to use newer and less clear methods. Here we simply implemented the new method incorrectly, with a small but very important side effect: we were passing the proxy server’s IP address to backend systems instead of the client’s, and the actual client IP address is an important part of the anti-fraud system.
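To make that second failure mode concrete, here is a minimal, purely illustrative Java servlet sketch (the class name and header handling are assumptions, not our actual component or configuration) of the distinction that bit us: behind a reverse proxy, the direct peer address belongs to the proxy, while the originating client address typically has to be read from a forwarding header.

    import javax.servlet.http.HttpServletRequest;

    public final class ClientIpResolver {

        // Behind a reverse proxy, getRemoteAddr() returns the proxy's address.
        // The originating client usually arrives in X-Forwarded-For as
        // "client, proxy1, proxy2"; the left-most entry is the client.
        public static String resolveClientIp(HttpServletRequest request) {
            String forwardedFor = request.getHeader("X-Forwarded-For");
            if (forwardedFor != null && !forwardedFor.trim().isEmpty()) {
                return forwardedFor.split(",")[0].trim();
            }
            // No forwarding header: the direct peer really is the client.
            return request.getRemoteAddr();
        }
    }

Which header to trust, and whether to trust it at all, depends on your proxy setup; the point is simply that the two values are easy to confuse and only one of them is useful to an anti-fraud system.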

So what’s the lesson of all this? Well, some would say more tests are the answer. And they’d be wrong. In the first case, the error appeared in only one very specific scenario with a very specific set of pre-conditions. It was triggered by changes in our code, which we knew about, interacting with changes in the behavior of the third-party component, which we didn’t know about. Couple this with the previously unknown set of preconditions needed to trigger the error and you see that nobody could have foreseen the potential error and written a test to cover it.

In the second case, where we implemented the new method incorrectly, we had a test covering it. The problem was that the test was wrong. And this was owing to a tiny detail in the implementation of one of the component’s internal methods. And for the test-first proponents out there: yes, the test was written first, it failed, the code was implemented and then the test passed. The problem is that it was a bug in the test that made it pass.

To me the lesson is pretty simple: Think long and hard before pulling third-party stuff into your code-base. Don’t be blinded by “how easy things are to integrate” or “look at all the cool stuff we get” or even “everybody else is using it”. You really need to understand what you are getting yourself into and have a solid plan for how to maintain what has now become part of your code base. Because in the end, this is technical debt that you will be living with for quite a while.

Using Multiple Languages in Your Development Environment

Many modern software engineers work on multiple projects over the course of their career, each with different requirements.  These requirements often cause us to consider different tools and even languages to get our work done faster and more efficiently.  The Commerce Order Processing and Order Details team took a different approach, integrating multiple languages into its development environment with Java and the JVM as a base.

The Transition to Groovy

Java is a great language.  It’s well supported, it’s standard in many large corporate enterprises, and it’s taught in virtually every CS/CE program, so you have a large pool of talent to draw from.  It allows you to create highly structured programs and supports code reuse to a large extent.

But sometimes Java is frustratingly inflexible.  Some code that should be dead simple becomes very complicated because the language forces you down a particular philosophy.  Worse, experienced Java developers tend to create designs that encourage complication and lots of structure.  For a team that wants to be more flexible and agile, this is not good.  Java 8 has introduced some mechanisms designed to address this, but they are in the very early stages.

The Order Details team at Ticketmaster wanted to experiment with design philosophies to encourage faster development and more flexibility; however, we had a large code base that we needed to add features to and we didn’t want to rewrite most of this code.  So instead, we decided to test something out – introducing a language that wouldn’t be as verbose as Java.  Enter Groovy.

Groovy is a language that compiles to Java Virtual Machine bytecode.  Its syntax builds on top of Java, which makes it very good as a transitional language – you rename a file’s extension from .java to .groovy and, in most cases, it just works.  Groovy has a lot of features that make it attractive: true closures, automatic field access, syntactic sugar to compact code, etc.
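To give a feel for the verbosity in question, here is a small, purely illustrative pre-Java-8 example (the class and field names are made up, not from our code base) of the bean-plus-anonymous-class ceremony that Groovy’s automatic property access and closures collapse into a line or two:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    // In Groovy, the fields below become properties with generated accessors,
    // and the sort becomes a one-line closure such as lines.sort { it.quantity }.
    public class OrderLine {
        private final String eventName;
        private final int quantity;

        public OrderLine(String eventName, int quantity) {
            this.eventName = eventName;
            this.quantity = quantity;
        }

        public String getEventName() { return eventName; }
        public int getQuantity() { return quantity; }

        public static List<OrderLine> sortByQuantity(List<OrderLine> lines) {
            List<OrderLine> sorted = new ArrayList<OrderLine>(lines);
            Collections.sort(sorted, new Comparator<OrderLine>() {
                public int compare(OrderLine a, OrderLine b) {
                    return Integer.compare(a.getQuantity(), b.getQuantity());
                }
            });
            return sorted;
        }
    }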

None of this would be possible without the JVM itself, which allows multiple languages to be used in the same program.

Add in some Clojure

We needed to add a feature to our Order Details service that involved custom data translation.  The details for that will be in a future blog entry, but eventually we decided to use Clojure as our implementation language, as it’s naturally suited to this type of feature.  A similar feature that existed in one of the older versions of our service was 1-2 orders of magnitude larger in terms of lines of code, so we were motivated to make this change.
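The mechanics of mixing the two languages are straightforward because everything runs on the same JVM. Below is a minimal sketch of calling into Clojure from Java via Clojure’s Java API – the namespace and function names are hypothetical stand-ins, not our actual translation engine, and it assumes the Clojure runtime jar is on the classpath:

    import clojure.java.api.Clojure;
    import clojure.lang.IFn;

    public class TranslationBridge {

        public static Object translate(Object order) {
            // Load the (hypothetical) Clojure namespace containing the logic.
            IFn require = Clojure.var("clojure.core", "require");
            require.invoke(Clojure.read("orderdetails.translate"));

            // Look up the Clojure function and invoke it like any other object.
            IFn translateOrder = Clojure.var("orderdetails.translate", "translate-order");
            return translateOrder.invoke(order);
        }
    }

From the Java or Groovy side of the service, the Clojure code is just another set of classes on the classpath, which is what makes this kind of incremental adoption possible.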

However, Clojure is also a more difficult language to adopt.  To help ease the transition, the members of our team read documentation and books on the subject; we also used pair programming heavily to spread knowledge through the team quickly.

It took around 1-2 months to finish the feature; however, the schemas to create the translations were easy to understand, and by and large the engine needed very little modification once it was written, so the approach was effective.

Advantages to the Approach

The main advantage of this approach is that you can write code that is particularly well suited to a specific problem without having to throw out existing code.  The equivalent Java code would have been several times larger and much harder to debug than the Clojure we wrote.  In addition, if something was better done in Groovy, we wrote it in Groovy.  This included unit and integration tests, which are not well supported in Clojure.  The move to Groovy let us gradually transition to a language that helped us move faster, without having to stop and learn an entirely new one.

Disadvantages to the Approach

However, each language needs to be supported, and new team members (whether they’re new hires or other teams working with our code base) need more time to ramp up on the code, especially if they come from a mainly Java background.  We decided the code for the feature was isolated enough that the transition could be done over time, but it’s still an issue.

In addition, there are certain architectural problems you won’t be able to solve with solely a language change.  In that case, it may be better to start from scratch.

Who should use this approach and why would they benefit?

I don’t think this approach should be used by all teams, but teams with the characteristics below could be well served by it:

  • A high degree of autonomy and engineers who understand multiple languages
  • A domain-specific problem set better suited to another language
  • A willingness to experiment with a specific technology without leaving their current development environment


Conclusion

Using multiple languages in the same development environment can be useful when a specific problem domain calls for it, or when a team wants to transition to an approach it believes will let it work more efficiently.  While our approach uses the JVM as the target compilation platform, JavaScript can be, and has been, used as a platform as well.  The approach isn’t for all teams, but it can yield great gains when used correctly.

What Ticketmaster is doing about technical debt

This post describes the journey Ticketmaster has been on over the last year to define and measure technical debt in its software platforms. The issue kept surfacing from multiple sources, and yet as an engineering organisation we had no consensus on how to define technical debt, let alone measure it or manage it. So we embarked on a journey to answer these questions and to gain agreement across the engineering organisations on a common approach to solving the problem.

We started with research and found that technical debt is all around us:

[Image: Fowler on technical debt]

A chilling example is Knight Capital, which ran updates on its systems that reactivated some dead code, causing the system to spit out incorrect trades and lose $460 million in under an hour (Source).

Ultimately debt management is a business decision – so how do we as IT professionals source and present the right data to influence the decision makers? Part of articulating the size of the problem was to compare the size of the Ticketmaster codebase to other codebases of a similar size:

[Image: Ticketmaster codebase size comparison]

Over 4.5 million lines of code are spread across 13 different platforms, from legacy to greenfield, on a whole mix of technology stacks including different flavours of .Net and LAMP. We formed a working group with members from different platforms and locations in order to build a model that would work across all these boundaries and have buy-in from all areas. Our research can be summarised as follows:

[Image: summary of key findings]

We used these three areas as the top level of categorisation for the following reasons:

Application Debt – Debt that resides in the software package; unchecked build-up can lead to:

  • Inflexibility, making it much harder to modify existing features or add new ones
  • Poor user experience
  • Costly maintenance and servicing

Infrastructure Debt – Debt that resides in the operating environments; unchecked build-up can lead to:

  • Exposure to security threats and compliance violations (e.g. PCI compliance)
  • Inability to scale and long queuing times for customers
  • Poor response and recovery times to outages
  • Costly maintenance and servicing

Architecture Debt – Debt that resides in the design of the entire system; unchecked build-up can lead to:

  • Software platforms that are highly inflexible and costly to maintain and change
  • Flawed design that can’t meet the requirements of the business
  • Single points of failure which cause outages or exacerbate them
  • Unnecessarily complex systems that can’t be adequately managed
  • Opportunities for more nimble rivals to gain competitive advantage through technology

Model

Having selected the areas to measure, we needed a model that could be applied across the huge range of technologies used throughout TM’s platforms. Testing both automated and manual processes with various teams and tools helped to refine the model so it could be applied consistently:

[Image: technical debt model]

We aimed to collect as many of the metrics as possible via automated tooling to make the process repeatable:

  • Application Debt:
  • Infrastructure Debt:

Manual Mapping

Where automated tooling didn’t exist to measure the metrics we identified, a manual process was introduced to help measure the intangible. We used the following guiding principle:

[Image: guiding principle on measurement]

It was clear that a lot of what was known about the limitations of a platform or component wasn’t being reported by tooling, but was readily available in the minds of engineers. We just needed to figure out a consistent, repeatable and transparent process for extracting that data and making it useful. Out of this need was born the technical debt mapping process:

[Image: technical debt mapping process]

Reporting: Telling the story

At this point, we’re left with lots of data. One of the core driving philosophies behind the whole process was to present pertinent information that enables executive-level decisions about how to manage debt from a strategic point of view.

[Image: need-to-know reporting principle]

For example, if debt remains stubbornly high for a critical application or area and is reducing developer throughput to a crawl, then a strategy of downing tools to address the debt may be the best option – but that decision can likely only be made at executive level. Executives only really require the most pertinent information, but the data behind the summary also needs to be readily available if required. The report is divided into three parts:

[Image: report Part A]

Part A (above) contains a rolled-up score for each of the three main debt categories for each of the systems being reported on.  It also contains a summary of the debt in the platform as contributed by the Engineering, Tech-ops and Architecture groups.
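As a very rough illustration of what a roll-up of this kind can look like (the real model, metrics and weightings are defined by the working group, so treat the names and numbers here as hypothetical), each category score can be produced as a weighted average of its individual metric scores:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical roll-up: each metric is scored 0-10 and carries a weight;
    // the category score reported in Part A is the weighted average.
    public final class DebtRollup {

        public static double categoryScore(Map<String, Double> scores,
                                           Map<String, Double> weights) {
            double weightedSum = 0.0;
            double totalWeight = 0.0;
            for (Map.Entry<String, Double> metric : scores.entrySet()) {
                double weight = weights.getOrDefault(metric.getKey(), 1.0);
                weightedSum += metric.getValue() * weight;
                totalWeight += weight;
            }
            return totalWeight == 0.0 ? 0.0 : weightedSum / totalWeight;
        }

        public static void main(String[] args) {
            Map<String, Double> scores = new LinkedHashMap<String, Double>();
            scores.put("code coverage gap", 6.0);
            scores.put("static analysis violations", 4.0);
            scores.put("outdated dependencies", 8.0);
            Map<String, Double> weights = new LinkedHashMap<String, Double>();
            weights.put("outdated dependencies", 2.0);
            System.out.println(categoryScore(scores, weights));  // prints 6.5
        }
    }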

[Image: report Part B]

Part B (above) contains a more detailed breakdown of each main category’s definitions for each of the systems being reported on.  This gives the reader better insight into which definitions have higher or lower debt levels and an indication of where work needs prioritising.  It also shows the next items to be worked on.

[Image: report Part C]

Part C (above) contains the longer-term technical debt backlog for each of the systems being reported on, broken down by category. There is no time estimate for each item, but some could span months or even years.  This section is aimed more towards the Engineering and Architecture teams.

[Image: how to interpret the report]

What do we do with the output?

Update Technical backlogs

  • Updated with new debt items as they are identified during the mapping processes
  • Technical debt items prioritised according to criticality, not level of debt – some components may contain a lot of debt but are stable, or no longer strategic

Update Product Roadmaps

  • Selected technical debt items prioritised against product roadmap items
  • Product teams need to be bought into the importance of maintenance work
  • Value needs to be clearly defined and communicated, in order to make the right strategic decisions for scheduling the maintenance work and managing the debt.

What next?

Managing technical debt is about more than identifying and scheduling maintenance work. With the report planned to be issued quarterly, the intention is that the visibility it provides, plus the tooling provided to engineering teams, will help stem and reduce the introduction of technical debt. By adhering to industry best practices, and being conscious of the implications of introducing further debt, engineering teams can take steps to build quality into the products and platforms as they go. The debt that is introduced as projects progress can therefore be minimised, and product and engineering teams can make more informed decisions about how and when to pay debt down.

Ultimately the goal of each of us is to delight our fans. Running well-maintained systems will benefit them with better stability, better response times and, ultimately, faster delivery of features as debt is brought under control, managed, and interest rates reduced to a sensible level. Debt management will benefit the whole business in a similar way – less firefighting, fewer outages and platforms that are easier to develop on. In the end, we all stand to gain.

Taking a Stand Against Overdesign

This post was co-authored by Jean-Philippe Grenier and Sylvain Gilbert. 

Have you ever spent countless hours arguing about how things should be done? Have you ever been in an analysis-paralysis scenario where you were going through so many use cases that you couldn’t figure out the architecture? Have you ever seen an architecture being built around a particular case that “we may eventually want to support”?

Continue reading

The Case of the Recurring Network Timeout

This post was co-authored with Roshan Revankar

At Ticketmaster we’re passionate about monitoring our production systems. As a result, we occasionally come across interesting issues affecting our services that would otherwise go unnoticed. Unfortunately, monitoring only indicates the symptoms of what’s wrong and not necessarily the cause. Digging in deeper and getting to the root cause is a whole different ball game. This is one such example.

Continue reading

View from Section – Behind The Scenes

I’m a tennis fan and going to the US Open has become an annual tradition for me. So, I was more than excited when the US Open was among Ticketmaster’s first few events with the View from Section feature. Our aim, as always, is to provide a great live event experience to fans – allowing them to check out the view from their seat before they buy tickets is just another part of that mission.

[Image: View from Section screenshot]

Continue reading

Fear and Paranoia in a DevOps World

DevOps as a philosophy has been gaining momentum among startups and enterprises alike for its potential to deliver high-quality features to the customer at a fast pace.  It’s no surprise that at Ticketmaster we are championing the DevOps model so that we can deliver on our promise of enhancing the fan experience. The DevOps model implies rapid iteration on features and – specifically, the focus of this post – frequent production deployments.

Continue reading

The Tao of Ticketing

Our ticketing Inventory Service here at Ticketmaster searches for and reserves seats in an event. It exists exclusively for inventory management, unlike the classic Ticketmaster “Host”, which also handles reporting, user management, event programming, and more in addition to seat selection. The Inventory Service is arguably Ticketmaster’s core. It is the base layer underneath the entire Ticketmaster architecture. It is the singular part of our software stack that gives the stack its unique Ticketmaster identity.

It is also a seeming anachronism in a service-oriented architecture. The Inventory Core (the lowest-level component) is written in C++ with assembly bits for hyper-critical sections. It communicates via Google Protocol Buffers, doesn’t scale, and can’t be distributed. Why?

Ticketing is a complex and unique technology and computer science challenge, one that requires the equally unusual solution we’ve developed. How do we allocate multiple high-demand, spatially aware, contiguous seats in a fair but ultimately performance-oriented manner?

Continue reading