This post describes the journey Ticketmaster has been on over the last year to define and measure technical debt in its software platforms. The issue kept surfacing from multiple sources, yet as an engineering organisation we had no consensus on how to define technical debt, let alone measure or manage it. So we set out to answer these questions and to gain agreement across the engineering organisations on a common approach to solving the problem.
We started with research and found that technical debt is all around us:
A chilling example is Knight Capital, which ran an update on its systems that reactivated some dead code, causing the system to spit out incorrect trades – losing $460 million in under an hour (Source).
Ultimately, debt management is a business decision – so how do we as IT professionals source and present the right data to influence the decision makers? Part of articulating the size of the problem was to compare the Ticketmaster codebase with other codebases of a similar size:
Over 4.5 million lines of code are spread across 13 different platforms, from legacy to greenfield, across a whole mix of technology stacks including different flavours of .Net and LAMP. We formed a working group with members from different platforms and locations to build a model that would work across all these boundaries and would have buy-in from all areas. Our research can be summarised as follows:
We used these 3 different areas as the top level of categorisation for the following reasons:
Application Debt – Debt that resides in the software package; unchecked build up can lead to:
- Inflexibility, making it much harder to modify existing features or add new ones
- Poor user experience
- Costly maintenance and servicing
Infrastructure Debt – Debt that resides in the operating environments; unchecked build up can lead to:
- Exposure to security threats and compliance violations (e.g. PCI compliance)
- Inability to scale and long queuing times for customers
- Poor response and recovery times to outages
- Costly maintenance and servicing
Architecture Debt – Debt that resides in the design of the entire system; unchecked build up can lead to:
- Software platforms that are highly inflexible and costly to maintain and change
- Flawed design that can’t meet the requirements of the business
- Single points of failure which cause outages or exacerbate them
- Unnecessarily complex systems that can’t be adequately managed
- Opportunities for more nimble rivals to gain competitive advantage through technology
Having selected the areas to measure, we needed a model that could be applied across the huge range of technologies used throughout TM’s platforms. Testing both automated and manual processes with various teams and tools helped to refine the model so it could be applied consistently:
We aimed to collect as many of the metrics as possible via automated tooling to make the process repeatable:
- Application Debt:
- Infrastructure Debt:
Where no automated tooling existed to measure the metrics we identified, we introduced a manual process to help measure the intangible. We used the following guiding principle:
It was clear that a lot of what was known about the limitations of a platform or component wasn’t being reported by tooling, but was readily available in the minds of engineers. We just needed to figure out a consistent, repeatable and transparent process for extracting that data and making it useful. Out of this need was born the technical debt mapping process:
Reporting: Telling the story
At this point, we’re left with lots of data. One of the core driving philosophies behind the whole process was to present pertinent information to enable executive level decisions to be made as to how to manage debt from a strategic point of view.
For example, if debt remains stubbornly high for a critical application or area and is reducing developer throughput to a crawl, then a strategy of downing tools to address the debt may be the best option – but that decision can probably only be made at executive level. Executives need only the most pertinent information, but the data behind the summary must also be readily available on request. The report is divided into three parts:
Part A (above) contains a rolled-up score for each of the three main debt categories for each of the systems being reported on. It also contains a summary of the debt in the platform, contributed by the Engineering, Tech-ops and Architecture groups.
Part B (above) contains a more detailed breakdown of each main category’s definitions for each of the systems being reported on. This gives the reader better insight into which definitions have higher or lower debt levels, and an indication of where work needs prioritising. It also shows the next items to be worked on.
Part C (above) contains the longer-term technical debt backlog for each of the systems being reported on, broken down by category. No timescale is given for individual items, but some could span months or even years. This section is aimed more towards the Engineering and Architecture teams.
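To make the relationship between Part A and Part B concrete, here is a minimal sketch of how per-definition scores could be rolled up into one score per category. It is an illustration only: the 0-10 scale, the definition names, the use of a simple average and the `roll_up` function are all assumptions, not the scoring model used in the actual report.

```python
from enum import Enum


class DebtCategory(Enum):
    """The three top-level debt categories described in this post."""
    APPLICATION = "Application"
    INFRASTRUCTURE = "Infrastructure"
    ARCHITECTURE = "Architecture"


def roll_up(definition_scores):
    """Collapse per-definition scores (Part B) into one score per category (Part A).

    Assumption: a plain average, rounded to one decimal place.
    Categories with no scored definitions are left out.
    """
    return {
        category: round(sum(scores.values()) / len(scores), 1)
        for category, scores in definition_scores.items()
        if scores
    }


# Hypothetical Part B data for a single system (names and values are invented).
part_b = {
    DebtCategory.APPLICATION: {"Code complexity": 6, "Test coverage": 8},
    DebtCategory.INFRASTRUCTURE: {"OS patch level": 3, "Monitoring": 5},
    DebtCategory.ARCHITECTURE: {"Single points of failure": 7},
}

part_a = roll_up(part_b)
for category, score in part_a.items():
    print(f"{category.value} Debt: {score}")
```

A weighted average, or taking the worst definition score in each category, would be equally plausible roll-up rules; the point is only that the executive view is derived from the same data as the detailed view, so the detail is always available behind the summary.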
What do we do with the output?
Update Technical backlogs
- Updated with new debt items as they are identified during the mapping processes
- Technical debt items prioritised according to criticality, not level of debt – some components may contain a lot of debt but are stable, or no longer strategic
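As a rough illustration of prioritising by criticality rather than by level of debt, the sketch below orders a backlog so that critical, strategic components come first and the amount of debt only breaks ties. The `DebtItem` fields, the 0-10 scales and the component names are hypothetical, chosen for the example rather than taken from the actual backlog.

```python
from dataclasses import dataclass


@dataclass
class DebtItem:
    component: str
    debt_level: int         # hypothetical 0-10 score; higher means more debt
    criticality: int        # hypothetical 0-10 score; higher means more critical
    strategic: bool = True  # False for components that are no longer strategic


def prioritise(items):
    """Order items: strategic before non-strategic, then by criticality,
    with debt level used only as a tie-breaker."""
    return sorted(
        items,
        key=lambda item: (not item.strategic, -item.criticality, -item.debt_level),
    )


# Invented example backlog.
backlog = [
    DebtItem("legacy-reporting", debt_level=9, criticality=2, strategic=False),
    DebtItem("checkout", debt_level=4, criticality=9),
    DebtItem("search", debt_level=7, criticality=6),
]

ordered = [item.component for item in prioritise(backlog)]
print(ordered)
```

Here the component with the most debt ends up last, because it is stable and no longer strategic, while the critical component with comparatively little debt is addressed first.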
Update Product Roadmaps
- Selected technical debt items prioritised against product roadmap items
- Product teams need to buy into the importance of maintenance work
- Value needs to be clearly defined and communicated, in order to make the right strategic decisions for scheduling the maintenance work and managing the debt.
Managing technical debt is about more than identifying and scheduling maintenance work. We plan to issue the report quarterly, and the intention is that the visibility it provides, together with the tooling provided to engineering teams, will help to stem and reduce the introduction of technical debt. By adhering to industry best practices, and being conscious of the implications of introducing further debt, engineering teams can take steps to build quality into the products and platforms as they go. The debt that is introduced as projects progress can therefore be minimised, and product and engineering teams can make more informed decisions about how and when to pay debt down.
Ultimately, the goal of each of us is to delight our fans. Well-maintained systems will benefit them with better stability, better response times and faster delivery of features as debt is brought under control and managed, and its interest rate is reduced to a sensible level. Debt management will benefit the whole business in a similar way: less firefighting, fewer outages and platforms that are easier to develop. We all stand to gain.