Implementing a DevOps Strategy across multiple locations & product teams

Over the last 18 months, a change has begun within the Ticketmaster International Team. Barriers are being broken down between the engineering and operational teams, our different product delivery teams are being aligned, and knowledge sharing across teams is happening more and more. What’s changed? We developed a strategy based around DevOps to create a leaner, higher-performing organisation, and our journey is underway.

As with many large, mature international companies, our situation is probably not unique: our product delivery & TechOps teams are distributed across 5 geographical locations: Belgrade (Serbia), Gothenburg (Sweden), London (UK), Quebec (Canada) and Stoke (UK). Across these teams we manage about 15 different platforms. Our challenge was to create a DevOps strategy and implement change in a flexible manner across all delivery teams.

As with any distributed organisation, we have suffered from communication barriers, although tools such as Skype, Slack and Zoom are all helping to break them down. More fundamental issues existed, however, such as inconsistent terminology, multiple tools being used for the same job, differences in skills and abilities between locations, and silos. One example of a silo was our TechOps team: a separate, centralised group with different goals to the engineering teams. When groups that need to work together are not aligned and have different goals, friction follows. In our case, because of the way we were organised, the multiple concurrent requests coming into TechOps from the various engineering teams made it difficult for them to service all teams at the same time, which caused delays.

The differences in tooling and processes have created a barrier that slows us all down. We needed a new approach and developing a DevOps strategy has been one of the answers for us.

Our DevOps Strategy

In developing our DevOps strategy we wanted all teams to speak the same language and to have shared understanding and skills. We wanted to break down the silos that had been built over time, bringing teams closer together and aligning resources to product delivery, so that we could be more agile and nimble, developing and releasing high-quality products quickly, efficiently and reliably. Echoing the Agile manifesto principles:

Our highest priority is to satisfy the customer through early and continuous delivery of valuable software – Principle #1

Deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale – Principle #3

Coalescing our ambitions and desires, and mindful of the Agile manifesto principles, we defined 4 main objectives for our DevOps strategy:

  • Maximise for delivering business value.
  • Maximise the efficiency and quality of the development process.
  • Maximise the reliability of applications and environments.
  • Maximise for service delivery.

With these objectives we started to define requirements to achieve them. Quickly we ran into a mountain of requirements and with that a prioritisation nightmare: how to prioritise the requirements across 5 global locations and 15+ delivery teams, each with different needs.

The Maturity Models

After several rounds of attempting to prioritise in a sensible way, we began to arrange the requirements into themes, and from that a set of Maturity Models evolved: one for each objective.

Maximise for delivering business value. This goal is centred on continuous delivery, creating fast runways down which we can launch our applications.

[Figure: maturity model for maximising delivery of business value]

Maximise the efficiency and quality of the development process. This goal is centred on continuous integration, creating the environment to launch a battery of automated tests and gain fast feedback to be able to evolve code.

[Figure: maturity model for maximising development efficiency and quality]

Maximise the reliability of applications and environments. This goal is centred on instrumentation, creating the visibility into the inner workings of our applications for root cause analysis and fault tolerance.

[Figure: maturity model for maximising application and environment reliability]

Maximise for service delivery. This goal is centred on organisational change, creating alignment of cross-functional teams responsible for delivering software.

[Figure: maturity model for maximising service delivery]

The Maturity Models are great; they give a clear vision of what our strategy is. Because they define the steps required to progress to advanced levels of DevOps, we can set long-term and short-term targets against different themes or levels. They are also modular, so we can change the strategy as improved technology or processes emerge, and fill in gaps where they exist.
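
To make the idea concrete, here is a minimal sketch of how a maturity model and a team’s progress against it might be represented; the themes, level names and teams below are purely illustrative and are not taken from our actual models:

    # A hypothetical sketch of how a maturity model and a team's progress
    # against it could be represented. Themes, level names and teams are
    # illustrative only, not our actual models.
    MATURITY_MODEL = {
        "continuous delivery": ["manual deploys", "scripted deploys",
                                "one-click deploys", "fully automated pipeline"],
        "instrumentation": ["basic logging", "centralised logging",
                            "metrics & dashboards", "symptom-based alerting"],
    }

    # Each team records the highest level it has reached per theme (by index).
    team_progress = {
        "team-a": {"continuous delivery": 2, "instrumentation": 1},
        "team-b": {"continuous delivery": 1, "instrumentation": 3},
    }

    def report(team):
        """Print the current level and the next target for each theme."""
        for theme, level in team_progress[team].items():
            levels = MATURITY_MODEL[theme]
            current = levels[level]
            nxt = levels[level + 1] if level + 1 < len(levels) else "top level reached"
            print("%s / %s: at '%s', next target '%s'" % (team, theme, current, nxt))

    report("team-a")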

Flexible Planning

The nice thing about the maturity models is the flexibility they provide. They are maps that can guide you from a low to a high level of DevOps maturity. Imagine using a map to plan a route from A to B: depending on various conditions, such as the day of the week, the time of day, likely traffic density, road speeds and road works, the route you choose will be the most appropriate for that set of circumstances.

[Figure: planning a route from A to B]

The DevOps maturity models are true roadmaps, as opposed to a linear list of requirements, allowing each delivery team to navigate its own path according to its context: what is most important to it, or what concerns it has, at any point in time. Furthering this flexibility, the Maturity Models allow teams to change their routes and reprioritise their plans in concert with changing business needs.

When an individual team completes a portion of the maturity model that no other team has yet reached, there is an additional benefit: the problems solved by that team can be shared with all the others, allowing them to achieve the same work faster and avoid the pitfalls that the early-adopting team has already learnt about.

Even though all product delivery teams have the flexibility to select their own routes to achieving our DevOps objectives, ultimately everyone ends up at the same location. So the maturity models enable various programs of work to be planned across different teams with very different needs and abilities.

Standardisation

As good as our maturity models are, they couldn’t solve a couple of issues which still existed: we were using multiple tools to do the same jobs, and we spoke different languages because we used different terminology for the same things. To solve this, prior to kicking off our strategy we set up focused working groups to define and agree a set of standards for tooling, definitions of terms (e.g. naming conventions), best practices (e.g. code reviews) and core specifications (e.g. logging, heartbeats and health checks).

Our Core Tooling

  • Git – Source Control
  • GitLab – Git Management & Code Reviews
  • Jenkins – Application Builds
  • SonarQube – Code Quality Reporting
  • Sonatype Nexus – Package Management
  • Rundeck – Operational Support
  • Octopus Deploy – Deployment (Windows only)
  • Chef – Configuration Management

Standardising our tooling and our specifications for implementing instrumentation meant we could reduce support overheads, share knowledge and solve each problem once. Guidelines and best practices meant we were all working in the same ways with a shared understanding. Definitions of terms meant we could all speak the same language and avoid confusion.
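
To give a flavour of what the core specifications cover, here is a minimal sketch of a JSON-structured log line and a health-check payload; the field names and service name are hypothetical and are not taken from our actual specifications:

    # Hypothetical sketch only: the field names and service name below are not
    # our actual specifications, just an illustration of the kind of thing a
    # shared logging and health-check standard pins down.
    import json
    import logging
    import time

    class JsonLogFormatter(logging.Formatter):
        """Emit each log record as a single JSON line so it can be parsed consistently."""
        def format(self, record):
            return json.dumps({
                "timestamp": self.formatTime(record),
                "level": record.levelname,
                "service": "example-service",   # hypothetical service name
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler()
    handler.setFormatter(JsonLogFormatter())
    log = logging.getLogger("example-service")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    def health_check():
        """Return the payload a standard /health endpoint might serve."""
        return {"status": "OK", "version": "1.2.3", "timestamp": int(time.time())}

    log.info("order placed")
    print(json.dumps(health_check()))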

With the maturity models and standards we have created a shared vision while giving each product delivery team the flexibility to plan what they want to work on. We have created a framework that enables all product delivery teams to start working towards the DevOps objectives in parallel, each focusing on what is most important to their needs at any given point in time.

2015 London tmTechEvent – Ticketmaster’s in-house tech conference

[Photo: the event venue]

“Welcome to the tmTechEvent 2015,” John McIntyre, head of PMO at Ticketmaster International announced as he scanned my badge. The scanner beeped appreciatively as it recognised my credentials and I was granted entry – just like going to the O2! Apart from the fact that this was our annual Ticketmaster technology summit meeting in a conference room just up the road from our offices at the edge of London’s Silicon Roundabout. But this wasn’t any old conference. Being in the live entertainment space, we had quizzes (using Kahoot.it), a live Twitter feed (@TMTechEvent) and a party in our London HQ’s basement bar complete with gourmet burger van!

As I entered the room there was a palpable sense of tension and nervous excitement in the air as I greeted my colleagues from the Sports division: with the Rugby World Cup starting on Friday, their systems would be under the spotlight – or was it just a matter of doubt over England’s ability to deal with a tricky opening tie with Fiji? Ultimately both fears proved unfounded – all the events ran smoothly and England prevailed.

Along with leaders from all of Ticketmaster’s other technology teams, we had joined together to take stock of our progress in revamping our technology real estate. The 4 day event was packed full of seminars, workshops and group sessions, with the overall aim of evaluating our strategies and determining where to correct our course. We followed the guiding principle of “focus where you want to go, not what you fear.”

[Image: “Focus where you want to go, not what you fear”]

We pulled in leaders from all over the business to help shape our vision of where we’re going. We combined this with focused feedback sessions on the various aspects of our strategy, to determine how best to pivot and adapt to the changing landscape. Using workshops to facilitate a rich exchange of ideas, we covered subjects from staff satisfaction and talent management to deeper technical subject matter such as engineering KPIs and reference architectures.

Other highlights included live demos of in-house tools, including one that had been created to give visibility of progress in our DevOps program. It was really cool to see each team’s progress in one place across our home-grown four-part maturity model. Even better was sharing the whole event with colleagues across Engineering and Operations and feeling a real sense of unity, proving that good DevOps is about culture change and not just a bunch of new processes!

[Photo: tmTechEvent 2015]

The sheer scope and depth of material covered was brain bending. We rolled out our new career mapping program, providing a structured career map and promotion process across all of our engineering teams. We had a thorough review of the initial results of our innovative and much talked about technical debt management program that we rolled out at the start of the year. We reviewed the progress being made on our employee feedback survey, to ensure that the concerns of our engineers are being taken seriously (it’s not just about having more opportunity to play ping pong!)

Overall this was an inspiring event. There was a tangible confidence and will to achieve our ambitions, based on a very real sense of achievement from how much we had already changed things for the better in Ticketmaster Engineering. The desire to increase collaboration and tackle the bigger challenges ahead together was strong.

Having reset our sights on the vision of the organisation, our technology vision and our engineering vision, the result was a noticeably energised and motivated group of rock-star tech leaders ready to take that vision back out to their teams and the company. The future of better live entertainment starts here!

Experimenting With Efficiency

Have you ever felt bogged down by the weight of process? I’m experimenting with increasing efficiency and reducing workload on my team at Ticketmaster by applying lessons from Gene Kim’s book on DevOps, “The Phoenix Project”. By learning and implementing what the book calls “The Three Ways”, we hope to drastically increase our productivity and quality of code, all while reducing our workload.

The First Way is defined as “understanding how to create fast flow of work as it moves from one work center to another.”1 A ‘work center’ can be either a team or an individual who has a hand in working as a part of a larger process. A major part of creating fast flow of work relies on improving the process of hand-offs between different teams. By working on improving visibility of the flow of work, one is able to both get a better understanding of the current workflow and identify which work centers act as bottlenecks.

The Second Way, “shortening and amplifying feedback loops, so we can fix quality at the source and avoid work”2 is about being able to understand and respond to the needs of internal and external customers. In order to shorten feedback loops, one should find ways to reduce the number of work centers or the number of steps it takes to complete a task (including but not limited to combining teams, removing steps altogether, or automating certain processes). The other part of the Second Way requires reducing work at the bottlenecks or otherwise finding ways to remove work from the system, so that the feedback for the work left in the system can be emphasized.

The Third Way is to “create a culture that simultaneously fosters experimentation, learning from failure, and understanding that repetition and practice are the prerequisites to mastery”3. Major components of the Third Way include allocating time for the improvement of daily work, introducing faults into the system to increase resilience, and creating rituals, such as code katas or fire drills, that can expose people to new ways of doing things, or help them master the current system.

In our workplace, we are working on applying the Three Ways to improve our daily lives. Currently, we have several teams in different geographical locations working on the same codebase. Initially, this led to many dependency conflicts, lots of tasks being blocked by other teams, and many communication issues.

We recently started using a kanban board in order to give us better visibility into our workflow, and have added a column on the board for every hand-off between teams. The focus is now on finding ways to reduce the wait time between columns. We have put together checklists in order to aid with communication and improve quality, so that wait times might be reduced. Simultaneously, we are working on ways to remove our reliance on other teams for things such as code review. There are still problems regarding story blockers, but it is hoped that these problems can be solved in the long run by either re-structuring the team’s responsibilities to match the system’s design, or vice versa.

[Figure 1: Our team’s kanban board]
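
As a rough illustration of how the wait time between columns might be measured, here is a sketch that computes the average time cards spend in each column from timestamped column transitions; the column names, card IDs and timestamps are made up, and this is not a description of our actual tooling:

    # A rough sketch of measuring time spent per kanban column from timestamped
    # card movements. Cards, columns and timestamps are made up for illustration.
    from collections import defaultdict
    from datetime import datetime

    # Each entry: (card, column the card moved into, when it entered that column).
    transitions = [
        ("CARD-101", "In Dev",         datetime(2015, 9, 1, 9, 0)),
        ("CARD-101", "Waiting for QA", datetime(2015, 9, 2, 17, 0)),
        ("CARD-101", "In QA",          datetime(2015, 9, 4, 10, 0)),
        ("CARD-102", "In Dev",         datetime(2015, 9, 1, 11, 0)),
        ("CARD-102", "Waiting for QA", datetime(2015, 9, 3, 9, 0)),
        ("CARD-102", "In QA",          datetime(2015, 9, 3, 15, 0)),
    ]

    by_card = defaultdict(list)
    for card, column, entered in transitions:
        by_card[card].append((entered, column))

    hours_in_column = defaultdict(list)
    for events in by_card.values():
        events.sort()
        # Time in a column is the gap until the card enters the next column.
        for (entered, column), (left, _next_column) in zip(events, events[1:]):
            hours_in_column[column].append((left - entered).total_seconds() / 3600)

    for column, hours in hours_in_column.items():
        print("%s: average %.1f hours" % (column, sum(hours) / len(hours)))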

Applying the Three Ways is still a work in progress, but we are already seeing benefits.  Whereas before our product people were creating tasks faster than us developers could work on them, now they are scrambling to keep up with us. Although we still have to deal with stories from other teams acting as blockers, the flow of communication has greatly improved, and the decrease in wait time has been noticeable. Any time gained by our team is being used for “10% time”, which is time dedicated to either research or tasks that will help our team improve daily work and overall efficiency.


1. Gene Kim, The Phoenix Project, Page 89
2. Gene Kim, The Phoenix Project, Page 89
3. Gene Kim, The Phoenix Project, Page 90

Symptom-Based Monitoring at Ticketmaster

[Image: monitoring dashboard]

When Rob Ewaschuk – a former SRE at Google – jotted down his philosophy on alerting, it resonated with us almost immediately. We had been trying to figure out our alerting strategy around our then relatively new Service-Oriented Architecture – the term microservices hadn’t quite entered the zeitgeist at the time.

It’s not that we didn’t have any alerting. In fact, we had too much – running the gamut from system alerts like high CPU and low memory to health-check alerts. However, these weren’t doing the job for us. In a system that is properly load balanced, a single node having high CPU does not necessarily mean the customer is impacted. Moreover, in an SOA, a single bad node in one service is extremely unlikely to result in a customer-impacting issue. It’s no surprise then that with all the alerting we had, we still ended up having multiple customer-impacting issues that were either detected too late or – even worse – detected by customer support calls.

Rob’s post hit the nail on the head with his differentiation of “symptom-based monitoring” vs “cause-based monitoring”:

I call this “symptom-based monitoring,” in contrast to “cause-based monitoring”. Do your users care if your MySQL servers are down? No, they care if their queries are failing. (Perhaps you’re cringing already, in love with your Nagios rules for MySQL servers? Your users don’t even know your MySQL servers exist!) Do your users care if a support (i.e. non-serving-path) binary is in a restart-loop? No, they care if their features are failing. Do they care if your data push is failing? No, they care about whether their results are fresh.

It was obvious to us that we had to change course and focus on the symptoms rather than the causes. We started by looking at what tools we had at our disposal to get symptom-based monitoring up and running as soon as possible. At the time, we were using Nimbus for alerting and OpenTSDB for time-series data, and then we had Splunk. Splunk is an industry leader for aggregating machine data – typically log files – and deriving business and operational intelligence from that data. We had always used Splunk for business analytics and for searching within logs while investigating production issues, but we had never effectively used Splunk to alert us to those issues in the first place. As a symptom-based monitoring tool, Splunk now stood out as an obvious candidate for the following reasons:

  • Since Splunk aggregates logs from multiple nodes, it is possible to get a sense of the scale and scope of the issue.
  • It also allowed us to set up alerting based on our existing logs without requiring code changes. Though, over time, based on what we learnt, we did enhance our logging to enable additional alerts.

Since the objective was to alert on issues that impact the user, we started by identifying user flows that were of most importance to us, e.g., add to cart, place order, and add a payment method. For each flow, we then identified possible pain points like errors, latency and timeouts, and defined appropriate thresholds. Rob talks about alerting from the spout, indicating that the best place to set up alerts is from the client’s perspective in a client server architecture. For us, that was the front end web service and the API layer that our mobile apps talk to. We set up most of our symptom-based alerts in those layers.
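
To illustrate the kind of per-flow thresholds we mean, here is a hypothetical sketch; the flows are the ones named above, but the numbers are invented and are not our production thresholds:

    # Hypothetical per-flow symptom thresholds. The flows come from the post;
    # the numbers are invented and are not our production values.
    SYMPTOM_THRESHOLDS = {
        "add to cart":        {"max_error_rate_pct": 1.0, "max_p99_latency_ms": 2000},
        "place order":        {"max_error_rate_pct": 0.5, "max_p99_latency_ms": 3000},
        "add payment method": {"max_error_rate_pct": 1.0, "max_p99_latency_ms": 2500},
    }

    def breached_symptoms(flow, error_rate_pct, p99_latency_ms):
        """Return which symptom thresholds the observed values breach for a flow."""
        limits = SYMPTOM_THRESHOLDS[flow]
        breached = []
        if error_rate_pct > limits["max_error_rate_pct"]:
            breached.append("error rate")
        if p99_latency_ms > limits["max_p99_latency_ms"]:
            breached.append("latency")
        return breached

    print(breached_symptoms("add to cart", error_rate_pct=2.3, p99_latency_ms=1500))
    # ['error rate']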

When our symptom-based alerts first went live, we used a brand-spanking new technology called email – we simply sent these alerts out to a wide distribution of engineering teams. Noisy alerts had to be quickly fine-tuned and fixed, since there is nothing worse than your alerts being treated as spam. Email worked surprisingly well for us as a first step. Engineers would respond to alerts and either investigate them themselves or escalate to other teams for resolution. It also had an unintentional benefit: there was greater visibility among different teams of the problems in the system. But alerts by email only go so far – they don’t work well when issues occur outside of business hours, they are easy to miss amidst the deluge that can hit an inbox, and there is no reliable tracking.

We decided to use PagerDuty as our incident management platform. Setting up on-call schedules and escalation policies in PagerDuty was a breeze and our engineers took to it right away – rather unexpected for something meant to wake you up in the middle of the night. Going with email first had allowed us to punt on a pesky conundrum – in a service-oriented architecture, who do you page? – but now we needed to solve that problem. For some issues, we can use the error code in the alert to determine which service team has to be paged. But other symptom-based alerts – for example, latency in add to cart – could be caused by any one of the services participating in that flow. We ended up with somewhat of a compromise: for each user flow, we identified a primary team and a secondary team based on which of the services had the most work in that flow. For example, for the add to cart flow, the Cart Service could be primary and the Inventory Service secondary. In PagerDuty, we then set up escalation policies that looked like this:

[Image: PagerDuty escalation policy]
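
In spirit, the routing works something like the sketch below; the timings and team names are hypothetical, and this is plain data for illustration, not our actual PagerDuty configuration:

    # Hypothetical sketch of the primary/secondary routing idea; the timings and
    # structure are invented and this is not our actual PagerDuty configuration.
    ESCALATION_POLICIES = {
        "add to cart": [
            {"team": "Cart Service on-call",      "escalate_after_min": 0},
            {"team": "Inventory Service on-call", "escalate_after_min": 15},
            {"team": "Engineering manager",       "escalate_after_min": 30},
        ],
    }

    def who_holds_the_page(flow, minutes_unacknowledged):
        """Return the team that should currently hold an unacknowledged page."""
        current = None
        for step in ESCALATION_POLICIES[flow]:
            if minutes_unacknowledged >= step["escalate_after_min"]:
                current = step["team"]
        return current

    print(who_holds_the_page("add to cart", minutes_unacknowledged=20))
    # Inventory Service on-call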

Another key guideline – nay, rule that Rob calls out – is that pages must be actionable. An issue we’ve occasionally had is that we get a small spike of errors that is enough to trigger an alert but doesn’t continue to occur. These issues need to be tracked and looked into, but they don’t need the urgency of a page. This is another instance where we haven’t really found the best solution, but we found something that works for us. In Splunk, we set the trigger condition based on the rate of errors:

[Image: Splunk alert configuration]

The custom condition in the alert is set to:

stats count by date_minute|stats count|search count>=5

The “stats count by date_minute” tabulates the count of errors for each minute. The next “stats count” counts the number of rows in that table. And finally, since we’re looking at a 5-minute span, we trigger the alert when the number of rows is 5, implying that there was at least one error in each of those minutes. This obviously does not work well for all use cases. If you know of other ways to determine whether an error is continuing, do let us know in the comments.
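
For readers less familiar with Splunk’s search language, a rough Python equivalent of that trigger logic is sketched below, assuming we already have the error timestamps from the 5-minute window; the timestamps are sample data and this is just an illustration of the logic, not what actually runs in Splunk:

    # A rough Python equivalent of the Splunk trigger logic: alert only if at
    # least one error occurred in every minute of the 5-minute window.
    # The timestamps below are made-up sample data.
    from datetime import datetime

    WINDOW_MINUTES = 5

    error_timestamps = [
        datetime(2015, 11, 20, 12, 0, 12),
        datetime(2015, 11, 20, 12, 1, 45),
        datetime(2015, 11, 20, 12, 2, 3),
        datetime(2015, 11, 20, 12, 3, 59),
        datetime(2015, 11, 20, 12, 4, 30),
    ]

    # Equivalent of "stats count by date_minute | stats count": count the
    # distinct minutes that contained at least one error.
    distinct_minutes = {ts.replace(second=0, microsecond=0) for ts in error_timestamps}

    # Equivalent of "search count>=5": errors spanned the whole window, so this
    # looks like a continuing problem rather than a momentary spike.
    should_alert = len(distinct_minutes) >= WINDOW_MINUTES
    print("trigger alert:", should_alert)  # True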

This is just the beginning and we’re continuing to evolve our strategies based on what we learn in production. We still have work to do around improving tracking and accountability of our alerts. Being able to quickly detect the root cause once an alert fires is also something we need to get better at. Overall, our shift in focus to symptom-based alerting has paid dividends and has allowed us to detect issues and react faster, making the site more stable and providing a better experience for our fans. Doing this while ensuring that our developers don’t get woken up by noisy alerts also makes for happier developers.

Tools Shaping Culture

Ticketmaster’s Mark Maun recently presented at the Southern California Linux Expo on how great tools can actually be a driving factor for cultural change at scale. Ticketmaster’s DevOps culture has gone through transformative change, largely through the use of open source tools. In his SCALE 13x presentation, Mark walks you through the motivations for change and shares examples of how great tooling has impacted Ticketmaster’s ability to increase product velocity and overall system reliability at scale. Mark’s presentation starts at 3:43:00.

You can see an expanded description of the presentation from the SCALE website here.