DevOps as a philosophy has been gaining momentum among startups and enterprises alike for its potential to deliver high-quality features to customers at a fast pace. It’s no surprise that at Ticketmaster, we are championing the DevOps model so that we can deliver on our promise of enhancing the fan experience. The DevOps model implies rapid iteration on features and, specifically, frequent production deployments, which are the focus of this post.
The premise of frequent small production deployments is that we minimize the risk in each deployment by keeping the changeset small. However, most seasoned developers tend to have a fear of production deployments, cultivated from years of experience. We all have our favorite horror stories – from those pesky bugs that only strike in production to that perfect storm of events leading to catastrophic failure. Increasing the frequency of production deployments can only further fuel the sense of paranoia in developers.
Getting over this fear – making production deployments low-risk and stress-free – is critical for the DevOps model to be successful. Towards that end, we are focusing on three areas – architecture, testing and monitoring. While architecture, testing and monitoring are always important, they merit special consideration in a DevOps organization.
On the architecture front, we have embraced Service Oriented Architecture (SOA) to help us solve the DevOps puzzle. Decomposing a monolithic application into functionally isolated web services brings a number of benefits to the product development lifecycle. Most importantly for DevOps, it means that you can deploy incremental improvements to component services without affecting the entire application. However, managing a SOA environment is not without its challenges. Leslie Lamport’s famous quote on distributed systems applies equally well to SOA:
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
SOA does require organizational discipline and planning. Our service owners are required to make sure that changes are always backwards compatible. Similarly, we expect service clients to use SOA patterns like Tolerant Reader so that new features provided by the service don’t break existing clients. We also use asynchronous messaging in certain flows to further decouple the systems.
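To make the Tolerant Reader idea concrete, here is a minimal sketch in Python. It is an illustration only, not Ticketmaster’s client code: the `EventSummary` type, the `read_event` function and the field names in the payloads are all hypothetical. The client extracts only the fields it needs, supplies defaults for optional ones, and ignores anything it does not recognize.

```python
import json
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class EventSummary:
    """The only fields this hypothetical client actually needs."""
    event_id: str
    name: str
    available_seats: int = 0  # default used when the service omits the field


def read_event(payload: Mapping[str, Any]) -> EventSummary:
    """Tolerant Reader: take what we need and ignore everything else."""
    return EventSummary(
        event_id=str(payload["id"]),        # required: raise KeyError if absent
        name=str(payload.get("name", "")),  # optional: tolerate absence
        available_seats=int(payload.get("available_seats", 0)),
    )


# An older response, and a newer one that has gained fields
# this client never asked for (hypothetical payloads).
old_response = json.loads('{"id": 42, "name": "Concert"}')
new_response = json.loads(
    '{"id": 42, "name": "Concert", "available_seats": 120, '
    '"venue": {"city": "Los Angeles"}, "presale": true}'
)

print(read_event(old_response))
print(read_event(new_response))
```

Because the reader never validates the payload against a full schema, the service owner can add `venue` or `presale` without coordinating a release with this client, which is what keeps independent deployment schedules workable.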
In the event of the dreaded, albeit rare, failure, the decoupled nature of SOA allows us to isolate the failure to a single feature without crippling the entire site.
With SOA, we now have hundreds of services evolving on independent schedules and being released to production constantly. It’s vital to gain a measure of confidence that a code change in one service does not break a dependent client. In a previous post, William Edmondson did a great job extolling the virtues of Continuous Integration (CI). In conjunction with Automated Testing, CI is the key here, as it allows us to rapidly exercise various test suites as builds progress through the pipeline. In addition to running test suites for their own service, test engineers for a service will also run their clients’ integration test suites to ensure the sanity of the latest build.
In addition to the CI toolset that William introduced, we also use Jenkins and Rundeck for continuous integration. On the testing side, we’ve started adopting Behavior Driven Development (BDD) via Cucumber for User Acceptance Testing, and Gatling for stress testing.
Testing doesn’t stop when code goes to production. For the truly paranoid developer, nothing works better to quell the anxiety of a production deployment than canary deployments, which typically involve deploying the new code base to a small fraction of production nodes for a period of time while monitoring for issues. Canary deployments help smoke out issues that would not have been caught in an integration environment.
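The canary approach described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than a description of our deployment tooling: the node names, the `NodeStats` type, the 10% canary fraction and the one-percentage-point error tolerance are all hypothetical, and a real rollout would compare more signals than error rate alone.

```python
import random
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Set


@dataclass
class NodeStats:
    """Request and error counts observed on one node during the canary window."""
    requests: int
    errors: int


def pick_canaries(nodes: Sequence[str], fraction: float,
                  seed: Optional[int] = None) -> Set[str]:
    """Choose a small random subset of production nodes to receive the new build."""
    rng = random.Random(seed)
    count = max(1, round(len(nodes) * fraction))  # always deploy to at least one node
    return set(rng.sample(list(nodes), count))


def error_rate(stats: Iterable[NodeStats]) -> float:
    """Aggregate error rate across a group of nodes."""
    stats = list(stats)
    total = sum(s.requests for s in stats)
    return sum(s.errors for s in stats) / total if total else 0.0


def canary_is_healthy(canary: Iterable[NodeStats],
                      baseline: Iterable[NodeStats],
                      tolerance: float = 0.01) -> bool:
    """Healthy if the canary error rate is within `tolerance` of the baseline."""
    return error_rate(canary) <= error_rate(baseline) + tolerance


nodes = [f"web{i:02d}" for i in range(20)]  # hypothetical node names
canaries = pick_canaries(nodes, fraction=0.10, seed=1)
print(sorted(canaries))

# Hypothetical measurements gathered while the canaries serve live traffic.
print(canary_is_healthy([NodeStats(1000, 5)], [NodeStats(9000, 40)]))
print(canary_is_healthy([NodeStats(1000, 50)], [NodeStats(9000, 40)]))
```

Comparing against the nodes still running the old build, rather than against a fixed number, keeps the check meaningful when overall traffic or error levels shift during the observation window.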
Understanding how your systems and applications are performing in production is essential for the DevOps model to be successful. A robust monitoring and alerting infrastructure becomes all the more critical with SOA as you now have multiple services that need to work together. At Ticketmaster, we use a wide suite of monitoring tools, but Metrilyx – a visualization application for time series data that we developed in-house – takes pride of place.
Metrilyx allows us to monitor system metrics like load, CPU usage, packet activity and IO; application metrics like garbage collection, queue sizes and SLAs; and even custom business metrics like the number of searches and orders. Metrilyx is now an open source project on GitHub and is layered on top of OpenTSDB. We use Zabbix to send out alerts when metrics exceed pre-determined thresholds.
Metrilyx has been invaluable to us in diagnosing production issues quickly and even in alerting us to potential problems before they occur.
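As a rough illustration of threshold-based alerting on time series data, the sketch below fires only when a metric stays above its limit for several consecutive samples, which is one common way to avoid alerting on a single spike. It is plain Python written for this post and does not use Zabbix’s or OpenTSDB’s actual APIs or trigger syntax; the `Threshold` type, the metric name and the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Threshold:
    """A pre-determined limit for one metric (hypothetical structure)."""
    metric: str
    limit: float
    consecutive: int = 3  # samples that must breach the limit before alerting


def should_alert(samples: Sequence[float], threshold: Threshold) -> bool:
    """Alert only if the most recent `consecutive` samples all exceed the limit."""
    recent = samples[-threshold.consecutive:]
    return (
        len(recent) == threshold.consecutive
        and all(value > threshold.limit for value in recent)
    )


queue_size = Threshold(metric="orders.queue.size", limit=500.0)

print(should_alert([120, 900, 130, 140], queue_size))  # a single spike
print(should_alert([120, 510, 640, 720], queue_size))  # sustained breach
```

Tuning `limit` and `consecutive` per metric is where the balance between missed problems and noisy alerts is struck.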
Taken together, the triumvirate of service oriented architecture, automated testing and monitoring has helped Ticketmaster continue to evolve into an efficient DevOps organization. A large number of service teams at Ticketmaster now follow the DevOps model and are deploying to production at the end of every sprint, if not multiple times per sprint. We have also experienced unexpected benefits because of our commitment to DevOps. For instance, the deeper insights into system behavior brought to light by our monitoring infrastructure have driven many improvements in our architecture.
And we are not done yet! A lot of the initial setup for pipeline automation, stress testing and monitoring is still manual. We are actively working to simplify a lot of those steps to further encourage developer adoption. Metrilyx as a tool is continuously improving. While we are always discovering new things to monitor and are developing additional libraries to support that, we are also still figuring out the right balance for our alerting. All in all, it’s an exciting time to be working for Ticketmaster, and we’re hiring. Come join us.