Ticketmaster and Button: A Mobile Commerce Experience

Ticketmaster is hard at work creating world-class mobile consumer experiences, but even when you are a major player in ticketing, finding engaged audiences with intent is always a challenge. We need to continually be there for our fans while they are exploring the things they love – bands, sports teams, theater – in best-of-breed publisher apps, and find a way to deliver them seamlessly into our own apps, which are iterating at what feels like warp speed.

In the last year alone, we have released 11 versions of our iOS app. During this time we have delivered: a Universal version (iPad compatibility), In-Venue Seat Upgrades, Apple Pay, Search Suggest, iCloud Account Sync, Seat/Section Preview, Sign-In after Offering, Accepting Transferred Tickets, Camera-Scanning Credit Cards, and iOS 9 App-Content Searching.

As a global market leader and incumbent in the space, we are continually finding better ways to iterate, test and evolve more quickly. This doesn’t always mean developing technology in-house. We frequently test and implement exciting new ideas in the marketplace through third parties. Through one partnership with Button, we get access to many other companies that are philosophically aligned with our own customer acquisition goals.

Fans use a variety of mobile apps to consume content like music, videos, news, or sports scores and stats. Most apps focus on a particular form of media, like music streaming, or a particular group of fans, like hockey fans. This leads to a fragmented mobile commerce marketplace, something we’re constantly thinking about and developing for. For example: how do we enable discovery of related content across many disparate apps?

Button provides a deep-linking connective tissue between these disparate apps. In fact, Button’s integration techniques are quite straightforward and easy to use. Here’s how it works:

Let’s say we have a Great Music App that provides some amazing music streaming services. That app may want to offer users a button that links to more content by a band. The content could be videos, news, or in our case: concert tickets. In the app, this “button” is created using the Button SDK as a subclass of UIControl that wraps the code for creating a deep link into an external app. Button also provides an Affiliate ID to make sure the originating app gets credit for any purchases made through the link.

There are a couple of tricky parts here. First, we need some kind of common name or ID for the artist to make sure we land the user in the right place in the linked app. Second, depending on whether the linked app is installed or not, your button could open either a link to the Apple App Store or a deep link directly into the external app.
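To make that second point concrete, here is a rough sketch of what the originating app’s side might look like. The URL scheme, query parameter and App Store link below are hypothetical stand-ins, not the actual Button SDK, which wraps this routing and attribution logic for you:

import UIKit

// Hypothetical deep link with an App Store fallback. The real Button SDK
// generates and routes this link (and handles attribution) for you.
func openTicketmasterArtistPage(artistID: String, affiliateID: String) {
    var components = URLComponents()
    components.scheme = "ticketmaster"   // assumed custom URL scheme
    components.host = "artist"
    components.path = "/" + artistID
    components.queryItems = [URLQueryItem(name: "affiliate_id", value: affiliateID)]
    guard let deepLink = components.url else { return }

    // canOpenURL requires the scheme to be listed under
    // LSApplicationQueriesSchemes in the calling app's Info.plist.
    if UIApplication.shared.canOpenURL(deepLink) {
        UIApplication.shared.open(deepLink)
    } else if let storeURL = URL(string: "https://apps.apple.com/app/id000000000") {
        UIApplication.shared.open(storeURL)   // placeholder App Store listing
    }
}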

The deep link opens the Ticketmaster app directly onto a page listing upcoming concerts by the specified artist. If the user then purchases tickets, the Order ID and Amount are sent to Button along with the original app’s Affiliate ID. It’s convenient for us and secure for the fan.

Music App Example:

button_flow

From the linked-app side, everything is very simple. The app only needs to handle two events:

  1. App has opened with a deep-link and an affiliate ID
  2. User has placed an Order (bought tickets)

We handle the app opening in the appDelegate. All we really need to do here is store the Affiliate ID for later. The Button SDK can help here:

button_ad_code
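The screenshot above shows the actual SDK call. Purely to illustrate what that one line accomplishes, a hand-rolled equivalent might look roughly like this (hypothetical query parameter and storage key, not Button’s API):

import UIKit

// Illustration only: in the app delegate, pull the affiliate ID off the
// incoming deep link and stash it for later attribution.
// The Button SDK does this in a single call.
func application(_ app: UIApplication,
                 open url: URL,
                 options: [UIApplication.OpenURLOptionsKey: Any]) -> Bool {
    let components = URLComponents(url: url, resolvingAgainstBaseURL: false)
    if let affiliateID = components?.queryItems?
        .first(where: { $0.name == "affiliate_id" })?.value {
        UserDefaults.standard.set(affiliateID, forKey: "pendingAffiliateID")
    }
    return true   // continue with normal deep link routing
}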

Next, when a purchase is completed, we look to see if we have a stored Affiliate ID and send it to Button along with the Order Number and Price. The Button SDK handles this for us as well:

button_order_code
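Again, the real SDK call shown above is a one-liner. Conceptually, the information being reported is just the stored Affiliate ID plus the order details (the names below are hypothetical, not Button’s API):

import Foundation

// Illustration only: after checkout, attach the stored affiliate ID to the
// order details so the originating app gets credit for the sale.
func reportCompletedOrder(orderID: String, totalInCents: Int) {
    let defaults = UserDefaults.standard
    guard let affiliateID = defaults.string(forKey: "pendingAffiliateID") else {
        return   // no referral to attribute
    }
    let payload: [String: Any] = [
        "order_id": orderID,
        "total": totalInCents,
        "currency": "USD",
        "affiliate_id": affiliateID
    ]
    print("Reporting order for attribution:", payload)   // the SDK posts this for us
    defaults.removeObject(forKey: "pendingAffiliateID")
}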

So two lines of code and done!

Observations:

Now, given how simple these operations are, you might question the need for the Button SDK at all. Button has already thought of this and also provided a simple network API that your app can call directly to get everything you need. The API is a little more code, but allows the transparency and flexibility needed to make sure Button integrates perfectly with your existing security and coding standards.

I feel like Button could do a little more to solve, on the originating-app side, those tricky parts I mentioned earlier, but for the linked app, their implementation couldn’t be simpler.

Button has provided an amazing first step toward solving the big problem of linking content across the diversity of media-rich apps found on mobile devices today. It is a glimpse of the future of the mobile commerce marketplace – better for the developer and better for the fan.

Experimenting With Efficiency

Have you ever felt bogged down by the weight of process? I’m experimenting with increasing efficiency and reducing workload on my team at Ticketmaster by applying lessons from Gene Kim’s book on devops – “The Phoenix Project”. By learning and implementing what the book calls, “The Three Ways”, we hope to drastically increase our productivity and quality of code, all while reducing our workload.

The First Way is defined as “understanding how to create fast flow of work as it moves from one work center to another.”1 A ‘work center’ can be either a team or an individual who has a hand in working as a part of a larger process. A major part of creating fast flow of work relies on improving the process of hand-offs between different teams. By working on improving visibility of the flow of work, one is able to both get a better understanding of the current workflow and identify which work centers act as bottlenecks.

The Second Way, “shortening and amplifying feedback loops, so we can fix quality at the source and avoid work”2 is about being able to understand and respond to the needs of internal and external customers. In order to shorten feedback loops, one should find ways to reduce the number of work centers or the number of steps it takes to complete a task (including but not limited to combining teams, removing steps altogether, or automating certain processes). The other part of the Second Way requires reducing work at the bottlenecks or otherwise finding ways to remove work from the system, so that the feedback for the work left in the system can be emphasized.

The Third Way is to “create a culture that simultaneously fosters experimentation, learning from failure, and understanding that repetition and practice are the prerequisites to mastery”3. Major components of the Third Way include allocating time for the improvement of daily work, introducing faults into the system to increase resilience, and creating rituals, such as code katas or fire drills, that can expose people to new ways of doing things, or help them master the current system.

In our workplace, we are working on applying the Three Ways to improve our daily lives. Currently, we have several teams from different geographical locations working on the same codebase. Initially, this led to many dependency conflicts, lots of tasks being blocked by other teams, and many communication issues.

We recently started using a kanban board in order to give us better visibility into our workflow, and have added a column on the board for every hand-off between teams. The focus is now on finding ways to reduce the wait time between columns. We have put together checklists in order to aid with communication and improve quality, so that wait times might be reduced. Simultaneously, we are working on ways to remove our reliance on other teams for things such as code review. There are still problems regarding story blockers, but it is hoped that these problems can be solved in the long run by either re-structuring the team’s responsibilities to match the system’s design, or vice versa.

kanban2

Figure 1. Our team’s kanban board

Applying the Three Ways is still a work in progress, but we are already seeing benefits. Whereas before, our product people were creating tasks faster than we developers could work on them, now they are scrambling to keep up with us. Although we still have to deal with stories from other teams acting as blockers, the flow of communication has greatly improved, and the decrease in wait time has been noticeable. Any time gained by our team is being used for “10% time”, which is time dedicated to either research or tasks that will help our team improve daily work and overall efficiency.


1. Gene Kim, The Phoenix Project, Page 89
2. Gene Kim, The Phoenix Project, Page 89
3. Gene Kim, The Phoenix Project, Page 90

Ticketmaster’s Interactive Seat Map Technology from Flash to the Future

If you have a mobile device or have read the news lately, you may have noticed that there are issues with browser plug-ins such as the Flash Player. Visiting a website developed with Flash can expose you to security issues if your plug-in is not up to date, and if the plug-in is disabled or not available, you experience reduced functionality.

Currently, the Interactive Seat Map (ISM) feature on our website, ticketmaster.com, is powered at its core by a Flash component. We are doing our best to continue to ensure a smooth and safe ticket buying experience in this rapidly changing environment, such as making sure click-to-play works against our current Flash ISM. At the same time, we are researching and developing multiple new rendering technologies – from building a JavaScript SVG and HTML5-compatible ISM, to an OpenGL ISM for use in native mobile applications, to server-side rendering technology. Further, these tools will give our clients access to customized seat maps for reports and to power our fan views when there is not a need for interaction.

Fans should begin seeing these improvements today on trial events through our mobile website, and in the coming months in our new responsive website, to which we are sending traffic for some events. Until then, I hope everyone who wants to pick their seat and get a ticket enjoys our distinctive ability to see and select not just verified tickets but exact seats using the ISM.

I know I do, as I used it to purchase three Chicago White Sox tickets two days before a game in July for a family outing. We were able to pick the seats we wanted in row two of the Chris Sale K-Zone section, and see Chris Sale beat Mark Buehrle in a two-hour game in perfect seats. What a great time!

ism

For help with the interactive seat map, please see our FAQ.


Brad Bensen is a Software Architect for the Inventory domain.

Symptom-Based Monitoring at Ticketmaster

monitoring_dash
When Rob Ewaschuk – a former SRE at Google – jotted down his philosophy on alerting, it resonated with us almost immediately. We had been trying to figure out our alerting strategy around our then relatively new Service-Oriented Architecture – the term microservices hadn’t quite entered the zeitgeist at the time.

It’s not that we didn’t have any alerting. In fact, we had too much – running the gamut from system alerts like high CPU and low memory to health check alerts. However, these weren’t doing the job for us. In a system that is properly load balanced, a single node having high CPU does not necessarily mean the customer is impacted. Moreover, in a service-oriented architecture, a single bad node in one service is extremely unlikely to result in a customer-impacting issue. It’s no surprise then that with all the alerting we had, we still ended up having multiple customer-impacting issues that were either detected too late or – even worse – by customer support calls.

Rob’s post hit the nail on the head with his differentiation of “symptom-based monitoring” vs “cause-based monitoring”:

I call this “symptom-based monitoring,” in contrast to “cause-based monitoring”. Do your users care if your MySQL servers are down? No, they care if their queries are failing. (Perhaps you’re cringing already, in love with your Nagios rules for MySQL servers? Your users don’t even know your MySQL servers exist!) Do your users care if a support (i.e. non-serving-path) binary is in a restart-loop? No, they care if their features are failing. Do they care if your data push is failing? No, they care about whether their results are fresh.

It was obvious to us that we had to change course and focus on the symptoms rather than the causes. We started by looking at what tools we had at our disposal to get symptom-based monitoring up and running as soon as possible. At the time, we were using Nimbus for alerting, OpenTSDB for time series data, and Splunk. Splunk is an industry leader for aggregating machine data – typically log files – and deriving business and operational intelligence from that data. We had always used Splunk for business analytics and for searching within logs while investigating production issues, but we had never effectively used Splunk for alerting us to those issues in the first place. For a symptom-based monitoring tool, Splunk now stood out as an obvious candidate for the following reasons:

  • Since Splunk aggregates logs from multiple nodes, it is possible to get a sense of the scale and scope of the issue.
  • It also allowed us to set up alerting based on our existing logs without requiring code changes. Though, over time, based on what we learnt, we did enhance our logging to enable additional alerts.

Since the objective was to alert on issues that impact the user, we started by identifying user flows that were of most importance to us, e.g., add to cart, place order, and add a payment method. For each flow, we then identified possible pain points like errors, latency and timeouts, and defined appropriate thresholds. Rob talks about alerting from the spout, indicating that the best place to set up alerts is from the client’s perspective in a client server architecture. For us, that was the front end web service and the API layer that our mobile apps talk to. We set up most of our symptom-based alerts in those layers.

When our symptom-based alerts first went live, we used a brand-spanking-new technology called email – we simply sent these alerts out to a wide distribution of engineering teams. Noisy alerts had to be quickly fine-tuned and fixed, since there is nothing worse than your alerts being treated as spam. Email worked surprisingly well for us as a first step. Engineers would respond to alerts and either investigate them themselves or escalate to other teams for resolution. It also had an unintended benefit: greater visibility among different teams into the problems in the system. But alerts by email only go so far – they don’t do well when issues occur outside of business hours, they are easy to miss amidst the deluge that can hit an inbox, and there is no reliable tracking.

We decided to use PagerDuty as our incident management platform. Setting up on-call schedules and escalation policies in PagerDuty was a breeze and our engineers took to it right away – rather unexpected for something meant to wake you up in the middle of the night. Going to email had allowed us to punt on a pesky conundrum – in a service-oriented architecture, who do you page? – but we now needed to solve that problem. For some issues, we can use the error code in the alert to determine which service team has to be paged. But other symptom-based alerts – for example, latency in add to cart – could be caused by any one of the services participating in that flow. We ended up with somewhat of a compromise: for each user flow, we identified a primary team and a secondary team based on which of the services had the most work in that flow. For example, for the add to cart flow, the Cart Service might be primary and the Inventory Service secondary. In PagerDuty, we then set up escalation policies that looked like this:

PagerDuty Escalation

Another key guideline – nay, rule that Rob calls out – is that pages must be actionable. An issue we’ve occasionally had is that we get a small spike of errors that is enough to trigger an alert but doesn’t continue to occur. These issues need to be tracked and looked into, but they don’t need the urgency of a page. This is another instance where we haven’t really found the best solution, but we found something that works for us. In Splunk, we set the trigger condition based on the rate of errors:

splunk-alert

The custom condition in the alert is set to:

stats count by date_minute|stats count|search count>=5

The “stats count by date_minute” tabulates the count of errors for each minute. The next “stats count” counts the number of rows in the previous table. And finally, since we’re looking at a 5-minute span, we trigger the alert when the number of rows is 5, implying that there was at least one error in each minute. This obviously does not work well for all use cases. If you know of other ways to determine whether an error is continuing, do let us know in the comments.
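For readers who do not speak Splunk’s query language, the same rule expressed as ordinary code looks roughly like this (a sketch only; the real check runs inside Splunk against the aggregated logs):

import Foundation

// Sketch of the "sustained error" rule: alert only if at least one error
// occurred in every one of the last `windowMinutes` minutes.
func shouldAlert(errorTimestamps: [Date], now: Date = Date(), windowMinutes: Int = 5) -> Bool {
    let calendar = Calendar.current
    let windowSeconds = Double(windowMinutes * 60)
    // Bucket recent errors by minute-of-hour, mirroring Splunk's date_minute field.
    let minutesWithErrors = Set(errorTimestamps.compactMap { timestamp -> Int? in
        let age = now.timeIntervalSince(timestamp)
        guard age >= 0 && age < windowSeconds else { return nil }
        return calendar.component(.minute, from: timestamp)
    })
    // Equivalent to: stats count by date_minute | stats count | search count>=5
    return minutesWithErrors.count >= windowMinutes
}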

This is just the beginning and we’re continuing to evolve our strategies based on what we learn in production. We still have work to do around improving tracking and accountability of our alerts. Being able to quickly detect the root cause once an alert fires is also something we need to get better at. Overall, our shift in focus to symptom-based alerting has paid dividends and has allowed us to detect issues and react faster, making the site more stable and providing a better experience for our fans. Doing this while ensuring that our developers don’t get woken up by noisy alerts also makes for happier developers.

Designing an API That Developers Love

It’s an exciting time at Ticketmaster. The company is growing and innovating faster than ever. We’re rolling out new products, most recently our client-facing Ticketmaster ONE, as well as experimenting with new concepts at a very high cadence.

A big part of that agility is attributed to our API (what’s an API?).

To meet the high demand for growth and innovation, and given the sheer size of our company, API development at Ticketmaster is distributed across many teams in various international locations. That makes it all the more important, albeit difficult, for us to speak the same language as we develop this critical capability. We’re at a point where we need principles and guidelines for developing a world-class API that delights both internal and external developers.

Yes, we will be opening up our APIs to the larger developer community soon. I know, I’m stoked too! More on that in a later post 🙂

API Design Principles

So in order to get our decentralized engineering team to build APIs that look and feel like they came out of the same company, we need to establish certain API design principles. If you dig deep into APIs with a strong and loyal developer following (e.g. Amazon, Stripe, Flickr, Edmunds), you’ll notice that they follow what I like to call the PIE principle: Predictable, Intuitive and Efficient APIs.

1. Predictable

They behave in a way that’s expected and do it in a consistent manner. No surprises. No Gotchas. Software is a repeatable process and a predictable API makes it easy to build software. Developers love that.

2. Intuitive

They have a simple and easy interface and deliver data that’s easy to understand. They are “as simple as possible, but not simpler,” to quote Einstein. This is critical for onboarding developers. If the API isn’t easy to use, they’ll move on to the competitor’s.

3. Efficient

They ask for the required input and deliver the expected output as fast as possible. Nothing more, nothing less.

These are APIs that make sense. That’s why they delight and engage developers. Documentation, code samples and SDKs are important, especially to external developers, but the real battle here is ensuring the API itself is as easy as PIE.

API Design Guidelines

To ensure our own API is PIE-compliant, we’ll need to address and reconcile the following areas across all our API development:

1. Root URL

This should be the easiest one to address. All Ticketmaster APIs should have the same root URL. Something like https://app.ticketmaster.com OR https://api.ticketmaster.com. One or the other.

// Good API
https://app.ticketmaster.com/endpoint1/
https://app.ticketmaster.com/endpoint2/
https://app.ticketmaster.com/endpoint3/
// Bad API
https://app.ticketmaster.com/endpoint1/
https://www.ticketmaster.com/api/endpoint2/
https://api.ticketmaster.com/endpoint3/

At a global company like ours, some could argue that we need a separate root URL per market (e.g. US, EU, AU). Logically, that makes sense. But from a developer experience perspective, it’s better to put the localization in the URI path, which is what we’ll discuss next.

2. URI Path

Agreeing on a URI path pattern is going to be one of the most critical decisions our team will have to make. This will heavily impact how predictable, intuitive and efficient our API is. For Ticketmaster, I think the following pattern makes sense:

/{localization}/{resource}/{version}/{identifiers}?[optional params]

localization: The market whose data we’re handling (e.g. us, eu, au)
resource: The domain whose data we’re handling (e.g. artists, leagues, teams, venues, events, commerce, search)
version: The version of the resource, NOT the API
identifiers: The required parameters needed to get a valid response from this API call
optional params: The optional parameters needed to filter or transform the response

I believe this pattern could help us create endpoints that make sense and are PIE-compliant. Here are some examples:

// sample endpoints
/us/commerce/v1/cart/create
/us/commerce/v1/ticket/22355050403
/us/artists/v1/taylor+swift
/au/artists/v1/all
/ae/events/v1/all
/us/leagues/v1/nfl/all

What matters here is not the URI pattern itself, but rather sticking to one pattern across all endpoints, which helps make the API predictable and intuitive for developers.
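One nice side effect of a single pattern is that endpoint construction becomes boilerplate on the client. A hypothetical Swift helper (illustrative only, not an official client library) might be:

import Foundation

// Every endpoint follows /{localization}/{resource}/{version}/{identifiers}?[optional params]
func endpointURL(localization: String,
                 resource: String,
                 version: String,
                 identifiers: [String],
                 parameters: [String: String] = [:]) -> URL? {
    var components = URLComponents(string: "https://app.ticketmaster.com")
    components?.path = "/" + ([localization, resource, version] + identifiers).joined(separator: "/")
    if !parameters.isEmpty {
        components?.queryItems = parameters.map { URLQueryItem(name: $0.key, value: $0.value) }
    }
    return components?.url
}

// endpointURL(localization: "us", resource: "leagues", version: "v1", identifiers: ["nfl", "all"])
// -> https://app.ticketmaster.com/us/leagues/v1/nfl/all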

3. HTTP Status Codes

The most important guideline for HTTP header usage in an API context is ensuring the API response status code is a) accurate, and b) matches the response body. This is key in making the API predictable to developers since status codes are the standard in communicating the status of the API response and whether or not a problem has occurred. The main status codes that need to be implemented are:

  • 200 OK
  • 201 CREATED
  • 204 NO CONTENT
  • 400 BAD REQUEST
  • 401 UNAUTHORIZED
  • 404 NOT FOUND
  • 500 INTERNAL SERVER ERROR

We might also want to define some custom status codes around API quota limits, etc. Whatever we end up deciding, we’ll make sure it’s consistent across all our endpoints.

4. Versioning

Versioning is essential to any growing API like ours. It’ll help us manage any backward incompatible changes to the API interface or response. Versioning should be used judiciously as a last resort when backward compatibility cannot be maintained. Here are some guidelines around versioning:

  • As mentioned earlier, make the API version part of the API URI path instead of the Header to make version upgrades explicit and to make debugging and API exploration easy for developers.
  • The API version will be defined in the URI path using the prefix ‘v’ with simple ordinal numbers, e.g. v1, v2.
  • Dot notation will not be used (no v1.1 or v1.2).
  • First deployment will be released as version v1 in the URI path.
  • Versions will be defined at the resource level, not at the API level.

Versioning eliminates the guessing game, making a developer’s life much easier.

5. Payload Spec

Another key area affecting PIE compliance is using a payload that developers can easily understand and parse. Luckily, JSON API offers a standard specification for building APIs in JSON:

If you’ve ever argued with your team about the way your JSON responses should be formatted, JSON API is your anti-bikeshedding weapon.

By following shared conventions, you can increase productivity, take advantage of generalized tooling, and focus on what matters: your application.

Clients built around JSON API are able to take advantage of its features around efficiently caching responses, sometimes eliminating network requests entirely.

Sold! JSON API is well supported with many client libraries, which is guaranteed to put a smile on any developer’s face. It did on mine 🙂

So what about XML? Are we going to support it? I personally think it’s time to say goodbye to XML. It’s verbose and hard to read, which makes it a major buzz kill for any developer. Also, XML is losing market share to JSON. It’s time. Goodbye, XML.

I’d like to call out a few things in the JSON API spec that we should pay close attention to:

5.1 Links and Pagination

A hypermedia API is discoverable and easy to program against, which in turn gets it closer to being PIE-compliant. The links spec in JSON API helps with that. For data collections, providing a standard mechanism to paginate through the result set is very important, and that’s also done via links.

A server MAY choose to limit the number of resources returned in a response to a subset (“page”) of the whole set available.

A server MAY provide links to traverse a paginated data set (“pagination links”).

Pagination links MUST appear in the links object that corresponds to a collection. To paginate the primary data, supply pagination links in the top-level links object. To paginate an included collection returned in a compound document, supply pagination links in the corresponding links object.

The following keys MUST be used for pagination links:

  • first: the first page of data
  • last: the last page of data
  • prev: the previous page of data
  • next: the next page of data

Keys MUST either be omitted or have a null value to indicate that a particular link is unavailable.

Concepts of order, as expressed in the naming of pagination links, MUST remain consistent with JSON API’s sorting rules.

The page query parameter is reserved for pagination. Servers and clients SHOULD use this key for pagination operations.
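For a client, that spec boils down to modeling the links object once and then walking it. A minimal Swift sketch, assuming the standard keys above and a hypothetical events collection:

import Foundation

// The pagination links object defined by the JSON API spec.
struct PaginationLinks: Decodable {
    let first: URL?
    let last: URL?
    let prev: URL?
    let next: URL?
}

// Hypothetical top-level shape of a paginated collection response.
struct EventCollectionPage: Decodable {
    struct Resource: Decodable {
        let id: String
        let type: String
    }
    let data: [Resource]
    let links: PaginationLinks
}

// Paging through a collection is then just "follow `next` until it is null".
func nextPageURL(from body: Data) throws -> URL? {
    try JSONDecoder().decode(EventCollectionPage.self, from: body).links.next
}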

5.2 Sorting

The spec on sorting is as follows: use the sort query parameter with fields separated by commas. All sorts are ascending by default unless the field is prefixed by “-”, in which case it’s descending.

// Examples of sort
/us/events/v1/all?sort=artist,-date
/us/artists/v1/323232/reviews?sort=-rating,date
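Reading that convention back off a request is a one-liner per field. A quick sketch:

// Parse a JSON API sort parameter such as "artist,-date" into
// (field, ascending) pairs; a leading "-" means descending.
func parseSort(_ raw: String) -> [(field: String, ascending: Bool)] {
    return raw.split(separator: ",").map { token in
        token.hasPrefix("-")
            ? (field: String(token.dropFirst()), ascending: false)
            : (field: String(token), ascending: true)
    }
}

// parseSort("artist,-date") -> [(field: "artist", ascending: true), (field: "date", ascending: false)]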

5.3 Filtering

Using filters to control the result set of the API response is a great way for us to deliver an efficient API to our developers. We’ll need to discuss our filtering strategy as a team before deciding on how to do it.

5.4 Error Handling

Eventually, things will go wrong. A timeout, a server error, data issues, you name it. Part of being a predictable API is communicating errors back to the developer with some actionable next steps. The error object spec in JSON API helps with that:

Error objects provide additional information about problems encountered while performing an operation. Error objects MUST be returned as an array keyed by errors in the top level of a JSON API document.

An error object MAY have the following members:

  • id: a unique identifier for this particular occurrence of the problem.
  • links: a links object containing the following members:
    • about: a link that leads to further details about this particular occurrence of the problem.
  • status: the HTTP status code applicable to this problem, expressed as a string value.
  • code: an application-specific error code, expressed as a string value.
  • title: a short, human-readable summary of the problem that SHOULD NOT change from occurrence to occurrence of the problem, except for purposes of localization.
  • detail: a human-readable explanation specific to this occurrence of the problem.
  • source: an object containing references to the source of the error, optionally including any of the following members:
    • pointer: a JSON Pointer [RFC6901] to the associated entity in the request document [e.g. "/data" for a primary data object, or "/data/attributes/title" for a specific attribute].
    • parameter: a string indicating which query parameter caused the error.
  • meta: a meta object containing non-standard meta-information about the error.
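On the client side, that translates into a small, reusable error model. A hedged Swift sketch of decoding the top-level errors array into something actionable:

import Foundation

// Minimal model of a JSON API error object (a subset of the members above).
struct APIError: Decodable {
    struct Source: Decodable {
        let pointer: String?
        let parameter: String?
    }
    let status: String?
    let code: String?
    let title: String?
    let detail: String?
    let source: Source?
}

struct APIErrorDocument: Decodable {
    let errors: [APIError]
}

// Turn an error response body into developer-readable messages.
func describe(errorBody: Data) -> [String] {
    guard let document = try? JSONDecoder().decode(APIErrorDocument.self, from: errorBody) else {
        return ["Unparseable error response"]
    }
    return document.errors.map { error in
        "[\(error.status ?? "?")] \(error.title ?? "Unknown error"): \(error.detail ?? "")"
    }
}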

6. Authentication

In our business, we’d always want to know exactly who is making API calls and getting our data. Therefore, solid and secure authentication is required to give anyone access to that data. The authorization standard in the marketplace today is OAuth 2.0. The trick here is making it dead simple for developers to get their access token so they can make API calls as quickly as possible.
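However the token is issued, using it should be just as simple. A sketch of an authenticated call (the endpoint is one of the hypothetical examples from earlier; the Authorization header is standard OAuth 2.0 bearer usage):

import Foundation

// Hypothetical authenticated request against one of the sample endpoints above.
func fetchEvents(accessToken: String,
                 completion: @escaping (Data?, Error?) -> Void) {
    guard let url = URL(string: "https://app.ticketmaster.com/us/events/v1/all") else { return }
    var request = URLRequest(url: url)
    request.setValue("Bearer \(accessToken)", forHTTPHeaderField: "Authorization")
    request.setValue("application/vnd.api+json", forHTTPHeaderField: "Accept")

    URLSession.shared.dataTask(with: request) { data, _, error in
        completion(data, error)
    }.resume()
}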

I believe those six API design guidelines will help us deliver Predictable, Intuitive and Efficient API capabilities for our teams and our developer community. I told you this was an exciting time at Ticketmaster 🙂

Your Feedback

We want you to get involved to help guide this process. Do you think we’re missing something? What are some of the APIs you love? Why do you love them? What are some of the APIs you’d expect us to deliver?

You can join us on this very exciting journey by subscribing to this blog. You can also follow us on Twitter, Facebook and Medium.

Happy Coding!

2015: Year of the Android

On the last night before employees left for Christmas break and the 2015 New Year, desks on the 10th floor of the Ticketmaster Hollywood office were littered with red plastic cups foaming with champagne and greasy slices of fresh Raffallo’s pizza, a Hollywood staple. The normally tepid office was buzzing with enthusiastic conversation and clapping from our Engineering, QA and Product teams, still high on adrenaline from last-minute bug crushing and testing.

The mobile team was celebrating an audacious feat of full-stack software engineering. In a span of only two months, the iOS team, along with their counterparts across many teams and offices at Ticketmaster, had successfully implemented Apple Pay as a new payment method within the iOS Ticketmaster app. This was a daunting and technically ambitious success story for a company with as many discrete payment and ticketing systems, not to mention business and legal requirements, as Ticketmaster. With the new version of the app being released before Christmas, Apple selected the Ticketmaster iOS app to be featured in the App Store as one of the first adopters of the new payment system.

Meanwhile, the Android team hadn’t released a new version in over six months. Buried under a mountain of technical debt, a backlog of style updates and new feature requirements, the Android app probably looked, from the outside, like it was falling even further behind its iOS brother. Internally, though, as the Android developers left the office for their vacation, there was reason for optimism within the development and QA teams. A single message was scribbled along the corner of the whiteboard in the Android development area:

“2015: Year of the Android”

Build Happiness

I moved to California to take up a Sr Android Developer position at Ticketmaster in September 2014.

I was excited about the great Los Angeles weather but I was also pleased to see the metrics associated with the application team I was joining. The metrics showed great year-over-year growth for both the Android application and mobile in general. More and more people were purchasing tickets using their mobile devices, ensuring that the role would have an immediate impact on the company’s short and long term strategy.

Fig_1-tm-stats

Since 2014 there was a 20% increase in mobile visits and a 35% increase in mobile ticket sales

Unfortunately this wasn’t the full story. For years the Android app had accumulated technical debt and there were other issues that required immediate attention before we could focus our attention on some of the exciting opportunities we had identified.

One of the growing philosophies at Ticketmaster is Lean: thinking of products in terms of continuous improvement. The idea can be concisely summarized with the sketch from Henrik Kniberg’s “Succeeding with Lean Software Development.”

Fig_2-mvp

Deliver usable products to allow learning to take place. Illustration by Henrik Kniberg.

To help visualize the Android Application in this manner, I liked to pretend the app was itself a concert venue. Just like its real world counterpart, the Android app was serving a growing line of consumers who were migrating from the desktop web application to the mobile app. The app was already serving thousands of people per day, but it had plenty of room to grow. It was this metaphor that made a lot of the problems jump into focus.

Fig_3-android-concert-venue

A metaphor for the Ticketmaster Android app and a small concert venue

Our user base was growing, but how were we handling this increased load?

Fig_4-low-security

Very little security or organization in the main entrance

The line of users looks a little bit unorganized. What exactly are they doing while they are waiting in line? It doesn’t seem like we are providing very much guidance. Was everyone coming to the mobile app from the same place? Were we funneling everyone to the right place? Also, the security sure seemed a little bit light at the entrance.

Fig_5-outdated-navigation

Users confused by outdated and ambiguous navigation elements

Once people actually got into our application (and our “venue” in this metaphor), they were often confused by ambiguous navigation and UI elements. The app was a lot more complex and confusing in terms of user experience than most of our users were expecting. A lot of this was due to legacy code and design: many designs from 2011 persisted in our 2014 app, despite plenty of design stories in the backlog.

Maybe this wasn’t such a great opportunity after all! How do we address all of these issues at once? It was going to be a long time before the Android team was ready to celebrate its own release.

The first key was to make building the app a source of joy for developers and the QA team. The 10-minute build times using Maven and an increasingly confusing set of dependencies had to go. Luckily, there was a new company whose motto was “build happiness.”

Gradle To the Rescue

gradle

The migration to Gradle provided challenges to both our development team and our QA team. But once the migration was made, everything was easier. Build times went from 10 minutes to under 2. With advice we got at the Gradle Summit (a conference hosted in Santa Clara this year), we were able to decrease our development build times to under 40 seconds using the Gradle daemon and incubating features like parallel builds and configure-on-demand.

Fig_6 slack_comment

An Android team member discusses the improvements to our build time over Slack

When I was flying home from the Gradle Summit, I used the offline and no-rebuild flags to keep using my most recently downloaded dependencies and get up-to-date builds while offline. Gradle plugins, such as the Jenkins plugin, allow synchronisation between your local Gradle build arguments and the ones you are using in Continuous Integration.

Gradle has been a star tool for Android development, with new training available at:
http://gradle.org/getting-started-android/ and https://www.udacity.com/course/gradle-for-android-and-java–ud867

Every Android developer needs to learn how to best take advantage of this awesome tool. Gradle is more than a build and dependency management tool; it is truly a Swiss Army knife, as we quickly learned this past year.

As an example, we had to figure out a way to allow our automated UI tests to bypass some of our security measures without risking leaking these bypass mechanisms into the production app. These complex requirements became easy once we approached them as Gradle tasks.

Track and manage Technical Debt

One of the lucky parts of starting at Ticketmaster is that I never felt like I was alone on an island as a developer. The QA team at Ticketmaster is the strongest I have ever worked with. They are engineers more than testers and their knowledge and experience with the Android app allowed me to grow into my role. The QA engineers rapidly built up our automated UI testing suite using Cucumber and Appium and added it into our continuous integration pipeline. Even though the app had a long way to go, it was definitely an aha moment when we got our first 100% passing UI tests report that was completed overnight.

Fig_7_1-automated_testing Fig_7_2_sonar

Automated testing and nightly reports

Our automated tests would immediately detect any problems with our ongoing development. But the UI tests that our QA team was building weren’t enough. They were great at catching new issues within our user experience, but they didn’t account for technical debt and insidious bugs within the code base.

Most software developers have heard of the testing pyramid, and with the Android QA team pulling its weight with Automated UI testing, the need for unit tests was even more glaring.

We found the first step to cutting down our technical debt was measuring it, and SonarQube provided a way for us to do this within our build pipeline. Sonar calculates a “technical debt” number using a combination of factors: unit and integration test coverage, Java logic issues and duplicate code.

Adding unit test coverage was not easy. We had to figure out the JaCoCo Gradle plugin so that we could calculate the coverage numbers and feed them into our Sonar console.

When we first got Sonar integrated into our build pipeline, the numbers were grim: over 700 days of technical debt for the Android app alone. At least we were measuring it now and had a starting point. From then on, conversations between developers about technical debt had a point of reference that we could come back to as we debated approaches toward implementing new features and bug fixes. Oftentimes there were opportunities and “low-hanging fruit.”

If a developer completed a new feature, we could immediately examine its impact on technical debt. Were we adding any new major java logic issues? Had we added any technical debt to our already large code base? Were there deprecated classes or features we no longer needed that could be tackled within this story?

In January we were able to reduce our technical debt estimate by 158 days and increase our unit test coverage by 20%. Part of this was easy; we had some unit and integration tests that hadn’t been measured by Sonar. Other gains were made by getting rid of deprecated code that was no longer needed.

Code Quality Summary (Jan 01- Jan 30)

Technical Debt – Reduced by 158 days
Unit Test Coverage – Increased by 19.4%
Major Issues – Reduced by 458

Fig_8-Sonar-stats

SonarQube measures delta improvement to unit test coverage

In February we ripped off another big chunk, 298 additional days of technical debt removed from the app. As we moved closer to releasing our new re-skinned app, we were also cutting down technical debt. This two in one combo of adding features and making the app easier to maintain allowed us to make quicker and quicker improvements for our users. Not to mention, this also improved our build times even further.

Security isn’t an Afterthought

For many months, our focus was on getting the “Android Design Update” out the door, and we were aiming for a February release. Unfortunately, there were security issues that forced us to change our priorities and release schedule.

Going back to our “Event Venue” metaphor, we were hearing more and more from our API team and our data science team that the Android App was getting abused by malicious users.

Fig_9-bots

Bots were grabbing some of the best seats in the Ticketmaster Android App

This was a major issue that we needed to address. In Android, the basics of security come from three places:

  • Secure coding practices
  • Code obfuscation and minification using Proguard
  • TLS network security

The Android app was using these practices already but it wasn’t doing enough. With a high profile application like Ticketmaster, malicious users are likely to go to extra effort to defeat traditional measures. We needed something extra.

Enter Dexguard. If you haven’t heard of Dexguard, it is an extra layer of security provided by the makers of Proguard but designed specifically for Android applications.

Dexguard allowed us to add not just one, but many additional layers of security.

Fig_10-dexguard

Dexguard added lots of new security features to the Ticketmaster Android App

Some additional security layers provided with Dexguard that were not possible with Proguard:

  • name obfuscation
  • string encryption
  • method reflection
  • tamper detection (at runtime)
  • environment checks (at runtime)
  • class encryption

When decompilation tools are run against the Android app, Dexguard makes reverse engineering much more difficult. Dexguard allowed us to push a lot of the bot traffic out of the Android app immediately.

Continuous Integration

The 6-month delay between releases was too long; if not for the security updates, we would have gone even longer without delivering improvements to fans. With Gradle, Sonar, and automated UI testing, we had the tools we needed to deliver constant updates without sacrificing code quality. With each commit, we trigger tests, our lint and Sonar code quality metrics, and our automated UI tests. All of these build steps are triggered by Gradle tasks.

Fig_11_2-CI-pipeline Fig_11_1-CI-pipeline

With a revamped CI pipeline with automated testing, new features can be added more reliably without incurring additional technical debt

After we released our Android Design Update in late February, we moved to biweekly updates. After going through only 2-3 updates (sometimes only for security reasons) in 2014, we have released more than 10 updates to the Android app since February. This is the reason the Android developers were secretly happy back in December: we knew we were close to putting it all together.

Think Big Picture

As an Android developer, it’s very easy to get sucked into the issues that affect you on a daily basis. How do I make this list update faster and make this UI consume less memory? What is the best way to create a clickable span within an expandable list? But it’s equally important to see the big picture: how will this app become what it is capable of being?

Fixing the process and the tracking was the most important step. Taking some time each week to think about the pipeline and how it fits into the larger picture was key. For mobile architecture, I strongly recommend using C4 diagrams, which allow developers to layer in additional detail for more sophisticated engineers.

When I am introducing a new member to the Android code base, I can start with a Context diagram and, once I feel they understand it, move on to the containers, components, and of course, our existing technical debt and long-term goals. Using C4 diagrams, it is easy to visually pinpoint which parts of the codebase are holding more technical debt. It’s a springboard to software fluency.

There will always be new unforeseen challenges. A few weeks after we released our Android Design Update, most bots moved to attack the iOS app. Other more advanced hackers found a new way to exploit our existing security, requiring us to display Google’s reCAPTCHA to all users attempting to purchase tickets on the Android app. Our users were not happy.

Working with our API team, we are going to implement new features to overcome this issue, and we will always have new challenges. With all the improvements that have happened to the Android app over the past year, it really validates the team’s work when we see feedback like this in our user reviews:

review

The scribble on the whiteboard was just an idea. A simple idea that our team could make the kind of improvements that had started to become part of our process, and that releases could just become routine.

And then it happened.


References:

Gradle: http://gradle.org
Sonar: http://www.sonarqube.org
Dexguard: https://www.guardsquare.com/dexguard
Jenkins: https://jenkins-ci.org
C4/Structurizr: https://www.structurizr.com/
JaCoCo Gradle Plugin: https://docs.gradle.org/current/userguide/jacoco_plugin.html

Jeff Kelsey is a Sr. Android Developer at Ticketmaster

What Ticketmaster is doing about technical debt

This post describes the journey Ticketmaster has been on over the last year to define and measure technical debt in its software platforms. The issue kept surfacing from multiple sources, and yet as an engineering organisation we had no consensus on how to define technical debt, let alone measure it or manage it. So we embarked on a journey to answer these questions and gain agreement across the engineering organisations, in order to provide a common approach to solving the problem.

We started with research and found that technical debt is all around us:

fowler-on-debt

A chilling example is that of Knight Capital, who ran updates on their systems that reactivated some dead code, causing the system to spit out incorrect trades – losing $460 million in under an hour (Source).

Ultimately debt management is a business decision – so how do we as IT professionals source and present the right data to influence the decision makers? Part of articulating the size of the problem was to compare the size of the Ticketmaster codebase to other codebases of a similar size:

tm-tech-debt

Over 4.5 million lines of code are spread across 13 different platforms, from legacy to greenfield, across a whole mix of technology stacks including different flavours of .Net and LAMP. We formed a working group with members from different platforms and locations in order to build a model that would work across all these boundaries and would have buy-in from all areas. Our research can be summarised as follows:

key-findings

We used these 3 different areas as the top level of categorisation for the following reasons:

Application Debt – Debt that resides in the software package; unchecked build up can lead to:

  • Inflexibility, making it much harder to modify existing features or add new ones
  • Poor user experience
  • Costly maintenance and servicing

Infrastructure Debt – Debt that resides in the operating environments; unchecked build up can lead to:

  • Exposure to security threats and compliance violations (e.g. PCI compliance)
  • Inability to scale and long queuing times for customers
  • Poor response and recovery times to outages
  • Costly maintenance and servicing

Architecture Debt – Debt that resides in the design of the entire system; unchecked build up can lead to:

  • Software platforms that are highly inflexible and costly to maintain and change
  • Flawed design that can’t meet the requirements of the business
  • Single points of failure which cause outages or exacerbate them
  • Unnecessarily complex systems that can’t be adequately managed
  • This gives opportunities for more nimble rivals to gain competitive advantage through technology

Model

Having selected the areas to measure, we needed a model that could be applied across the huge range of technologies used throughout TM’s platforms. Testing both automated and manual processes with various teams and tools helped to refine the model so it could be applied consistently:

tech-debt-model1

We aimed to collect as many of the metrics as possible via automated tooling, to make the process repeatable:

  • Application Debt:
  • Infrastructure Debt:

Manual Mapping

Where automated tooling didn’t exist to measure the metrics we identified, a manual process was introduced to help measure the intangible. We used the following guiding principle:

measure-anything

It was clear that a lot of what was known about the limitations of a platform or component wasn’t being reported by tooling, but was readily available in the minds of engineers. We just needed to figure out a consistent, repeatable and transparent process for extracting that data and making it useful. Out of this need was born the technical debt mapping process:

mapping

Reporting: Telling the story

At this point, we’re left with lots of data. One of the core driving philosophies behind the whole process was to present pertinent information to enable executive level decisions to be made as to how to manage debt from a strategic point of view.

need-to-know

For example, if debt remains stubbornly high for a critical application or area and is reducing developer throughput to a crawl, then a strategy of downing tools to address the debt may be the best option – but that decision can likely only be made at the executive level. Executives only really require the most pertinent information, but the data behind the summary also needs to be readily available if required. The report is divided into three parts:

parta

Part A (above) contains a rolled-up score for each of the three main debt categories for each of the systems being reported on. It also contains a summary of the debt in the platform that has been contributed by the Engineering, Tech-ops and Architecture groups.

partb

Part B (above) contains a more detailed breakdown of each main category’s definitions for each of the systems being reported on. This gives the reader better insight into which definitions have higher or lower debt levels and an indication of where work needs prioritising. It also shows the next items to be worked on.

partc

Part C (above) contains the longer-term technical debt backlog for each of the systems being reported on, broken down by category. There is no indication of time for each item, but some could span months or even years. This section is aimed more towards the Engineering and Architecture teams.

interpret-report

What do we do with the output?

Update Technical backlogs

  • Updated with new debt items as they are identified during the mapping processes
  • Technical debt items prioritised according to criticality, not level of debt – some components may contain a lot of debt but are stable, or no longer strategic

Update Product Roadmaps

  • Selected technical debt items prioritised against product roadmap items
  • Product teams need to be bought into the importance of maintenance work
  • Value needs to be clearly defined and communicated, in order to make the right strategic decisions for scheduling the maintenance work and managing the debt.

What next?

Management of technical debt is more than just about identifying and scheduling maintenance work. With the plan to issue the report quarterly, the intention is also that the visibility the report provides, plus the tooling provided to engineering teams will help to stem and reduce the introduction of technical debt. By adhering to industry best practices, and being conscious of the implications of introducing further debt, engineering teams can take steps to build quality into the products and platforms as they go. The debt that is introduced as projects progress can therefore be minimised, and product and engineering teams can make more informed decisions about how and when to pay debt down.

Ultimately the goal of each of us is to delight our fans. Running well maintained systems will benefit them with better stability, better response times and ultimately faster delivery of features as debt is brought under control, managed and interest rates reduced to a sensible level. Debt management will benefit the whole business in a similar way – less firefighting, fewer outages and platforms that are easier to develop. Ultimately, we all stand to gain.

Taking a Stand Against Overdesign

This post was co-authored by Jean-Philippe Grenier and Sylvain Gilbert. 

Have you ever spent countless hours arguing about how things should be done? Have you ever been in an analysis-paralysis scenario where you were going through so many use cases that you couldn’t figure out the architecture? Have you ever seen an architecture being built around a particular case which we may eventually want to support? Continue reading

The Case of the Recurring Network Timeout

This post was co-authored with Roshan Revankar

At Ticketmaster we’re passionate about monitoring our production systems. As a result, we occasionally come across interesting issues affecting our services that would otherwise go unnoticed. Unfortunately, monitoring only indicates the symptoms of what’s wrong and not necessarily the cause. Digging in deeper and getting to the root cause is a whole different ball game. This is one such example. Continue reading

Tools Shaping Culture

Ticketmaster’s Mark Maun recently presented at the Southern California Linux Expo on how great tools can actually be a driving factor for cultural change at scale. Ticketmaster’s DevOps culture has gone through transformative change largely through the use of open source tools. In Mark’s SCALE 13x presentation, Mark walks you through the motivations for change and shares examples of how great tooling has impacted Ticketmaster’s ability to increase product velocity and overall system reliability at scale. Mark’s presentation starts at 3:43:00.

You can see an expanded description of the presentation from the SCALE website here.

Lean Transformation

The Ticketmaster North America Ticketing Product and Technology teams (TM) are emerging from the minimum viable product stage of a transition to a Lean portfolio management and execution approach.  The process is bringing interesting and important changes to how TM determines what to work on and how we work. One thing is certain. This will undoubtedly be a validated learning experience. Continue reading

View from Section – Behind The Scenes

I’m a tennis fan and going to the US Open has become an annual tradition for me. So, I was more than excited when the US Open was among Ticketmaster’s first few events with the View from Section feature. Our aim as always is to provide a great live event experience to fans – allowing fans to check out the view from their seat before they buy tickets is just another part of that mission.

VfS-screenshot

Continue reading

Getting Over the Performance Hump with Apache Camel

This post goes over some of our findings in improving performance of one of our Apache Camel based web services. Over the past year, Ticketmaster has seen tremendous growth in our new product, TM+, of which our service is a key component. With the increase in traffic, we have been working hard to ensure acceptable response times for our customer-facing services. For the most part, our service has performed well, with a majority of requests completing within our desired duration. However, we had occasional requests which took longer to execute, at seemingly random intervals, so we set out to investigate and fix these worst-case performance spikes. Continue reading

Fear and Paranoia in a DevOps World

DevOps as a philosophy has been gaining momentum among startups and enterprises alike for its potential of delivering high quality features to the customer at a fast pace.  It’s no surprise that at Ticketmaster, we are championing the DevOps model so that we can deliver on our promise of enhancing the fan experience. The DevOps model implies rapid iteration on features and specifically, the focus of this post, frequent production deployments. Continue reading

Continuous Integration with TeamCity and Octopus

Why Continuous Integration

A software development practice where members of a team integrate their work frequently, usually each person integrates at least daily – leading to multiple integrations per day. Each integration is verified by an automated build to detect integration errors as quickly as possible.

~Fowler

I have found that most developers have a good intellectual understanding of the process of implementing Continuous Integration (CI) in their development practices. However, like Agile, unit testing, and unicorns, they view CI as a mythical creature that exists only in the rainbow blogs of Silicon Valley startups.

Part of this gap exists because of a fundamental misunderstanding of the value of CI. CI is more than simply a time-saving device meant to eliminate manual steps. CI’s value is actually a logical extension of capital ‘A’ Agile. Good CI allows developers to create software in a very fast build, deploy, test cycle and adjust direction based on observed outcomes. It is the “Empirical Process Control” my friend Larry Johnson regularly reminds me about. It is the ability to gather data early in the development process and change course quickly based on what we have learned.

Continuous delivery means minimizing lead time from idea to production and then feeding back to idea again

~Rolf Andrew Russell

Read on for a deep dive on how the Ticketmaster Resale team integrated Continuous Integration principles into their own development best practices… Continue reading

Announcing Metrilyx

A couple of months ago, we announced the availability of Ticketmaster’s visualization and analytics platform ‘Metrilyx’ as an open source offering at the Southern California Linux Expo.

We use Metrilyx internally at Ticketmaster as our engineering and operations dashboard platform with OpenTSDB as the data source. With Metrilyx it’s quick and easy to create dashboards for time series and performance data. Some of our more interesting graphs actually track business metrics like ticket orders per second. Metrilyx is incredibly flexible and supports several million data points per page. Our dashboards refresh multiple times per minute! Continue reading

The Tao of Ticketing

Our ticketing Inventory Service here at Ticketmaster searches for and reserves seats in an event. It exists exclusively for inventory management, unlike the classic Ticketmaster “Host”, which also handles reporting, user management, event programming, and more in addition to seat selection. The Inventory Service is arguably Ticketmaster’s core. It is the base layer underneath the entire Ticketmaster architecture. It is the singular part of our software stack that gives the stack its unique Ticketmaster identity.

It is also a seeming anachronism in a service-oriented architecture. The Inventory Core (the lowest-level component) is written in C++ with assembly bits for hyper-critical sections. It communicates via Google Protocol Buffers, doesn’t scale, and can’t be distributed. Why?

Ticketing is a complex and unique computer science challenge that requires the equally unique solution we’ve developed. How do we allocate multiple high-demand, spatially aware, contiguous seats in a fair but ultimately performance-oriented manner? Continue reading

Access Control Testing with Calabash

The JAC Scanner team is developing Android-based access control software to scan tickets for entry or exit at venues. The JAC Scanner application is designed to run on any Android cell phone using the camera to scan tickets, and it will eventually support dedicated scanning devices such as those available from Janam and Motorola. The application is being developed in C# using Xamarin. It has a hybrid user interface, with all controls contained in a web view object. We chose Xamarin and the hybrid interface with an eye towards more easily porting the application to iOS in the future. Continue reading

Powering Life’s Experiences

Welcome to the Ticketmaster technology blog. Each month, we will publish an entry or two describing technical challenges at Ticketmaster and how we go about solving them. We are building the ticketing platform for the future that will define the industry and look forward to sharing our journey with you.  Your comments and suggestions for posts are welcomed. Continue reading