Ticketmaster’s Interactive Seat Map Technology: From Flash to the Future

If you have a mobile device or have read the news lately, you may have noticed that browser plug-ins such as the Flash Player have been having issues. Visiting a website built with Flash can expose you to security problems if your plug-in is not up to date, and if the plug-in is disabled or unavailable, you may experience reduced functionality.

Currently, the Interactive Seat Map (ISM) feature on our website, ticketmaster.com, is powered at its core by a Flash component. We are doing our best to continue to ensure a smooth and safe ticket-buying experience in this rapidly changing environment, for example by making sure click-to-play works with our current Flash ISM. At the same time, we are researching and developing multiple new rendering technologies, from building a JavaScript SVG and HTML5-compatible ISM, to an OpenGL ISM for use in native mobile applications, to server-side rendering technology. These tools will also give our clients access to customized seat maps for reports and will power our fan views when there is no need for interaction.

Fans should begin seeing these improvements today on trial events through our mobile website, and in the coming months on our new responsive website, to which we are already sending traffic for some events. Until then, I hope everyone who wants to pick their seat and get a ticket enjoys the distinctive feature the ISM gives us: the ability to see and select not only verified but exact seats.

I know I do: I used it to purchase three Chicago White Sox tickets two days before a game in July for a family outing. We were able to pick the seats we wanted in row two of the Chris Sale K-Zone section and watch Chris Sale beat Mark Buehrle in a two-hour game from perfect seats. What a great time!

[Image: the Interactive Seat Map]

For help with the interactive seat map, please see our FAQ.


Brad Bensen is a Software Architect for the Inventory domain.

Symptom-Based Monitoring at Ticketmaster

[Image: monitoring dashboard]
When Rob Ewaschuk – a former SRE at Google – jotted down his philosophy on alerting, it resonated with us almost immediately. We had been trying to figure out our alerting strategy around our then relatively new Service-Oriented Architecture – the term microservices hadn’t quite entered the zeitgeist at the time.

It’s not that we didn’t have any alerting. In fact, we had too many alerts, running the gamut from system alerts like high CPU and low memory to health-check alerts. However, these weren’t doing the job for us. In a system that is properly load balanced, a single node with high CPU does not necessarily mean the customer is impacted. Moreover, in a service-oriented architecture, a single bad node in one service is extremely unlikely to result in a customer-impacting issue. It’s no surprise, then, that with all the alerting we had, we still ended up with multiple customer-impacting issues that were either detected too late or, even worse, detected through customer support calls.

Rob’s post hit the nail on the head with his differentiation of “symptom-based monitoring” vs “cause-based monitoring”:

I call this “symptom-based monitoring,” in contrast to “cause-based monitoring”. Do your users care if your MySQL servers are down? No, they care if their queries are failing. (Perhaps you’re cringing already, in love with your Nagios rules for MySQL servers? Your users don’t even know your MySQL servers exist!) Do your users care if a support (i.e. non-serving-path) binary is in a restart-loop? No, they care if their features are failing. Do they care if your data push is failing? No, they care about whether their results are fresh.

It was obvious to us that we had to change course and focus on the symptoms rather than the causes. We started by looking at what tools we had at our disposal to get symptom-based monitoring up and running as soon as possible. At the time, we were using Nimbus for alerting and OpenTSDB for time-series data, and then we had Splunk. Splunk is an industry leader for aggregating machine data (typically log files) and deriving business and operational intelligence from that data. We had always used Splunk for business analytics and for searching logs while investigating production issues, but we had never effectively used it to alert us to those issues in the first place. As a symptom-based monitoring tool, Splunk now stood out as an obvious candidate for the following reasons:

  • Since Splunk aggregates logs from multiple nodes, it is possible to get a sense of the scale and scope of the issue.
  • It also allowed us to set up alerting based on our existing logs without requiring code changes (though, over time, based on what we learnt, we did enhance our logging to enable additional alerts).

Since the objective was to alert on issues that impact the user, we started by identifying the user flows that matter most to us, e.g., add to cart, place order, and add a payment method. For each flow, we then identified possible pain points like errors, latency, and timeouts, and defined appropriate thresholds. Rob talks about alerting from the spout, meaning that in a client-server architecture the best place to set up alerts is from the client’s perspective. For us, that was the front-end web service and the API layer that our mobile apps talk to. We set up most of our symptom-based alerts in those layers.
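As a concrete illustration, a symptom-based alert for the place-order flow might be backed by a Splunk search along these lines. This is only a sketch: the index, sourcetype, field names, and threshold below are placeholders rather than our actual configuration. The idea is to count recent errors for the flow at the front-end layer and, at the same time, see how many nodes they span.

index=frontend sourcetype=access_combined uri_path="/checkout/order" status>=500 earliest=-5m@m latest=@m
| stats count as errors, dc(host) as affected_nodes
| where errors >= 20

An alert built on a search like this fires only when the error count for the flow crosses the threshold within the window, no matter which individual node is misbehaving, which keeps the focus on the symptom rather than the cause. The dc(host) column also gives an immediate sense of the scale and scope of the issue.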

When our symptom-based alerts first went live, we used a brand-spanking-new technology called email: we simply sent the alerts out to a wide distribution of engineering teams. Noisy alerts had to be fine-tuned and fixed quickly, since there is nothing worse than your alerts being treated as spam. Email worked surprisingly well for us as a first step. Engineers would respond to alerts and either investigate them themselves or escalate to other teams for resolution. It also had an unintended benefit: greater visibility across teams into problems in the system. But alerting by email only goes so far: it doesn’t work well when issues occur outside of business hours, alerts are easy to miss amidst the deluge that can hit an inbox, and there is no reliable tracking.

We decided to use PagerDuty as our incident management platform. Setting up on-call schedules and escalation policies in PagerDuty was a breeze, and our engineers took to it right away, which is rather unexpected for something meant to wake you up in the middle of the night. Alerting by email had allowed us to punt on a pesky conundrum: in a service-oriented architecture, who do you page? Now we needed to solve that problem. For some issues, we can use the error code carried in the alert to determine which service team has to be paged (a sketch of that follows below). But other symptom-based alerts, for example latency in add to cart, could be caused by any one of the services participating in that flow. We ended up with somewhat of a compromise: for each user flow, we identified a primary team and a secondary team based on which of the services did the most work in that flow. For example, for the add to cart flow, the Cart Service might be primary and the Inventory Service secondary. In PagerDuty, we then set up escalation policies that looked like this:

[Image: PagerDuty escalation policy]
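Coming back to the error-code case for a moment: when a flow’s errors carry a meaningful code, the alert search itself can surface it so the page lands with the right service team. Here is a sketch, again with a placeholder index, sourcetype, and threshold; error_code is a hypothetical field name, not one of our real ones.

index=api sourcetype=api_logs uri_path="/cart/add" status>=500 earliest=-5m@m latest=@m
| stats count by error_code
| where count >= 10

Grouping by the error code means the alert results already indicate which error, and therefore which service team, is involved, so the routing decision is mechanical rather than a judgment call in the middle of the night.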

Another key guideline (nay, rule) that Rob calls out is that pages must be actionable. An issue we’ve occasionally had is a small spike of errors that is enough to trigger an alert but doesn’t continue to occur. These issues need to be tracked and looked into, but they don’t need the urgency of a page. This is another area where we haven’t really found the best solution, but we have found something that works for us. In Splunk, we set the alert’s trigger condition based on whether the errors persist over the alert window:

[Screenshot: Splunk alert trigger configuration]

The custom condition in the alert is set to:

stats count by date_minute|stats count|search count>=5

The “stats count by date_minute” clause tabulates the count of errors for each minute. The next “stats count” counts the number of rows in the resulting table. Finally, since we’re looking at a 5-minute span, we trigger the alert when the number of rows is 5, implying that there was at least one error in each minute. This obviously does not work well for all use cases. If you know of other ways to determine whether an error is continuing, do let us know in the comments.
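For what it’s worth, the same logic can also be written as a single, self-contained search, since a custom trigger condition is effectively a secondary search run over the results of the base search. The base search below is a placeholder (index, sourcetype, and fields are made up for illustration); only the last three clauses mirror the condition above.

index=frontend sourcetype=access_combined uri_path="/cart/add" status>=500 earliest=-5m@m latest=@m
| stats count by date_minute
| stats count
| search count>=5

If this returns a row, there was at least one error in every minute of the 5-minute window and the alert fires; a short blip that clears after a minute or two never makes it past the final clause.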

This is just the beginning, and we’re continuing to evolve our strategies based on what we learn in production. We still have work to do around improving the tracking and accountability of our alerts. Being able to quickly find the root cause once an alert fires is also something we need to get better at. Overall, our shift in focus to symptom-based alerting has paid dividends and has allowed us to detect issues and react faster, making the site more stable and providing a better experience for our fans. Doing all of this while ensuring that our developers don’t get woken up by noisy alerts also makes for happier developers.