Last week at Velocity, we had the pleasure of connecting with lots of old friends, and thousands of new ones. As a long-time sponsor, we value being a part of the Web Performance and Operations community’s most focused gathering. Sure, it gave us an opportunity to sales-pitch you. (Well, at least up until the fire alarm went off.) But more importantly, we got to learn how you in the web ops community continue to innovate and hear how our tools can help keep your websites awesome.
One of the hot topics of conversation around our booth was our new RESTful API. The beauty of any modern API is the ease with which you can adapt a service to solve unique problems. Customers are currently using the Keynote API to quickly and easily integrate their Keynote data with dashboard applications and other monitoring tool streams to do some pretty interesting stuff. We recently spoke with Velocity attendee Christian Jorgensen about his use of the Keynote API. Christian is responsible for ensuring high availability for a portfolio of very large websites and services. The work he’s doing in monitoring and alerting is cutting-edge.
Question: What are some of the challenges you’re dealing with in managing and monitoring your environment?
Answer: “Our primary KPI is Mean Time to Mitigate, or Mean Time to Resolve as most are familiar with it. We require a large amount of monitoring and the monitoring has to be sensitive enough so that we can either be predictive or quickly responsive to live site impacting outages or customer impacting events. (But) we are an extremely noisy team from an incident perspective. I think last month, we had over 8,000 incident tickets that were logged inside of our team. So, that's a real challenge, right?”
“Every packet, every bit of information, every HTTP status code is logged into a system and is presented in real time as an available signal. There is so much of this volume that we obviously can't use it and it's difficult to aggregate. We've been using new approaches like complex event processing, which allows us to aggregate data and actually look at patterns and themes, but still it's too much volume.”
“One of our largest sources (of incidents) today is the set generated by our CDN due to files that are recently uploaded and may be invalidated. A really stubborn one is ads and analytics. So, you'll have a large number of timeouts and 404s and whatnot due to third party advertisers. These are scenarios that do not require a human response.”
“What Keynote has allowed us to do with the new API is to take in the external signals, compare them to our internal sources and corroborate the two to identify the impact of an incident and, of course, whether that incident is real.”
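To make the corroboration idea concrete, here is a minimal sketch in Python. It is not Christian's implementation and it does not call the Keynote API; the `Signal` record, the `corroborated_services` function, and the five-minute window are all hypothetical names and values chosen for illustration. The assumption is simply that external and internal failures have already been collected into one list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """Hypothetical record for one monitoring observation."""
    source: str       # "external" (synthetic test) or "internal" (own telemetry)
    service: str      # name of the affected service
    timestamp: float  # seconds since epoch
    failed: bool      # True if the observation was a failure


def corroborated_services(signals, window_seconds=300.0):
    """Return services with an external AND an internal failure close in time.

    A failure seen by only one side is treated as unconfirmed noise.
    """
    external = [s for s in signals if s.source == "external" and s.failed]
    internal = [s for s in signals if s.source == "internal" and s.failed]
    confirmed = set()
    for ext in external:
        for inn in internal:
            same_service = ext.service == inn.service
            close_in_time = abs(ext.timestamp - inn.timestamp) <= window_seconds
            if same_service and close_in_time:
                confirmed.add(ext.service)
    return confirmed
```

In a sketch like this, only services in the returned set would be escalated to a human; everything else could be logged or auto-closed.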
Answer: “With the graphing API, for example, we have the capability to go in and not only look at, ‘Okay, we've met a failure condition in our testing,’ but we know exactly from the waterfall what the problem was and we can build automation that can capture the correct teams or products that are responsible for those failures and we can quickly through automation route those tickets to the correct teams so that they can quickly engage. It's basically taking the human component of analyzing a signal that Keynote provides and making a decision off that. Well, that's something that you can train a robot to do.”
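The routing step Christian describes could look something like the following Python sketch. The shape of the waterfall data (a list of dictionaries with `url` and `status` keys), the `owners` mapping from hostname to team, and the `route_failures` function are assumptions made for this example; they are not taken from the Keynote API or from his system.

```python
from collections import defaultdict
from urllib.parse import urlparse


def route_failures(waterfall, owners, default_team="triage"):
    """Group failed waterfall requests by the team that owns the hostname.

    waterfall: list of dicts with "url" and "status" (None means a timeout).
    owners:    dict mapping hostname -> team name (hypothetical ownership map).
    """
    tickets = defaultdict(list)
    for entry in waterfall:
        status = entry.get("status")
        # Treat anything below 400 as a successful request and skip it.
        if status is not None and status < 400:
            continue
        host = urlparse(entry["url"]).hostname or ""
        tickets[owners.get(host, default_team)].append(entry["url"])
    return dict(tickets)
```

Each key in the result would become one ticket, so a single failed test can fan out to several teams without a person reading the waterfall first.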
Question: You mentioned advertising being a stubborn originator of incidents; could you elaborate on what you do with those issues?
Answer: “Oh, yeah. Absolutely. It's a huge blind spot about the impact to page performance and overall advertiser accountability, especially in third party scenarios, which is a pretty complex scenario.”
“What we've been able to do is use the data to actually demonstrate very clearly not only what the impact is from a performance perspective, but we're also able to actually specifically call out which ads and which advertisers are generating the most noise. And those are ones that we actually have a monthly rhythm of business for now, where we meet regularly with the advertising leadership. We're basically the data provider to measure improvement in that space as well. That's probably why we're not so popular. We have an automated bug system that will write bugs to them every time we encounter a large number of what we call advertising defects.”
“Now with the new API we're able to look at the data in real time and assign it not just for attention, and not just from an improvement perspective, but as an actual incident response. There's a dedicated advertising team that fields those incidents and corrects campaigns or errors with ads in real time. And we were never able to provide a signal for that and they had never had one (previously).”
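As a rough illustration of calling out "which advertisers are generating the most noise," the Python sketch below counts defects per advertiser and keeps those above a threshold. The defect records, the `advertiser` field, the `noisiest_advertisers` function, and the threshold of three are all hypothetical; the interview does not describe how the real system counts or files bugs.

```python
from collections import Counter


def noisiest_advertisers(defects, min_defects=3):
    """Return (advertiser, count) pairs at or above a threshold, noisiest first.

    defects: list of dicts, each with an "advertiser" key (assumed format).
    """
    counts = Counter(d["advertiser"] for d in defects)
    return [(name, n) for name, n in counts.most_common() if n >= min_defects]
```

The same list could feed both uses mentioned above: a monthly report for advertising leadership and, when run over a short recent window, a trigger for a real-time incident.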
These are just some of the ways that Christian has innovated his Web operations through automation with tools like Keynote. You can take advantage of the Keynote API too, even if you’re not a customer. Check out our recent post about accessing the Keynote Business 40 index programmatically. This is a great way to build third-party, objective performance references against your monitoring data and corroborate availability issues (externally) in real time. The data is free, so let us know how you use it!