This week I interviewed Herman Ng, a senior engineer on the Keynote Mobile technical team. I wanted to get his thoughts on some of the most common types of problems we see in our customers’ SMS short code measurements. Here is the exchange:
Tony: Herman, you have access to all of the SMS short code measurements we are collecting for all of Keynote's customers. What are a few of the biggest common problems you see in those SMS measurements?
Herman: When we analyze the performance and availability of our customers' short codes, there seem to be two recurring problems that appear in many of the measurements. One, many measurements experience inconsistent performance, and two, a significant percentage of the messages fail to be delivered at all, or at least not within a “reasonable” time period.
Tony: Is that a problem with the carrier, aggregator, or back-end application?
Herman: Depending on the customer, it can be a combination of all three. Often we notice that a certain short code starts to perform badly when measured simultaneously across multiple operator networks and locations. Because the problem is happening across different networks, there is a strong indication that the root cause lies with the aggregator or the application rather than with any one carrier.
Tony: How would you distinguish between a problem with the aggregator and the customer's application?
Herman: Often we will just submit the "Help" keyword to the short code and measure the average response time across many samples. Since "Help" usually does not trigger complex application logic, performance problems in this scenario tend to point to glitches in the processing by the aggregator. We also examine multiple short codes served by the same aggregator. If performance problems show up across multiple short codes from the same aggregator, then the problem is probably not related to the application of one single customer. If the problem also happens across multiple operators, then a single carrier is unlikely to be the cause, and all that is left is the aggregator.
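As a rough illustration of the elimination reasoning Herman describes, the sketch below classifies "Help" measurements for one aggregator. It is only a minimal sketch under stated assumptions: the `Sample` record, the `likely_culprit` function, the 60-second threshold, and the returned labels are all hypothetical and do not represent Keynote's actual tooling.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    """One hypothetical short code measurement."""
    short_code: str
    aggregator: str
    operator: str
    keyword: str
    response_seconds: float


def slow_groups(samples, key, threshold_seconds):
    """Return the key values whose average HELP response time exceeds the threshold."""
    groups = {}
    for s in samples:
        if s.keyword.upper() == "HELP":
            groups.setdefault(key(s), []).append(s.response_seconds)
    return {k for k, times in groups.items() if mean(times) > threshold_seconds}


def likely_culprit(samples, aggregator, threshold_seconds=60.0):
    """Apply the elimination heuristic to the HELP samples of one aggregator.

    The threshold is an assumed value, not a figure from the interview.
    """
    own = [s for s in samples if s.aggregator == aggregator]
    measured_ops = {s.operator for s in own if s.keyword.upper() == "HELP"}
    slow_ops = slow_groups(own, lambda s: s.operator, threshold_seconds)
    slow_codes = slow_groups(own, lambda s: s.short_code, threshold_seconds)

    if not slow_codes:
        return "no HELP slowdown observed"
    # Slow on several operators AND several short codes: neither one carrier
    # nor one customer's application explains it, which leaves the aggregator.
    if len(slow_ops) > 1 and len(slow_codes) > 1:
        return "aggregator"
    # Slow on exactly one operator while others look healthy: suspect the carrier.
    if len(slow_ops) == 1 and len(measured_ops) > 1:
        return "operator"
    return "inconclusive"
```

The function only narrows down where to look first; a real diagnosis would also need sample counts, time windows, and non-"Help" keywords to implicate the application.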
Tony: Earlier you said that sometimes there is a pattern where the short code does not respond in a "reasonable" time period. What do you mean by that?
Herman: If you have a short code meant to return the balance of your credit card, you are probably "mobile," meaning you are not at home; you may be standing in a store. A quick response is therefore essential for the service to be of value. If the balance response comes back 30 minutes later, it has no value in this “real-time” situation. For each SMS service that we monitor, we work with our customer to establish a reasonable SLA on how long the response should take for their particular service. The data we provide then shows them how often their service meets that target SLA.
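The "how often the service meets the target SLA" figure can be sketched as a simple ratio. This is a minimal sketch, assuming each measurement is recorded as a response time in seconds, with `None` standing in for a message that never arrived; the function name `sla_compliance` and the 120-second target are hypothetical.

```python
def sla_compliance(response_seconds, target_seconds):
    """Return the fraction of measurements answered within the SLA target.

    A value of None means the response was never delivered, which counts
    as a miss.
    """
    if not response_seconds:
        raise ValueError("at least one measurement is required")
    met = sum(
        1 for r in response_seconds
        if r is not None and r <= target_seconds
    )
    return met / len(response_seconds)


# Four samples: two quick replies, one 30-minute reply, one never delivered.
# Two of the four meet a 120-second target, so the ratio is 0.5.
ratio = sla_compliance([12.0, 45.0, 1800.0, None], target_seconds=120.0)
```

Counting undelivered messages as misses keeps availability failures from silently inflating the compliance figure.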
Tony: Doesn’t the owner of a short code get performance and availability reports from their aggregator?
Herman: Yes, but those reports are from the aggregator’s point of view. If a series of messages is delivered after 30 minutes or more, that might be fine from the aggregator’s standpoint, because everything is working 100% from that perspective. But for the owner of an application trying to deliver an account balance within a few minutes, a 30-minute delivery diminishes the value of the information to almost zero. Keynote collects information from the end-user perspective, which is a different view of the data that the aggregator does not provide.
Tony: When a short code seems to be "slow," how does a customer determine how well their service is working compared with other similar SMS services?
Herman: There are several ways that a customer can "benchmark" their SMS service. One easy way is for the customer to provide Keynote with an alternate short code that we measure at the same time as their primary code. Our reports can then show the side-by-side performance of the customer's code and their competitor's. We also occasionally publish statistics from our internal analysis of all of the measurements we collect. We don't reveal the identity or source of each measurement, but we aggregate the raw values to produce an average ranking. A few measurements are from our customers' real data; others are from measurements that we sponsor internally. All of our customers can then compare their measurements against the average for the sample group we are collecting. I have provided a recent example in the attached graph.
Tony: It looks like the chart shows the average performance and availability for all the short codes in the Keynote sample group. Is that right?
Herman: Yes, a customer can take their measurements and compare them to the performance of the sample group. If they are far from the average, that might be one motivating factor for investigating how things can be improved.
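The comparison against the sample group can be sketched in a few lines. This is only an illustration under assumptions: measurements are taken to be response times in seconds with `None` for undelivered messages, and the names `summarize` and `compare_to_group` are hypothetical, not part of any Keynote report format.

```python
from statistics import mean


def summarize(response_seconds):
    """Return (average response time of delivered messages, availability)."""
    if not response_seconds:
        raise ValueError("at least one measurement is required")
    delivered = [r for r in response_seconds if r is not None]
    availability = len(delivered) / len(response_seconds)
    average = mean(delivered) if delivered else None
    return average, availability


def compare_to_group(customer, group):
    """Report how far a customer's short code sits from the sample group."""
    c_avg, c_avail = summarize(customer)
    g_avg, g_avail = summarize(group)
    delta = None if c_avg is None or g_avg is None else c_avg - g_avg
    return {
        # Positive means the customer is slower than the group average.
        "response_delta_seconds": delta,
        # Negative means fewer of the customer's messages were delivered.
        "availability_delta": c_avail - g_avail,
    }
```

A positive response delta combined with a negative availability delta would be the "way off the average" case that suggests room for improvement.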
Tony: Thanks, Herman. Maybe in the future we can do another session like this but focus on mobile internet and WAP-based services?
Herman: That is a great idea. There are many common problems that we also see in our customers' browser-based mobile measurements.


