By Ian Withrow
In my last post on the business impact of Web Performance Monitoring (WPM) I looked at how WPM helps the bottom line by letting you know when something breaks. This time I'm going looking for trouble, to see how WPM can be used to proactively improve business performance. To do this I'm going to bring in another topic I wrote about earlier: the distribution of WPM data, and how it's not in fact normal. Specifically, I'm going to look at performance data that looks good in aggregate and see whether there is improvement worth pursuing. Here's a spoiler: because the data isn't normally distributed, relying on metrics like mean response time leaves plenty of opportunity for improvement on the table.
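To make that spoiler concrete, here's a minimal sketch in Python, using made-up numbers, of two sample sets with identical means but very different tails. The mean alone can't tell a consistently fine experience apart from one that badly fails a slice of users:

```python
import statistics

# Two hypothetical sets of page-load samples (seconds) with the same
# mean but very different tails: the mean alone can't distinguish them.
steady = [2.0, 2.1, 1.9, 2.0, 2.1, 1.9, 2.0, 2.0]
skewed = [1.2, 1.3, 1.2, 1.4, 1.3, 1.2, 1.3, 7.1]

for name, data in (("steady", steady), ("skewed", skewed)):
    q = statistics.quantiles(sorted(data), n=20)  # cut points in 5% steps
    print(f"{name}: mean={statistics.mean(data):.2f}s "
          f"95th pct={q[-1]:.2f}s")
```

Both sets have a 2.00-second mean, but the skewed set's 95th percentile is more than double the steady set's. That gap is exactly what a distribution-aware look reveals and a mean-based alarm misses.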
An immediate objection I'm going to get is that you can create a new set of alarms or metrics after the fact for any specific root problem once you've encountered it. The argument continues that all cases eventually reduce to the analysis I made in part 1, so operations can simply behave reactively. Operations certainly should create new alarms and monitoring points based on experience. However, there are some limitations to this thinking. First, if you watch and alarm on too much, your overhead and false-positive rate can become prohibitive. Second, serious web properties are complicated and ever-changing, so the attitude that one could quickly reach a state of watching everything of consequence strikes me as hubris. Finally, there is a theoretical problem: non-normal populations defy prediction, so creating alarms to cover every case may not be possible. That said, I still hope you will discover things to do proactively in this post.
So now let’s return to our anonymous Tier 1 web property from the probability distribution post and look at the USA performance data just for their very important splash page over the past two weeks.
Here are a few important stats: the number of data points is 3,276 and the mean is 2.029 seconds. That's really good performance, and I don't think I'm out on a limb when I say that number alone probably isn't throwing up red flags, so let's assume this is their baseline goal for splash-page performance in the USA. Note that it's a pretty heavy page, not a stripped-down one like Google's. Still, we can tell from the distribution that some users aren't enjoying 2 seconds. From my last post I've borrowed my estimate that a 2-second delay means a 15% reduction in the value of your website, and I've made some rough estimates of how damaging shorter and longer delays are, to construct the table below.
We can see that 23% of users didn't get the target performance, and that this translates into roughly a 2.7% impact on the business value of the web property for this period. Depending on how conservative or aggressive you want your estimates to be, the figure should fall somewhere between 1% and 5%. If you are a $100MM business, let alone a billion-dollar business like this one, you might not be happy with that.
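If you want to run this kind of estimate against your own data, here's a rough sketch of the calculation. The samples are synthetic (a log-normal stand-in for the real Keynote export, so it won't reproduce the exact 23%/2.7% figures above), and the cost brackets beyond the 15%-at-2-seconds point are hypothetical placeholders for your own estimates:

```python
import random
import statistics

random.seed(1)
# Synthetic stand-in for the 3,276 Keynote samples: a log-normal shape
# mimics the right skew of real response-time data.
response_times = [random.lognormvariate(0.55, 0.45) for _ in range(3276)]

TARGET = 2.0  # seconds: the splash page's baseline goal

# Hypothetical delay-cost brackets in the spirit of the post's table;
# only the "2s of delay ~ 15% value lost" point is from the earlier post.
def value_lost(seconds):
    if seconds <= TARGET:
        return 0.00
    if seconds <= TARGET + 2.0:   # up to ~2s of extra delay
        return 0.15
    if seconds <= TARGET + 6.0:   # longer delays hurt more
        return 0.30
    return 0.50

slow_share = sum(t > TARGET for t in response_times) / len(response_times)
impact = statistics.fmean(value_lost(t) for t in response_times)

print(f"mean response: {statistics.fmean(response_times):.3f}s")
print(f"share missing target: {slow_share:.1%}")
print(f"estimated value lost: {impact:.1%}")
```

Swap in your own export and your own brackets; the structure of the estimate stays the same.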
Even if a few percent over two weeks doesn't get your attention, there is another matter we should be concerned about: long-term impact. At last year's Velocity conference, Google revealed a small but important tidbit from their study of page delay's effect on user value: users in their six-week study who were slowed by just 400ms continued to be less valuable after the study concluded.
In other words, you can cause lasting damage to your users via poor performance. Unfortunately we can only speculate whether a single incident can do this, or whether you must hurt users again and again before you earn their lasting mistrust. Likely the truth is that it depends, but for the time being let's see if we are hurting the same users over and over again. For example, perhaps we are slower for users in particular regions. Keynote has a premade graph for this:
Ruh-roh, Raggy! It appears that we aren't giving Chicago and Boston area users the same quality of service. Next we would explore why this is and whether it's a simple fix, or else build the business case if the solution requires spending money. Keep in mind I didn't cherry-pick this data population; I picked my guinea pig weeks ago, before I had a clear plan for this post. This is a real problem for a serious company that has been going on for at least the last few weeks. What gremlins might be lurking in your data?
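You don't need a premade graph to do this kind of slicing yourself. Here's a sketch of the idea with hypothetical per-city measurements; in practice the samples would come from your monitoring provider's raw data export:

```python
from collections import defaultdict
import statistics

# Hypothetical raw measurements: (monitoring-agent city, seconds).
samples = [
    ("Chicago", 3.1), ("Chicago", 2.8), ("Chicago", 3.4),
    ("Boston", 2.9), ("Boston", 3.2), ("Boston", 2.7),
    ("Seattle", 1.6), ("Seattle", 1.8), ("Seattle", 1.5),
    ("Dallas", 1.7), ("Dallas", 1.9), ("Dallas", 1.6),
]

by_city = defaultdict(list)
for city, seconds in samples:
    by_city[city].append(seconds)

# Flag cities whose typical experience misses the 2-second target.
for city, times in sorted(by_city.items()):
    median = statistics.median(times)
    flag = "  <-- investigate" if median > 2.0 else ""
    print(f"{city:8s} median={median:.1f}s n={len(times)}{flag}")
```

Even a crude slice like this would have surfaced the Chicago/Boston gap without waiting for any alarm to fire.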
The key takeaway here, though, is not that you should set up alarms for regional performance, although that may be a good idea. What causes some users to fall above the mean will vary from case to case, and certainly even here not all of the 23% can be explained by the Chicago/Boston phenomenon. Moreover, just looking at the mean for, say, every major metro area will miss other factors affecting a significant portion of your users. The point is that proactively watching the distribution of your site's performance will surface problems that alarms based on summary statistics can't capture. And just like compound interest in your bank account, a few 3% bumps in revenue, or whatever metric is near and dear to your heart, can very quickly start to add up. Finally, it is important to pursue these issues in a timely fashion, as bad experiences can have a lasting impact on users.
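To put a number on that compounding point: three separate 3% improvements don't just add up to 9%, they multiply to a bit more:

```python
# Three independent 3% improvements compound multiplicatively,
# much like interest on a bank balance.
bumps = [0.03, 0.03, 0.03]
combined = 1.0
for b in bumps:
    combined *= 1.0 + b
print(f"combined lift: {combined - 1:.2%}")  # ~9.27%
```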
Photo by Charlie Brewer