By Ian Withrow
As we’ve discussed in various blog posts, websites are like a sausage. OK, maybe not so directly, but like a sausage they are made of ingredients that come from many sources even though they are presented in one tidy package to the user. Today I’m going to break apart the sausage that is the TechCrunch blog, show how this can easily be done for any website using KITE, and, as a special bonus, show how a sausage maker can monitor all the pieces of their links using a Keynote technology called Virtual Pages.
TechCrunch Composition
First, a fair warning: just like with sausage making, finding out what is inside your favorite website is not always pretty. If you are of a squeamish disposition, you have been warned. Second, note that all details in this post are from the time of writing and the balance of ingredients is likely to change over time.
TechCrunch is a behemoth of a site, weighing in at just over 4 MB of data, 329 page elements, and a whopping 65 domains. About half of this comes from TechCrunch directly, or really via Wordpress, which is evidently the platform they use to power their blog. The rest comes from over 20 3rd parties. You read that correctly: 65 domains in total, and half the content originates from someone else. After direct content, the next biggest category is social sites: Facebook, Twitter, and tools related to these properties total about 1 MB. Google is about 500 KB, and the ‘Misc.’ category of various 3rd party tools that TechCrunch uses to improve user experience is about 200 KB. Ads and ad related content are about 180 KB. Below is a snapshot from KITE breaking it out for you. (Note that you need to complete the process in the next section to actually get this view.)
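As a rough check on the arithmetic above, the category sizes can be turned into shares of the page weight. This is only a minimal sketch: the 2,000 KB figure for direct content is an assumption derived from "about half" of roughly 4 MB, the other values are the rounded numbers quoted above, and the function name is made up for illustration.

```python
# Approximate category sizes in KB, rounded from the figures quoted above.
# ASSUMPTION: 2000 KB for direct content ("about half" of roughly 4 MB).
CATEGORY_KB = {
    "Direct (TechCrunch/Wordpress)": 2000,
    "Social (Facebook, Twitter)": 1000,
    "Google": 500,
    "Misc. tools": 200,
    "Ads": 180,
}


def category_shares(sizes_kb):
    """Return each category's share of the summed sizes, as a percentage."""
    total = sum(sizes_kb.values())
    return {name: 100.0 * kb / total for name, kb in sizes_kb.items()}


if __name__ == "__main__":
    for name, pct in category_shares(CATEGORY_KB).items():
        print(f"{name:32s} {pct:5.1f}%")
```

With these rounded inputs, ads and ad related content come to less than 5% of the bytes counted.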
Now it’s time for a few fun observations. The amount of content from Facebook and Twitter is huge! Each alone is bigger than most websites are in total. Digging into this is unfortunately off topic for this post but it is definitely something on my radar screen for the future. Another interesting area is the level of user tracking that goes on. I could identify at least 6 different 3rd parties that were tracking TechCrunch visitors, not including Google and Facebook. TechCrunch knows what you are, if not who.
Finally, while these stats make it seem like TechCrunch is hardly advertising, understand that TechCrunch is a very, very, very long page (vertically) and all the ad content is at the top where the user is most likely to see it. They aren’t dummies giving away their yummy sausage for free.
Scripting Sites for 3rd Party Monitoring & Analysis
To make sense of this mess I used Keynote’s KITE product. There are a lot of other great, free products out there that one can use to view all the content and domains of a page. However, KITE has the ability to permanently parcel out these domains into what we call Virtual Pages for ongoing monitoring and analysis. Note that this section won’t be a detailed how-to; I’m going to focus on highlighting what is possible with the tool. After that, you should be prepared to experiment or watch this training video, depending on your learning style.
After downloading TechCrunch in KITE I organized the content by domain as shown below.
This lets me easily see the composition and breakout of a page in a manual fashion. If I just want to see the domains I can simply collapse the domain groupings. There are tons of options that I can add to this view like content size and various time breakouts based on my interest. Here is a complete list:
With just this you can see that I can casually learn a lot about the page. However, if I’m serious about how TechCrunch and its 3rd parties perform, then I need ongoing data points. If I’m going to gather a lot of data, I don’t want to do this parsing and analysis manually; it just won’t scale. The solution is to organize this content into permanent logical pieces. For example, in a simple scenario I’d carve out a Virtual Page for my advertising so I could monitor and analyze the performance of that content separately from my own content with Keynote. As you can imagine, the more complicated your site becomes, the more important this exercise is. True, you can always pick through a waterfall manually to see who did it in the event of a problem, but if you want ongoing data about 3rd party performance or want to be proactive with alerts, then you’ll need something like Virtual Pages. The nice thing is that once you’ve designated content into a Virtual Page you can monitor and analyze it like a regular page.
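To give a feel for the by-domain breakout outside of KITE, here is a minimal Python sketch that groups a list of page elements by hostname and totals their sizes. It is not how KITE works internally; the element list, the example hostnames, and the function name are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# HYPOTHETICAL page elements as (URL, size in bytes); not real measurements.
ELEMENTS = [
    ("https://example.com/index.html", 52_000),
    ("https://example.com/style.css", 18_000),
    ("https://cdn.example.net/app.js", 140_000),
    ("https://ads.example.org/banner.png", 30_000),
    ("https://cdn.example.net/logo.png", 9_000),
]


def group_by_domain(elements):
    """Return {hostname: (element count, total bytes)} for the given elements."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for url, size in elements:
        host = urlsplit(url).hostname
        counts[host] += 1
        sizes[host] += size
    return {host: (counts[host], sizes[host]) for host in counts}


if __name__ == "__main__":
    # Print the heaviest domains first.
    summary = group_by_domain(ELEMENTS)
    for host, (count, size) in sorted(summary.items(), key=lambda kv: -kv[1][1]):
        print(f"{host:20s} {count:3d} elements {size / 1000:8.1f} KB")
```

Doing this once by hand is easy enough; repeating it for every measurement is the part that does not scale.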
Let’s discuss how I broke up and organized TechCrunch. Please note I’m not holding this up as the standard for the best or only way to use Virtual Pages. One thing we need to keep in mind is cost. Each page (virtual or otherwise) adds to the cost of the measurement, so in the real world we probably can’t go hog wild with these. Given an unlimited budget I’d define a Virtual Page for each 3rd party, possibly even one for each domain if I were especially crazy for detailed data. My guess is you live in the real world, and even if your site isn’t as complicated as this one you’ll need to create some buckets. Most likely you’d start with prior experience, defining Virtual Pages where you knew or suspected there was a problem. Here I simply broke the site into the following logical categories:
- TechCrunch direct content plus AOL
- Google (but not Google owned advertising)
- Twitter (and related tools like Postup)
- Wordpress (even though this is the core of the site, I want to evaluate my vendor here)
- Misc. Tools and Widgets for the users
- Analytics and User Tracking
- Ads and Ad related content
Why no CDN category? We certainly encounter CDNs here, but each is tied to a specific 3rd party. Facebook has its own CDN, the ad platforms have CDNs, and so forth. So instead I left the CDNs with their respective masters.
Here is a brief teaser for how this is done in KITE:
Step 1) Pick the URL you want to virtualize and run the page once (we did this already)
Step 2) Right click on the page in question and select ‘Insert Virtual Action’
Step 3) We now have a new Virtual Page at the bottom of the script. Right click ‘Match Page Elements’ and select ‘Add URL Match’. Here I’ve used the naming convention “vp:TechCrunch” to distinguish Virtual Pages from real pages, though in practice you can name them anything you want. There are other options that you can use to construct Virtual Pages, such as content type, that have interesting possibilities, but for addressing 3rd party content, URL seems ideal to me. As you see below, I’ve created a list of URL matches that should capture all the differently named TechCrunch domains.
Step 4) In the Script Properties Editor you can create the settings for each URL Match. Note in my script I used a variety of regular expressions so that I could get away with far fewer rules than the 65 domains and still cover all of the page content.
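The idea of covering many domains with a few regular expressions can be sketched outside of KITE in Python. This is an illustration only, not the rules from the script described above and not KITE's matching syntax: the patterns, the choice to match on hostname rather than the full URL, and the `classify` helper are all assumptions. The "vp:" names follow the naming convention from Step 3.

```python
import re
from urllib.parse import urlsplit

# HYPOTHETICAL Virtual Page rules: a name plus a few hostname patterns.
# Each pattern matches the bare domain and any subdomain of it.
VIRTUAL_PAGES = [
    ("vp:TechCrunch", [r"(^|\.)techcrunch\.com$"]),
    ("vp:Wordpress", [r"(^|\.)wp\.com$", r"(^|\.)wordpress\.com$"]),
    # A CDN domain stays with its owner instead of getting its own bucket.
    ("vp:Twitter", [r"(^|\.)twitter\.com$", r"(^|\.)twimg\.com$"]),
]


def classify(url, rules=VIRTUAL_PAGES, default="unmatched"):
    """Return the name of the first Virtual Page whose pattern matches the URL's host."""
    host = urlsplit(url).hostname or ""
    for name, patterns in rules:
        if any(re.search(pattern, host) for pattern in patterns):
            return name
    return default


if __name__ == "__main__":
    for url in ("https://s0.wp.com/a.js", "https://pbs.twimg.com/x.png"):
        print(url, "->", classify(url))
```

Anchoring each pattern with `(^|\.)` and `$` keeps a lookalike host such as `nottechcrunch.com` from landing in the wrong bucket, which matters once a handful of rules stands in for dozens of domains.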
Note that to do this I never had to write any code or do any advanced scripting. It was all point, click, and form completion. Hopefully by now you can see how easy it is to create Virtual Pages in KITE and have an idea of its possibilities.
How Can I Benefit from Virtual Pages?
The obvious and immediate answer is that you can now isolate and monitor the performance of certain 3rd parties or subsections of your website. If Facebook slows down, you’ll know immediately and explicitly that this is the case, regardless of the overall impact on your performance. Moreover, you can easily track and directly report on the performance of these guys over time without needing to manually crunch the data and objects yourself. Another interesting possibility is that you could monitor your own additions to your site to see how they fare. Finally, another angle might be to isolate and monitor all the JavaScript that your site utilizes. There are countless ways that Virtual Pages might be used, and my list probably just scratches the surface. Have fun with it!
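To make the "you'll know immediately" idea concrete, here is a small Python sketch that flags any Virtual Page whose latest load time is well above its recent median. It does not represent how Keynote alerting works; the function name, the 1.5x threshold, and the sample timings are all hypothetical.

```python
from statistics import median


def slow_virtual_pages(history, latest, factor=1.5):
    """Return the names whose latest time exceeds factor x their median history.

    history: {name: [past load times in seconds]}
    latest:  {name: most recent load time in seconds}
    """
    flagged = []
    for name, samples in history.items():
        baseline = median(samples)
        if name in latest and latest[name] > factor * baseline:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    # HYPOTHETICAL timings, in seconds, for two Virtual Pages.
    history = {"vp:Facebook": [1.0, 1.2, 1.1], "vp:Ads": [0.5, 0.6, 0.4]}
    latest = {"vp:Facebook": 2.0, "vp:Ads": 0.55}
    print(slow_virtual_pages(history, latest))
```

Because each Virtual Page is compared against its own baseline, a slowdown in one 3rd party stands out even when the overall page time barely moves.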


