Twitter’s Photo Storage (from the outside looking in)

Posted by Mike Brittain on May 24, 2010
Cloud Computing, WWW / 1 Comment

I’ve been working on some photo storage and serving problems at Etsy, which is exciting work given the number of photos we store for the items being sold on the site.  This sort of project makes you wonder how other sites are handling their photo storage and serving architectures.

Today I spent a few minutes looking at avatar photos from Twitter from the outside.  This is all from inspection of URLs and HTTP headers, and completely unofficial and unvalidated assumptions.

Two things I found interesting today were (1) the rate of new avatar photos being added to Twitter, and (2) the architecture for storing and serving images.


Avatar photos at Twitter have URLs that look something like the following:

I’m assuming the numeric ID increments linearly with each photo that is uploaded… two images uploaded a few minutes apart showed a relatively small increase between these IDs.  I compared one of these IDs with the ID of an older avatar, along with the “Last-Modified” header that was included with its HTTP response headers:

Last-Modified: Tue, 26 Feb 2008 03:15:46 GMT

Comparing these numbers shows that Twitter is currently ingesting somewhere over two million avatars per day.
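The arithmetic behind that estimate can be sketched as follows. The IDs and dates below are made-up stand-ins (the real IDs aren't reproduced in this post); the method is simply to divide the ID delta by the elapsed time between the old avatar's Last-Modified date and today:

```python
from datetime import datetime

# Hypothetical observations: (avatar ID, upload time). The IDs here are
# invented for illustration; the real ones came from comparing a fresh
# avatar's numeric ID against an older avatar's Last-Modified header.
old_id, old_time = 75_000_000, datetime(2008, 2, 26, 3, 15, 46)
new_id, new_time = 1_700_000_000, datetime(2010, 5, 24, 0, 0, 0)

elapsed_days = (new_time - old_time).total_seconds() / 86400
avatars_per_day = (new_id - old_id) / elapsed_days
print(f"~{avatars_per_day:,.0f} avatars/day")  # roughly two million
```

This assumes IDs are assigned sequentially and that the Last-Modified date reflects the upload time, both of which are unverified guesses from the outside.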

Stock, or library, avatars have a different URL pattern, which suggests they are not stored or served the same way as custom avatars.  This is good: many users share the same stock avatar, so reusing a single URL across all of them gets you the caching benefits of a shared object.

Storage and Hosting

Running a “host” lookup on the hostname of an avatar URL shows a CNAME to Akamai’s cache network:

$ host [avatar hostname]
[avatar hostname] is an alias for [Akamai hostname]
[Akamai hostname] has address [IP address]

(The hostnames and addresses were lost from the original output; the shape of the lookup — a CNAME chain resolving into Akamai’s network — is what matters here.)

If you’re familiar with Akamai’s network, you can dig into response headers that come from their cache servers.  I did a little of that, but the thing I found most interesting is that Akamai plucks avatar images from Amazon’s CloudFront service.

x-amz-id-2: NVloBPkil5u…
x-amz-request-id: 1EAA3DE5516E…
Server: AmazonS3
X-Amz-Cf-Id: 43e9fa481c3dcd79…
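A quick sketch of the kind of header spelunking described here: given a response’s headers, pick out the ones that betray an S3/CloudFront origin. The sample values are the (truncated) headers quoted above, not a live fetch:

```python
def amazon_origin_headers(headers):
    """Return the response headers that indicate an Amazon S3/CloudFront origin."""
    return {
        k: v for k, v in headers.items()
        if k.lower().startswith("x-amz") or v == "AmazonS3"
    }

# Sample data taken from the truncated headers quoted in this post.
sample = {
    "Content-Type": "image/jpeg",
    "x-amz-id-2": "NVloBPkil5u...",
    "x-amz-request-id": "1EAA3DE5516E...",
    "Server": "AmazonS3",
    "X-Amz-Cf-Id": "43e9fa481c3dcd79...",
}
print(amazon_origin_headers(sample))
```

In practice you’d feed this the headers from an HTTP HEAD request against an avatar URL; the x-amz-* headers point at S3, and X-Amz-Cf-Id specifically at CloudFront.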

It’s not news that Twitter uses S3 for storing their images, but I hadn’t thought about using CloudFront (which is effectively a CDN) as an origin to another CDN.  The benefit here, aside from not pounding the crap out of S3, is that Akamai’s regional cache servers can pull avatars from CloudFront POPs that are relatively close, as opposed to reaching all the way back to a single S3 origin (such as the “US Standard Region”, which I believe has two locations in the US).  CloudFront doesn’t have nearly as many global POPs as Akamai. But using it does speed up image delivery by ensuring that Akamai’s cache servers in Asia are grabbing files from a CloudFront POP in Hong Kong or Singapore, rather than jumping across the Pacific to North America.

I suspect that Twitter racks up a reasonably large bill with Amazon by storing and serving so many files from S3 and CloudFront.  However, it takes away the burden of owning all of the hardware, bandwidth, and manpower required to serve millions upon millions of images… especially when that is not a core feature of their site.


Choosing a CDN

Posted by Mike Brittain on January 15, 2010
WWW / 1 Comment

I thought this was a good article covering 8 Things to Consider When Choosing a CDN.  It’s worth a read.

The only bit I would disagree with is that the article reads a bit too video-centric for me.  It frames the main argument for using a CDN as not having enough bandwidth at your own data center to handle all of the requests being made to your servers.

I use CDNs for delivering static objects like images, CSS, and JavaScript.  An additional consideration I have is how fast cached objects will reach different locations across the country or around the world.  I’m dependent (as most sites are) on one central data center where my pages are being generated.  Every page view needs to make that trip over the network from the browser to my data center.  However, if the 20–100 subsequent requests for static objects can be served from a regional CDN node, performance in the end user’s browser will be much better than if every request made a full trip across the country.
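A back-of-envelope illustration of that argument, with entirely made-up numbers (round-trip times and connection counts vary widely in practice):

```python
# Compare the network time for a page's static objects fetched from a
# distant origin vs. a nearby CDN edge. All figures are hypothetical.
origin_rtt_ms = 80   # assumed cross-country round trip
edge_rtt_ms = 20     # assumed regional CDN node round trip
objects = 60         # somewhere in the 20-100 range mentioned above
parallel = 6         # assumed concurrent browser connections

rounds = -(-objects // parallel)  # ceiling division: batches of requests
origin_total = rounds * origin_rtt_ms
edge_total = rounds * edge_rtt_ms
print(origin_total, edge_total)  # round-trip time alone, ignoring transfer
```

Even this crude model shows the regional edge cutting the pure round-trip cost by the ratio of the two RTTs, which is the whole point of caching static objects close to the user.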


Caching URLs with Query Strings

Posted by Mike Brittain on June 11, 2009
WWW / 1 Comment

A common method for breaking a cached object, or for specifying a version of an object, is to put a version number into a query string or some other part of the file name.  I’ve been wary of using query strings in URLs of static assets that I cache, either in browser cache or on a CDN.  I hadn’t seen much hard evidence, but generally felt query strings provide another area of uncertainty in a network path that you already have little control over.

Today I found Google’s recommendation for leveraging proxy caches, which states, “to enable proxy caching for these resources, remove query strings from references to static resources, and instead encode the parameters into the file names themselves.”

What is a good alternative?  One method I like is to use a version number that is part of the URL path:


The “12345” is prepended in your page templates using a variable.  The value can be taken from anywhere you want… it could be part of a timestamp, a revision number in your source control, or maybe even manually changed in a constants file any time you are updating files in your “css” directory.  If you’re using Apache, this portion of the path can be easily stripped off the request using mod_rewrite.

  RewriteEngine On
  # Strip a leading numeric version segment off the request path
  RewriteRule ^/\d{5,}(/.+)$ $1

With this method, you’re not literally using directories with version numbers in them.  Rather, you just use your regular old /htdocs/css directory. The same can be applied to JavaScript, images, or other directories with static assets.

What you gain is the ability to break your cached versions of files when necessary, but to also set long-term cache times on these assets (while they’re not changing) and expect cache servers between you and your users to properly store these objects.
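The templating side of this can be sketched with a tiny helper.  The version constant and the asset path below are illustrative, not from any real site:

```python
# Assumed version value, e.g. a source-control revision number or a
# constant bumped by hand when static files change.
ASSET_VERSION = "12345"

def asset_url(path):
    """Prefix a static asset path with the current version segment."""
    return f"/{ASSET_VERSION}{path}"

print(asset_url("/css/screen.css"))  # -> /12345/css/screen.css
```

Bump ASSET_VERSION and every asset URL changes at once, busting browser and CDN caches in one stroke, while the mod_rewrite rule above keeps serving files from the same unversioned directory.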

More to come…

I’ll be speaking at Velocity in two weeks about how to optimize caching with your CDN.  I have a lot of content that I’d like to present, but only 20 minutes to do it.  Over the coming weeks I’ll be writing a number of posts covering topics that don’t make it into the presentation.


Velocity Schedule

Posted by Mike Brittain on April 08, 2009
Misc / 1 Comment

Velocity 2009
I’ll be speaking at Velocity on how to improve service with your CDN.  The talk is part of the web performance track.  A detailed outline of the talk is available on their site.  I’m pretty excited to share some of the tricks I’ve learned over the last 3 years with CDNs.  There’s also a great roster of speakers who I’m looking forward to seeing there.  Early registration is now open.


Cyberduck Support for Cloud Files and Amazon S3

Posted by Mike Brittain on January 21, 2009
Cloud Computing / 1 Comment

Cyberduck is a nice Mac FTP/SFTP GUI client that I’ve used in the past for moving files around between my desktop and some web servers.  Turns out they’ve added support for moving your files directly to Amazon S3 and Mosso (Rackspace) Cloud Files.  This means that you can use the same tool that you may previously have used for publishing content to your own web server to instead publish content directly to a self-service CDN.  Amazon uses its CloudFront service to distribute files, and Mosso is supposed to be integrated with Limelight Networks for distributing content from the Cloud Files system.

Just wish I had these services available to me three years ago.  They would have saved me some serious cash on bandwidth commits for CDNs for those silly little projects I was working on.


Download Speeds from Amazon S3

Posted by Mike Brittain on September 25, 2008
Cloud Computing / Comments Off on Download Speeds from Amazon S3

I’ve been planning to post some details about download speeds that I’ve seen from S3, and why you shouldn’t necessarily use S3 as a CDN, yet.  Granted, Amazon recently announced that they will be providing a content delivery service in front of S3.  This post has nothing to do with the CDN (CDS).

Scott posted the presentation he gave at the AWS Start-Up Tour.  It’s worth a read, and is a good summary of our business case for building the EC2/S3 hosting platform that I led when I worked at Heavy.

His post includes this graphic of how we measured S3 delivery speeds throughout the day.  Matt Spinks wrote the Munin plugin that generated this graph, and he tells me he’s planning to make it available for others to use.  (Update: it’s now available at Google Code.)

As we measured it, S3 is fairly variable in its delivery speeds.  Unfortunately, we didn’t measure latency to first byte, which would be good to know as well.

My own impression is that it is not a good idea to host video directly from S3 if you run a medium to large web site.  The forthcoming CDN service will probably help with this.  If you’re a small to medium site, you might be happy with hosting video on S3.  Hosting images and other static content (say, CSS and JS files) might also be a good idea if you don’t have a lot of your own server capacity.

For my own use, I’m planning on using S3 to host images and static content for some other sites I run, which use a shared hosting provider for serving PHP.  On a low traffic site, I’d be happy to offload images to S3.  And when the CDN service becomes available, the user experience should be even snappier.

One thing to note if you plan to use S3 for image hosting: look into setting the correct Cache-Control headers on your objects.  You need to do this when putting content onto S3; you can’t modify headers on existing content.  More on this in a future post.
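A sketch of what setting Cache-Control at upload time looks like.  This uses boto3 (which postdates this post) purely for illustration, and the bucket, key, and max-age values are made up:

```python
def put_kwargs(bucket, key, body, max_age=31536000):
    """Build S3 put_object arguments with a long-lived Cache-Control header.

    The header must be supplied at PUT time; it becomes part of the
    object's metadata and is served back on every GET.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ContentType": "image/jpeg",
        "CacheControl": f"max-age={max_age}, public",
    }

kwargs = put_kwargs("example-bucket", "images/example.jpg", b"...")
# import boto3
# boto3.client("s3").put_object(**kwargs)  # uncomment to actually upload
print(kwargs["CacheControl"])
```

The long max-age lets browsers and any CDN in front of S3 hold the object without revalidating, which pairs well with the versioned-URL cache-busting approach from the earlier post.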


Amazon Is Launching a CDN

Posted by Mike Brittain on September 18, 2008
Cloud Computing / Comments Off on Amazon Is Launching a CDN

Amazon Web Services is about to launch what appears to be a globally distributed delivery network for content stored in S3.  Details have just been released today about the new service, which looks to work in conjunction with existing S3 content… which is niiiice.  New domain names are handed out through an API, which presumably will CNAME to a node that is geographically close to the end user.

This should continue to heat up the CDN industry, which has already become quite competitive with a number of small players who have come on the scene over the last 2-3 years to challenge the big fish, Akamai.  I’ve had the pleasure of working with Akamai for delivering Heavy’s content in the past, and can say that they set themselves apart from the rest of the industry by providing a wider range of products than just content delivery.

I’ve used S3 as an origin server to other CDNs in the past, and look forward to seeing how their own delivery network compares in speed, latency, and pricing.  On the surface, this looks amazing for start-ups and other small online businesses who may not be able to afford a large contingent of vendors.  My suspicion is that the pricing will be incredibly competitive, and I look forward to seeing actual numbers.
