EC2 Hosting Architecture, Two Years Later

Posted by Mike Brittain on July 12, 2010
Cloud Computing / Comments Off on EC2 Hosting Architecture, Two Years Later

It’s been nearly two years to the day since I wrote my post about the hosting platform I setup on EC2.  The post still gets plenty of traffic and this week I was asked if it was still valid info.  I think there are some better ways of accomplishing what we set out to do back then, and here is a summary.

1. Instead of round-robin DNS for load balancing, you can now use Amazon’s Elastic Load Balancing (ELB) service.  The service allows for HTTP or TCP load balancing options.  I found that the HTTP 1.1 support in ELB is somewhat incomplete and “100 Continue” responses were not handled properly for large image uploads (a specific case I was using).

2. I chose Puppet two years ago for configuration management.  Since that time, OpsCode has released Chef, which is a friendlier way to manage your systems (in my opinion) that we also happen to use at Etsy.

3. Our database layer was built on four instances for MySQL, in a fairly paranoid configuration.  We had strong concerns about instances failing and losing data.  There are a couple of new tools available to help with running MySQL on EC2.  You can use the Elastic Block Store (EBS) for more resilient disk storage, or choose the Relational Database Service (RDS) which is MySQL implemented as a native service in AWS.  Disclaimer: I haven’t deployed production databases using either of these tools.  These are only suggestions for possibilities that look better/easier than the setup we used.

4. When it comes to monitoring tools, Ganglia is terrific.  What I like about Munin is the easy of writing plug-ins and the layout of similar services on a single page for quick comparisons between machines.  Ganglia’s plugins are also dead-simple to write.  In the five months I’ve been at Etsy, I’ve written at least 15.  In the three years I was using Munin, I probably wrote a total of six plug-ins.

Additionally, Ganglia has some sweet aggregated graphs for like machines.  This graph looks like a couple hundred web servers (as stacked lines).

All of the points listed in the “successes” section of that original article should still be considered valid and are worth reading again.  But I’ll highlight the last two, specifically:

Fault tolerance: Over the last two years when I worked at CafeMom we ran a number of full-time services on EC2 (fewer than 10 instances).  While a handful had been running for years at the time I left (had been started before I arrived, actually), other instances failed much sooner.  I can’t stress the importance of automated configurations for EC2 instances, given that these things have a tendency to fail when you’re busy work on another deadline.  I believe that in technical circles they refer to this as Murphy’s Law.

Portable hosting: I’m a big believer in commodity services.  The more generic your vendor services, the easier it is to switch them out when your blowout preventer fails.  I’ve mentioned a few services in this article that are specific to Amazon Web Services (ELB, RDS, and EBS).  If you go the route of Elastic Load Balancer or Relational Database Service, you should strongly consider what services you would use if you had to move to another cloud vendor.

Tags: , , , , , ,

Fragility of the Cloud

Posted by Mike Brittain on June 11, 2009
Cloud Computing / Comments Off on Fragility of the Cloud

A lightning strike causes EC2 outages and Om Malik blames the “fragility of the cloud,” rather than recognizing that all tech suffers failures.  I’ll say it again, this could have happened to my own servers, or my own data center, and I would have been much further up the creek than if Amazon team was taking care of it.  Besides, one of the most important lessons I have learned from working with AWS is that servers/services should fail, and fail gracefully.  It shouldn’t matter whether that service is “in the cloud” or in your data center.

Tags: , ,

Some Thoughts on Cloud Computing

Posted by Mike Brittain on August 17, 2008
Cloud Computing / 1 Comment

This post was republished by SYS-CON’s Cloud Computing Journal on Aug. 19, 2008.

What I’ve put together below are my thoughts following a recent panel on cloud computing in New York City.  Thanks to Murat Aktihanoglu at Unype for putting together the panel.  While the discussion was varied, and maybe disjointed, I came away with some new ideas.

This post hasn’t been well edited.  I apologize in advance.  These are mostly random notes I took away from the panel and my own experience with web hosting on the cloud.

What is “cloud computing”?

There are a variety of notions to how cloud computing is defined.  I tend to think that what this really boils down to is the ability to procure hardware or services that you wouldn’t normally have access to in a physical sense.  Rather than buying 20 new servers, you can spin them up on-demand, and also dump them whenever you want.  It’s the “utility” or “pay-as-you-go” model.

I don’t see any difference between spinning up one server to run some prototypes, or spinning up 100 to crunch through a huge data set.  People seem to be getting caught up in the notion that unless you are doing some sort of parallel processing with lots of nodes, you aren’t doing “cloud computing”.  I disagree.

I also don’t believe that virtualization is necessarily the same as cloud computing.  To me, virtualization means that you’re essentially splitting up fixed resources you already have into smaller chunks for other people to use.  This is your accounting and human resources departments sharing space on the same machine, but keeping them logically partitioned.  Providers are now selling virtualization under the cloud label.  But if I have to buy (or rent) 20 physical machines to virtualize into slices, then I’m still committed to 20 machines.  If I need more or fewer resources, I may need to work through a contract or serve out a lease term.  It’s no longer pay-as-you-go, it’s a major expenditure.

Software as a Service

I love the software as a service model.  I like having someone else running a database or mail service so that I don’t have to hire a team or own the plant to support it.  With the service being off-site, I don’t have to worry about local disasters (though be sure to watch out for providers without their own SLA or disaster recovery plans).  Our clients are again becoming thin.  Laptops will have fewer and fewer local applications installed, and simply access various online applications and databases.

Again, pay as you go.

Additionally, fewer staff to manage services in-house.  This means you won’t/can’t strangle your sysadmin when hosted email goes down for six hours.  That can be a good or a bad thing, depending on how you look at it.

The best part about this model, though, is that you focus your own resources on what you’re best at.  Does an online marketing agency need to know how to administer an Exchange server?  Or should that be outsourced to a company that has the expertise to run mail for over a hundred other companies?

Cloud != Scale

This seems like a typical misconception: If I build my application on a cloud computing platform, then it will automatically scale. Environments like EC2 provide the ability to scale your application horizontally.  Your application, however, still needs to be able to benefit from horizontal scaling.  If you can only handle 5 concurrent users per node, then adding more boxes isn’t going to get you to 10,000 users very quickly.  This seems obvious, but many people are still missing this point.

I don’t think there are many case studies yet of companies with applications “in the cloud” who also have suffered large amounts of traffic.  And when we do see more of these applications, they will tend to have been built by early adopters who are probably experts in their fields. These cloud services are not yet open and approachable enough so that you have your average developer poking around and building applications that have the DNA for failure.  Google has done a good job with promoting AppEngine using videos and hack-a-thons.

Decent architecture is always going to be foundational for scale.  Your application has to benefit from the availability of additional nodes.

Redundancy and Planning for Failure

Amazon gets a lot of heat when S3 goes down, or when Gmail is unavailable.  This is all a lot of finger pointing, especially by people have not started using cloud services — The “I told you so” crowd.  Truth be told, the day after the recent S3 outage, my company had an application that was offline for nearly the same amount of time as S3’s outage.  Are we any better?  No.

It’s incredibly important to have a failover option for your own application.  Before I left Heavy, we designed our storage on S3 so that it could be replicated to physical disks that we have at RackSpace.  When S3 went out, we just flipped over to the physical disks.  Eventually there will be a time when we don’t have enough disk to store what we keep at S3.  That doesn’t mean that it can’t be replicated to another cloud storage service.

Consider having a backup hosting service in place, either physical or using another cloud provider.  Your physical service could be provided by a managed hosting provider, or on some other dedicated hardware outside of your own office.  You don’t need to own your own servers for a backup solution.

If you don’t have much money to spend on physical machines to host your fully operating site or application, think about how you can reduce the site to a version that can be hosted on a minimal number of servers.  Can you maintain a read-only backup?  Can you host a backup of your most popular content (i.e. the top 5%), and temporarily turn off access to the rest of the site?

Abstraction Layers

Something that I have talked a lot about, but haven’t had enough time to spend building, is a good abstraction layer on top of cloud storage. Everyone seems to have slightly different APIs.  On the other hand, about 85% of the features overlap from provider to provider.  Why not write an abstraction layer to handle the 85% and use multiple services?  This could probably work pretty well for flipping back and forth between (or replicating amongst) various cloud storage services like S3, CloudFS, Nirvanix, and also physical disks.

I don’t know many details about SimpleDB and AppEngine’s datastore, but it seems to me that you may be able to apply this 85% rule to those as well.  You could probably even treat MySQL and PostgreSQL the same way.  You couldn’t use all of the joins and transactions you normally would want to use, but then again, writing an application specifically for cloud computing platforms seems to be a different sort of animal.  We’ve basically been doing the same thing for years with the so-called database abstraction layers.  You can say that you’ve got a layer that allows you to flip from one database engine to another, but chances are, you have some engine-specific code that you’ve been using that doesn’t translate well.

Porting an Application to EC2

I ported an application at Heavy that ran on physical machines we had available at RackSpace onto EC2.  How much effort did it take for the application developers?  Almost none.  We didn’t buy into using SimpleDB — we just ran MySQL on EC2 instances.  We split our team so that we had a couple of us building a few tools for managing our EC2 instances, and the other developers went about their business building a web application that could run on a standard LAMP stack.  Additionally, if EC2 ever goes out of commission, we have the code and databases backed.  They can easily be deployed to physical machines.

It’s worth saying this again… I ported an application from physical machines to the cloud.  This application was not written for a specific cloud service.  We were very concerned about lock-in from the beginning.


What did we gain by hosting our application on EC2?  Initially nothing.  We had the physical machines to run the application.  But as our traffic increases, we can fire up new instances on demand.  If traffic drops off, so does out monthly bill.  It’s variable cost web hosting.

Does hosting your application on EC2 solve scaling problems?  No.  If you can’t improve performance of your application by adding additional servers, then there are bottlenecks to solve.  Running your service on the cloud doesn’t mean it scales.

Furthermore, the cloud is not self-healing.  In other words, it doesn’t automatically monitor your application and grow your infrastructure.  That doesn’t mean, however, that you can’t build your application to do this.  Read Don MacAskill’s SkyNet posting to get some idea of how that can work.

I look forward to reading your comments.

Tags: , , , , , , , ,

How We Built a Web Hosting Infrastructure on EC2

Posted by Mike Brittain on July 19, 2008
Cloud Computing / 23 Comments

In the months prior to leaving Heavy, I led an exciting project to build a hosting platform for our online products on top of Amazon’s Elastic Compute Cloud (EC2).  We eventually launched our newest product at Heavy using EC2 as the primary hosting platform.

I’ve been following a lot of what other people have been doing with EC2 for data processing and handling big encoding or rendering jobs.  This is not one of those projects.

We set out to build a fairly standard LAMP hosting infrastructure where we could easily and quickly add additional capacity.  In fact, we can add new servers to our production pool in under 20 minutes, from the time we call the “run instance” API at EC2, to the time when public traffic begins hitting the new server.  This includes machine startup time, adding custom server config files and cron jobs, rolling out application code, running smoke tests, and adding the machine to public DNS.

What follows is a general outline of how we do this.

Wait!  But first, you should know that this article is pretty old and a lot has changed.  After you’ve read this article, please take a look at my follow up post: EC2 Hosting Architecture, Two Years Later.  — Mike

Architecture Summary

Heavy makes use of a pretty standard LAMP stack.  Administration scripts are written in PHP, Perl, or Bash.  There is a lot of caching in memory (memcached), file caching, and HTTP caching (Akamai).  The new site requires a layer of front-end web servers that double as application servers and a database layer (with replication).  The site is built entirely in PHP, making use of Zend Framework.  The database is MySQL.  We are not using Amazon’s SimpleDB service.

EC2 Hosting Architecture. Click for full-size image.

EC2 Hosting Architecture. Click for full-size image.

EC2 Images

I chose CentOS for the operating system on our machine images.  All of the machine images we built are designed specifically for their purpose, and there are a handful of them.  For example, web servers run Apache, PHP, and some Perl libraries.  Databases are installed with MySQL and PHP (for administration scripts).  Memcached nodes are built with memcached and barely anything else.

Thorsten von Eicken at RightScale has written  a lot of great material about their use of EC2.  I took a lot of ideas from their blog, including the use of RightScripts.  After banging around with some publicly available images, I started building my own by modifying their scripts for 32-bit and 64-bit Cent OS images.

It took a little getting used to the manner of building images with these scripts, and getting the right software packages installed.  Eventually it clicked and what I ended up with was a really simple script for building each type of machine that we would need.  Even better, it was simple to go back and re-configure any of the images and roll them into new machines.

Many thanks to Thorsten for providing these scripts.

Running Instances

Amazon provides some fantastic command line tools for managing EC2 instances and getting status on the service.  Unfortunately, these tools don’t really help much in terms of documenting what each server is doing. To keep track of this, I went to work building a control panel for our EC2 account that documents the roles for each machine and what products are running on it.  Our plan was not to run a single web site, but multiple sites/products each with their own database and web server clusters.

The control panel, by the way, lived on a physical machine of ours (at RackSpace), and not on EC2.

We realized early that all of these machines would need to know how to find each other.  Our control panel manages a global configuration file that lives in our S3 account that documents all of our servers’ roles. Every server is setup to to inspect this file and adjust its own application environment.  For example, when a new web server comes online it  grabs the configuration file and figures out which databases belong to the web site running on the instance.  If a database fails and a slave takes over as the new master, web servers can figure that out on their own without anyone manually logging in to change configs or host files.

Load Balancing

Although we had tested the EC2 servers to handle very high loads of traffic, we have the luxury of using Akamai’s Site Accelerator product in front of our web servers.  This allows for easy page caching in front of our web servers, and actually handles about 90% of the hits to the site.  Our web servers serve as the origin for Akamai’s proxy.  Rather than fooling around with additional servers to handle load balancing and configuring proper failover between them, we simple use round-robin DNS.  As it turns out, our load is very evenly distributed amongst the web servers.

Database Replication

Most people I’ve talked to about this setup want to know how we felt about hosting our database on EC2.  The best answer I can give is, “nervous”. Since EC2 doesn’t have persistent storage on machine instances, yet, we were liberal with setting up replication and backup servers.  A single master database is replicated to a slave (master candidate), and that slave replicates to a second slave (slave candidate).  Scripts were written to handle automated failover if the master becomes inaccessible; the master candidate is automatically be promoted to master, and our global configuration file is updated so that all of the web servers are aware of the change.

Furthermore, we run a second slave from the master database. This slave has a single role: dump snapshots of the database every 15 minutes and store them on S3.  If all of our EC2 instances should ever disappear from EC2, we have recent copies of the database on S3.

In all, 4 databases instances.  We’re being careful.

Server Configuration

Our images are designed to support a specific “role” for a machine, such as a database, web server, etc.  Once started, we identify “products” that will run on each machine.  These might be things like “”, “”, “Video Encoding”, etc.  Obviously, each of our products requires its own set of service configurations (Apache, MySQL), users and groups, and cron jobs.

We chose Puppet to roll out these configurations to new machines (hat tip to Justin Shepherd at RackSpace for this suggestion).  If you’re not familiar with Puppet, you can create classes, or roles, for each of your servers.  In the classes, you define configurations files, cron jobs, packages to install, etc.  Finally, you identify the hosts that belong to each class.  When a new machine is started up (plain old vanilla), it checks into the Puppet “master” server, and the master sends over the proper configs.

When Puppet works, it totally rocks.  It has its drawbacks, however.  There is (what I consider) a steep learning curve for its configuration language.  It’s also still very much in development.  When we upgraded to a new software version, the master server didn’t seem to play well with clients that were still on the old version.  We jumped through some hoops to get all of our clients talking to the master server again.

On the upside, however, we must have gone through setting up over 100 machine instances using Puppet.  And that would have taken hundreds of hours in server administration to get each machine configured.  Additionally, Puppet can do package management, tying into whatever package manager you use in your Linux distribution.  If we had started using Puppet earlier, we might have stuck with two baseline machine images, one for 32-bit, and one for 64-bit architectures.  Then we could allow Puppet to handle all of the software installations.

Many thanks to the guys at Reductive Labs.  Puppet is a very cool piece of software!


We use two pieces of software for monitoring: Munin and Nagios.  The Munin server we use is the same that we use for our physical machines.  The simple configuration needed for Munin nodes is built into our machine images, along with the properly installed plugins.  The control panel we built for EC2 also updates our local Munin server configuration to listen for the new machines.  As soon as we start a new EC2 instance, it begins to show up in our reports.

Our Nagios configuration is a work in progress.  There are two installations, one that we use on our physical machines, and one the lives within EC2.  The EC2-based installation is monitored by the one installed on our physical machines.  It is not tied as tightly to our control panel, yet, but it seems likely for that to happen soon.

Availability Zones

Not to be overlooked, availability zones allow you to distribute your EC2 instances across separate fault-tolerant groups.  If one availability zone goes down, machines in other zones should theoretically be insulated from the same issue, i.e. separate power, separate network connectivity, etc.

We built a color-coded indicator in our control panel of the availability zone where each machine is running.  This makes it easy for us to make sure that we balance our servers equally throughout all of the zones.


It’s always handy to have a physical backup, especially since EC2 is currently still in “beta”.  Since our installation on EC2 uses essentially the same architecture as we use on our physical machines at RackSpace, it would be simple for us to move the entire site back to those servers.  In fact, most of the configurations are already in place.  We also use Neustar as our DNS provider, so we can keep very low TTLs on our hostnames.  When we need to change the location of our origin servers, it’d done in a matter of seconds.


Here are some successes we took away from this project:

  1. Twenty-minute start up time. Hands down, this is the most impressive for me.  We can spin up new machines and put them into production in under 20 minutes.  This isn’t (SkyNet), but it’s pretty darn cool.
  2. Loads of scripts and automation. We moved from mostly manual server administration, which we got used to by running only a few physical machines, to a much more automated process.  This improves our general workflow for server administration, whether the servers are virtual machines or physical machines.
  3. Documentation of images. Using image builders based on RightScripts, we have a catalog of what software goes into each new server, cleanly spelled out in Bash. :)
  4. Fault tolerance. We don’t know what is going to happen with our virtual machines.  We’ve seen some unexpected behavior from EC2, and have designed with that in consideration.
  5. Portable hosting. I didn’t want to build a hosting architecture just for Amazon Web Services.  I wanted to build a fairly standard LAMP stack, but one that is redundant.  We can take all of these learnings and re-apply them to the physical servers we still run at RackSpace.


While I researched and developed much of this project, I couldn’t have finished it off without the help of a couple of other guys at Heavy, Matt Spinks and Henry Cavillones.  Matt led the database effort and all of the scripting involved for our automated failover.  Henry took care of our monitoring needs, image maintenance, and helping me iron out some of the issues we were originally seeing with our Cent OS configuration for EC2.  Thanks, guys!

I also want to mention Scott Penberthy, our CTO, who kept us on track and was an excellent sounding board.  Without Scott at the top of this project, it wouldn’t have come together.  Thanks!

Finally, the clever work put together and discussed by the guys at RightScale and SmugMug, and countless other blog and forum postings I read during this project to keep me in the right directions.

Tags: , , , , , ,

Diversity Factor at Amazon Web Services

Posted by Mike Brittain on April 25, 2008
Cloud Computing / Comments Off on Diversity Factor at Amazon Web Services

Nicholas Carr makes a nice point about “diversity factor” within Amazon’s AWS clients.

We have been looking at a number of services in our own hosting environment that are “spikey”, things like image manipulation, video encoding, marketing email blasts, etc.  By offloading the spikes from our origin servers, we get better efficiency out of those machines — I.e. they are handling consistent loads throughout the year.  We push the spikes onto S3 and EC2.  Amazon’s clients have a wide variety of needs, which help to even out their loads.  I might be driving high load today at EC2, but tomorrow my traffic might be sleepy during someone else’s peak.

Tags: , , ,