How We Built a Web Hosting Infrastructure on EC2

Posted by Mike Brittain on July 19, 2008
Cloud Computing

In the months prior to leaving Heavy, I led an exciting project to build a hosting platform for our online products on top of Amazon’s Elastic Compute Cloud (EC2).  We eventually launched our newest product at Heavy using EC2 as the primary hosting platform.

I’ve been following a lot of what other people have been doing with EC2 for data processing and handling big encoding or rendering jobs.  This is not one of those projects.

We set out to build a fairly standard LAMP hosting infrastructure where we could easily and quickly add additional capacity.  In fact, we can add new servers to our production pool in under 20 minutes, from the time we call the “run instance” API at EC2, to the time when public traffic begins hitting the new server.  This includes machine startup time, adding custom server config files and cron jobs, rolling out application code, running smoke tests, and adding the machine to public DNS.
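To make that concrete, here is a simplified sketch of the provisioning flow using Amazon's command line API tools.  The AMI ID, key pair, and helper script names are placeholders for illustration, not our actual setup.

    #!/bin/bash
    # Rough sketch of the ~20-minute provisioning flow (placeholders throughout).
    AMI="ami-12345678"        # one of our purpose-built CentOS images
    ZONE="us-east-1a"

    # 1. Start the instance via the EC2 API tools.
    INSTANCE_ID=$(ec2-run-instances "$AMI" -k my-keypair -z "$ZONE" -t m1.large \
      | awk '/^INSTANCE/ {print $2}')

    # 2. Wait for it to reach "running" and grab its public hostname.
    HOST=""
    while [ -z "$HOST" ]; do
      sleep 15
      HOST=$(ec2-describe-instances "$INSTANCE_ID" \
        | awk '/^INSTANCE/ && /running/ {print $4}')
    done

    # 3. Push server configs and cron jobs, roll out application code, smoke test.
    #    (hypothetical wrapper scripts standing in for our config/deploy tooling)
    ./push-configs.sh "$HOST"
    ./deploy-app.sh "$HOST"
    ./smoke-test.sh "$HOST" || exit 1

    # 4. Add the new host to the public, low-TTL round-robin DNS record.
    ./add-dns-record.sh www.example.com "$HOST"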

What follows is a general outline of how we do this.

Wait!  But first, you should know that this article is pretty old and a lot has changed.  After you’ve read this article, please take a look at my follow up post: EC2 Hosting Architecture, Two Years Later.  — Mike

Architecture Summary

Heavy makes use of a pretty standard LAMP stack.  Administration scripts are written in PHP, Perl, or Bash.  There is a lot of caching: in memory (memcached), in files, and at the HTTP layer (Akamai).  The new site requires a layer of front-end web servers that double as application servers, plus a database layer (with replication).  The site is built entirely in PHP, making use of Zend Framework.  The database is MySQL.  We are not using Amazon’s SimpleDB service.

EC2 Hosting Architecture. Click for full-size image.

EC2 Images

I chose CentOS for the operating system on our machine images.  All of the machine images we built are designed specifically for their purpose, and there are a handful of them.  For example, web servers run Apache, PHP, and some Perl libraries.  Databases are installed with MySQL and PHP (for administration scripts).  Memcached nodes are built with memcached and barely anything else.

Thorsten von Eicken at RightScale has written a lot of great material about their use of EC2.  I took a lot of ideas from their blog, including the use of RightScripts.  After banging around with some publicly available images, I started building my own by modifying their scripts for 32-bit and 64-bit CentOS images.

It took a little while to get used to building images with these scripts and to get the right software packages installed.  Eventually it clicked, and what I ended up with was a really simple script for building each type of machine we would need.  Even better, it was simple to go back, re-configure any of the images, and roll them into new machines.
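To give a flavor of what one of these per-role builds boils down to (this is an illustrative sketch, not one of our actual scripts): install the packages the role needs, bundle the volume, and register the result as an AMI.  The package list, bucket name, and credential paths below are placeholders.

    #!/bin/bash
    # Illustrative per-role image build using Amazon's AMI/API tools (placeholders).

    # 1. Install only what this role needs -- e.g. a web server image.
    yum -y install httpd php php-mysql

    # 2. Bundle the running volume into an image.
    ec2-bundle-vol -d /mnt -k /tmp/pk.pem -c /tmp/cert.pem \
      -u "$AWS_ACCOUNT_ID" -r i386

    # 3. Upload the bundle to S3 and register it as an AMI.
    ec2-upload-bundle -b my-ami-bucket/webserver -m /mnt/image.manifest.xml \
      -a "$AWS_ACCESS_KEY" -s "$AWS_SECRET_KEY"
    ec2-register my-ami-bucket/webserver/image.manifest.xml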

Many thanks to Thorsten for providing these scripts.

Running Instances

Amazon provides some fantastic command line tools for managing EC2 instances and getting status on the service.  Unfortunately, these tools don’t really help much in terms of documenting what each server is doing.  To keep track of this, I went to work building a control panel for our EC2 account that documents the role of each machine and which products are running on it.  Our plan was not to run a single web site, but multiple sites/products, each with its own database and web server clusters.

The control panel, by the way, lived on a physical machine of ours (at RackSpace), and not on EC2.

We realized early that all of these machines would need to know how to find each other.  Our control panel manages a global configuration file, stored in our S3 account, that documents all of our servers’ roles.  Every server is set up to inspect this file and adjust its own application environment.  For example, when a new web server comes online, it grabs the configuration file and figures out which databases belong to the web site running on the instance.  If a database fails and a slave takes over as the new master, web servers can figure that out on their own without anyone manually logging in to change configs or host files.
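As a minimal sketch of that idea (the bucket name, file format, and use of s3cmd are assumptions for illustration, not our exact implementation):

    #!/bin/bash
    # A server inspecting the global roles file stored on S3.

    # Fetch the current roles file (s3cmd standing in for your S3 client of choice).
    s3cmd get --force s3://my-config-bucket/servers.conf /tmp/servers.conf

    # Assumed line format:  <role>  <product>  <hostname>
    # e.g.:  db-master  heavy.com  ec2-67-202-0-1.compute-1.amazonaws.com
    DB_MASTER=$(awk '$1 == "db-master" && $2 == "heavy.com" {print $3}' /tmp/servers.conf)

    # A web server for this product would then point its application config
    # (or a local hosts entry) at the current master.
    echo "Current database master: $DB_MASTER"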

Load Balancing

Although we had tested the EC2 servers to handle very high loads of traffic, we have the luxury of using Akamai’s Site Accelerator product in front of our web servers.  This allows for easy page caching in front of our web servers and actually handles about 90% of the hits to the site.  Our web servers serve as the origin for Akamai’s proxy.  Rather than fooling around with additional servers to handle load balancing and configuring proper failover between them, we simply use round-robin DNS.  As it turns out, our load is very evenly distributed amongst the web servers.
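Round-robin here just means publishing several A records for the origin hostname; repeated lookups return the addresses in rotating order.  A quick way to see it (example hostname and addresses):

    # Multiple A records on the origin hostname; order rotates between responses.
    dig +short origin.example.com A
    #   10.0.0.11
    #   10.0.0.12
    #   10.0.0.13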

Database Replication

Most people I’ve talked to about this setup want to know how we felt about hosting our database on EC2.  The best answer I can give is, “nervous”.  Since EC2 doesn’t yet have persistent storage on machine instances, we were liberal in setting up replication and backup servers.  A single master database is replicated to a slave (master candidate), and that slave replicates to a second slave (slave candidate).  Scripts were written to handle automated failover if the master becomes inaccessible: the master candidate is automatically promoted to master, and our global configuration file is updated so that all of the web servers are aware of the change.
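The failover scripts themselves aren’t reproduced here, but the promotion step amounts to something like the sketch below.  Hostnames, credentials, and the config-update helper are hypothetical, and binary log coordinate handling is omitted for brevity.

    #!/bin/bash
    # Minimal sketch of promoting the master candidate after the master fails.

    NEW_MASTER="db2.internal"        # master candidate (slave of the old master)
    SLAVE_CANDIDATE="db3.internal"   # slave candidate (was replicating db2)

    # 1. Stop replication on the candidate and make it writable.
    mysql -h "$NEW_MASTER" -u admin -p"$DB_PASS" \
      -e "STOP SLAVE; RESET SLAVE; SET GLOBAL read_only = 0;"

    # 2. Re-point the remaining slave at the new master (binlog coordinates omitted).
    mysql -h "$SLAVE_CANDIDATE" -u admin -p"$DB_PASS" \
      -e "STOP SLAVE; CHANGE MASTER TO MASTER_HOST='$NEW_MASTER'; START SLAVE;"

    # 3. Update the global config on S3 so the web servers pick up the change.
    ./update-global-config.sh --role db-master --host "$NEW_MASTER"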

Furthermore, we run a second slave from the master database.  This slave has a single role: dump snapshots of the database every 15 minutes and store them on S3.  If all of our instances should ever disappear from EC2, we have recent copies of the database on S3.
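A hypothetical version of that snapshot job, with placeholder paths and bucket name (credentials assumed to live in ~/.my.cnf, and s3cmd standing in for your S3 client):

    #!/bin/bash
    # snapshot-to-s3.sh -- run from cron, e.g.:
    #   */15 * * * *  /usr/local/bin/snapshot-to-s3.sh

    TS=$(date +%Y%m%d-%H%M)
    DUMP="/mnt/backups/db-$TS.sql.gz"

    # --single-transaction keeps the dump consistent without locking InnoDB tables.
    mysqldump --single-transaction --all-databases | gzip > "$DUMP"

    # Ship the snapshot off-instance so it survives the loss of the EC2 instance.
    s3cmd put "$DUMP" "s3://my-backup-bucket/db/$(basename "$DUMP")"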

In all, four database instances.  We’re being careful.

Server Configuration

Our images are designed to support a specific “role” for a machine, such as database, web server, etc.  Once a machine is started, we identify the “products” that will run on it.  These might be things like “HuskyMedia.com”, “Heavy.com”, “Video Encoding”, etc.  Obviously, each of our products requires its own set of service configurations (Apache, MySQL), users and groups, and cron jobs.

We chose Puppet to roll out these configurations to new machines (hat tip to Justin Shepherd at RackSpace for this suggestion).  If you’re not familiar with Puppet, you can create classes, or roles, for each of your servers.  In the classes, you define configuration files, cron jobs, packages to install, etc.  Finally, you identify the hosts that belong to each class.  When a new machine is started up (plain old vanilla), it checks in to the Puppet “master” server, and the master sends over the proper configs.
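Our Puppet manifests aren’t reproduced here, but the check-in from a freshly booted instance looks roughly like this with the puppetd client of that era (the server name is a placeholder):

    # On a plain vanilla instance: check in with the puppetmaster and wait for a cert.
    puppetd --server puppet.example.com --waitforcert 60 --test

    # Once the certificate is signed on the master (puppetca --sign <hostname>),
    # the client pulls down its class: config files, cron jobs, and packages.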

When Puppet works, it totally rocks.  It has its drawbacks, however.  There is (what I consider) a steep learning curve for its configuration language.  It’s also still very much in development.  When we upgraded to a new software version, the master server didn’t seem to play well with clients that were still on the old version.  We jumped through some hoops to get all of our clients talking to the master server again.

On the upside, however, we must have gone through setting up over 100 machine instances using Puppet.  Doing that by hand would have taken hundreds of hours of server administration.  Additionally, Puppet can do package management, tying into whatever package manager your Linux distribution uses.  If we had started using Puppet earlier, we might have stuck with two baseline machine images, one for 32-bit and one for 64-bit architectures.  Then we could have let Puppet handle all of the software installations.

Many thanks to the guys at Reductive Labs.  Puppet is a very cool piece of software!

Monitoring

We use two pieces of software for monitoring: Munin and Nagios.  The Munin server is the same one we use for our physical machines.  The simple configuration needed for Munin nodes is built into our machine images, along with the proper plugins.  The control panel we built for EC2 also updates our local Munin server configuration to listen for the new machines.  As soon as we start a new EC2 instance, it begins to show up in our reports.
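In effect, the control panel does something like the following on our Munin server when a new instance starts (the hostname is a placeholder, and the real mechanics are wrapped up in the control panel code):

    #!/bin/bash
    # Append a node entry for the new instance to the Munin server's config.
    NEW_HOST="ec2-67-202-0-1.compute-1.amazonaws.com"

    cat >> /etc/munin/munin.conf <<EOF
    [$NEW_HOST]
        address $NEW_HOST
        use_node_name yes
    EOF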

Our Nagios configuration is a work in progress.  There are two installations: one that we use on our physical machines, and one that lives within EC2.  The EC2-based installation is monitored by the one installed on our physical machines.  It is not tied as tightly to our control panel yet, but that seems likely to happen soon.

Availability Zones

Not to be overlooked, availability zones allow you to distribute your EC2 instances across separate fault-tolerant groups.  If one availability zone goes down, machines in other zones should theoretically be insulated from the same issue, since each zone has separate power, separate network connectivity, etc.

We built a color-coded indicator into our control panel showing the availability zone where each machine is running.  This makes it easy for us to make sure that we balance our servers evenly across all of the zones.
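For reference, the zone is chosen at launch time with the -z flag on the API tools (the AMI ID and key pair are placeholders):

    # List the zones available to the account, then pin a new instance to one of them.
    ec2-describe-availability-zones
    ec2-run-instances ami-12345678 -k my-keypair -z us-east-1c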

Failover

It’s always handy to have a physical backup, especially since EC2 is currently still in “beta”.  Since our installation on EC2 uses essentially the same architecture as we use on our physical machines at RackSpace, it would be simple for us to move the entire site back to those servers.  In fact, most of the configurations are already in place.  We also use Neustar as our DNS provider, so we can keep very low TTLs on our hostnames.  When we need to change the location of our origin servers, it’s done in a matter of seconds.

Successes

Here are some successes we took away from this project:

  1. Twenty-minute start-up time. Hands down, this is the most impressive part for me.  We can spin up new machines and put them into production in under 20 minutes.  This isn’t SkyNet, but it’s pretty darn cool.
  2. Loads of scripts and automation. We moved from mostly manual server administration, which we got used to by running only a few physical machines, to a much more automated process.  This improves our general workflow for server administration, whether the servers are virtual machines or physical machines.
  3. Documentation of images. Using image builders based on RightScripts, we have a catalog of what software goes into each new server, cleanly spelled out in Bash. :)
  4. Fault tolerance. We don’t know what is going to happen with our virtual machines.  We’ve seen some unexpected behavior from EC2, and have designed with that in mind.
  5. Portable hosting. I didn’t want to build a hosting architecture just for Amazon Web Services.  I wanted to build a fairly standard LAMP stack, but one that is redundant.  We can take all of these learnings and re-apply them to the physical servers we still run at RackSpace.

Acknowledgments

While I researched and developed much of this project, I couldn’t have finished it off without the help of a couple of other guys at Heavy, Matt Spinks and Henry Cavillones.  Matt led the database effort and all of the scripting involved for our automated failover.  Henry took care of our monitoring needs and image maintenance, and helped me iron out some of the issues we were originally seeing with our CentOS configuration for EC2.  Thanks, guys!

I also want to mention Scott Penberthy, our CTO, who kept us on track and was an excellent sounding board.  Without Scott at the top of this project, it wouldn’t have come together.  Thanks!

Finally, thanks to the guys at RightScale and SmugMug for the clever work they put together and discussed publicly, and to the countless other blog and forum postings I read during this project that kept me pointed in the right direction.

23 Comments to How We Built a Web Hosting Infrastructure on EC2

Don MacAskill
July 19, 2008

Fascinating! Can’t wait until the database issue is solved more elegantly so we can do interesting customer-facing stuff up on EC2 as well…

Thrilled that people are seeing what we’re doing and finding it useful, too. :)

ubuntista
July 20, 2008

Great Article!

Mike, congrats on a nice set-up and thanks for the acknowledgments. I’m glad you like the simple RightScript way to config servers. I assume you saw that you can use the RightScale dashboard for free, might be handy ;-). Either way, best wishes for success!

mikebrittain
July 23, 2008

@Thorsten: To be honest, I don’t recall whether I looked closely at your dashboard or not. We knew early that we’d have to use a dashboard to tie everything together to keep management of the EC2 instances easy and organized. Will take another look. Thanks!

gaiusparx
July 29, 2008

Thanks for the great article…

Lonnie Wills
August 7, 2008

This is a great how to and is another example of how Amazon is creating some excellent utility based solutions to improve capacity with on demand models. Thank you for the detailed descriptions and diagrams.

Lonnie

W. Andrew Loe III
August 22, 2008

Can you discuss a little bit more about how your MySQL failover works? Do your webservers have a list of database servers to try, or do you actually update their configs (with Puppet or a Bash script) when the main fails and the slave is promoted?

Do you just use an elastic IP and reassign it to the new master (doesn’t happen very quickly in my experience)?

mikebrittain
August 29, 2008

@AndrewLoe: We have a process that regularly polls our global config to see what databases are available. That process is responsible for identifying changes to the environment and updating the appropriate application and server configs. If a memcached node were to drop out, or a new one became available, we would basically update a config file containing the nodes that can be connected to.

In the case of our databases we considered keeping that in an application config file, but instead decided to just update the /etc/hosts file on each application server. So the app is written to connect to “mysql-master” (or, whatever) and that hostname has an entry in the local /etc/hosts file for the current master (as determined from our global config). This means that the /etc/hosts file gets re-written from time to time by our environment-monitoring process.
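Roughly, the rewrite amounts to something like this (paths and the config format are simplified for illustration, assuming the roles file has already been pulled down from S3):

    # Find the current master in the roles file and point "mysql-master" at it.
    MASTER_HOST=$(awk '$1 == "db-master" {print $3}' /tmp/servers.conf)
    MASTER_IP=$(getent hosts "$MASTER_HOST" | awk '{print $1}')

    sed -i '/[[:space:]]mysql-master$/d' /etc/hosts
    echo "$MASTER_IP  mysql-master" >> /etc/hosts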

Raam Dev
September 16, 2008

Thanks for this writeup!

I run a web hosting company and I’m considering eventually moving the entire thing over to EC2. Your research, and the various tools you mentioned, will be a godsend when that time comes!

David
September 29, 2008

Interesting article Mike. We’re in the process of setting up some EC2 servers and I’m wondering how you cope with redundancy? If a web server goes down, presumably the round robin dns still sends some clients to the dead web server. Do you reduce the impact of this problem with low TTL or did I miss something?

Mike Brittain
September 29, 2008

We kept TTLs very low (less than 5 minutes). With round-robin, end users (or in our case a set of Akamai proxies) should poll one server and, if it’s down, move on to another. I don’t recall how well/quickly the failover was working when we tested that.

Additionally, all boxes were monitored for uptime. When one went down, we could pull it out of the rotation or replace pretty quickly.

guigouz
October 3, 2008

How about the bandwidth costs ? Are they worth running the infrastructure on EC2 (instead of rackspace) ?

Michael
October 3, 2008

If you are just using round-robin DNS, how are you handling user sessions? Are your user sessions being kept in memcached for access by all web servers?

Mike Brittain
November 12, 2008

Sessions could be stored in memcache or in a database. Either way, as long as it’s shared storage and not on a local file system.

Mike Brittain
November 12, 2008

@guigouz The bandwidth cost is definitely cheaper than RackSpace. That should be a consideration on a case-by-case basis, with respect to wherever you are hosting your web site now.

Paul Lancaster
December 9, 2008

Mike — would be interested to see if we can give you another option with GoGrid — d me @paullancaster — we’re cheaper than AWS by a longshot