In the months prior to leaving Heavy, I led an exciting project to build a hosting platform for our online products on top of Amazon’s Elastic Compute Cloud (EC2). We eventually launched our newest product at Heavy using EC2 as the primary hosting platform.
I’ve been following a lot of what other people have been doing with EC2 for data processing and handling big encoding or rendering jobs. This is not one of those projects.
We set out to build a fairly standard LAMP hosting infrastructure where we could easily and quickly add additional capacity. In fact, we can add new servers to our production pool in under 20 minutes, from the time we call the “run instance” API at EC2, to the time when public traffic begins hitting the new server. This includes machine startup time, adding custom server config files and cron jobs, rolling out application code, running smoke tests, and adding the machine to public DNS.
What follows is a general outline of how we do this.
Wait! But first, you should know that this article is pretty old and a lot has changed. After you’ve read this article, please take a look at my follow up post: EC2 Hosting Architecture, Two Years Later. — Mike
Heavy makes use of a pretty standard LAMP stack. Administration scripts are written in PHP, Perl, or Bash. There is a lot of caching in memory (memcached), file caching, and HTTP caching (Akamai). The new site requires a layer of front-end web servers that double as application servers and a database layer (with replication). The site is built entirely in PHP, making use of Zend Framework. The database is MySQL. We are not using Amazon’s SimpleDB service.
I chose CentOS for the operating system on our machine images. All of the machine images we built are designed specifically for their purpose, and there are a handful of them. For example, web servers run Apache, PHP, and some Perl libraries. Databases are installed with MySQL and PHP (for administration scripts). Memcached nodes are built with memcached and barely anything else.
Thorsten von Eicken at RightScale has written a lot of great material about their use of EC2. I took a lot of ideas from their blog, including the use of RightScripts. After banging around with some publicly available images, I started building my own by modifying their scripts for 32-bit and 64-bit CentOS images.
It took a little getting used to the manner of building images with these scripts, and getting the right software packages installed. Eventually it clicked and what I ended up with was a really simple script for building each type of machine that we would need. Even better, it was simple to go back and re-configure any of the images and roll them into new machines.
Many thanks to Thorsten for providing these scripts.
Amazon provides some fantastic command line tools for managing EC2 instances and getting status on the service. Unfortunately, these tools don’t really help much in terms of documenting what each server is doing. To keep track of this, I went to work building a control panel for our EC2 account that documents the roles for each machine and what products are running on it. Our plan was not to run a single web site, but multiple sites/products each with their own database and web server clusters.
The control panel, by the way, lived on a physical machine of ours (at RackSpace), and not on EC2.
We realized early that all of these machines would need to know how to find each other. Our control panel manages a global configuration file that lives in our S3 account and documents all of our servers’ roles. Every server is set up to inspect this file and adjust its own application environment. For example, when a new web server comes online, it grabs the configuration file and figures out which databases belong to the web site running on the instance. If a database fails and a slave takes over as the new master, web servers can figure that out on their own without anyone manually logging in to change configs or host files.
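To make the lookup concrete, here is a minimal sketch of how a web server might read such a file and find its master database. The JSON layout, the role names, and the host names are all assumptions made for illustration; the real file's format isn't described here, and our own scripts were written in PHP, Perl, and Bash rather than Python.

```python
import json

# Hypothetical layout for the global configuration file. The field names,
# role names, and hosts are illustrative assumptions, not the real format.
EXAMPLE_CONFIG = """
{
  "servers": [
    {"host": "web1.example.internal", "role": "web", "product": "heavy.com"},
    {"host": "db1.example.internal", "role": "db-master", "product": "heavy.com"},
    {"host": "db2.example.internal", "role": "db-master-candidate", "product": "heavy.com"},
    {"host": "db9.example.internal", "role": "db-master", "product": "huskymedia.com"}
  ]
}
"""


def hosts_for(config, product, role):
    """Return the hosts that fill `role` for `product`, in file order."""
    return [
        server["host"]
        for server in config["servers"]
        if server["product"] == product and server["role"] == role
    ]


def master_database(config, product):
    """Return the single master database host for a product."""
    masters = hosts_for(config, product, "db-master")
    if len(masters) != 1:
        raise ValueError(
            f"expected exactly one master for {product}, found {len(masters)}"
        )
    return masters[0]


config = json.loads(EXAMPLE_CONFIG)
print(master_database(config, "heavy.com"))
```

Because every server derives its settings from the same file, changing a role in one place is enough for the whole cluster to pick it up the next time the file is read.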
Although we had tested the EC2 servers to handle very high loads of traffic, we have the luxury of using Akamai’s Site Accelerator product in front of our web servers. This allows for easy page caching in front of our web servers, and actually handles about 90% of the hits to the site. Our web servers serve as the origin for Akamai’s proxy. Rather than fooling around with additional servers to handle load balancing and configuring proper failover between them, we simply use round-robin DNS. As it turns out, our load is very evenly distributed amongst the web servers.
Most people I’ve talked to about this setup want to know how we felt about hosting our database on EC2. The best answer I can give is, “nervous”. Since EC2 doesn’t have persistent storage on machine instances, yet, we were liberal with setting up replication and backup servers. A single master database is replicated to a slave (master candidate), and that slave replicates to a second slave (slave candidate). Scripts were written to handle automated failover if the master becomes inaccessible; the master candidate is automatically promoted to master, and our global configuration file is updated so that all of the web servers are aware of the change.
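The promotion step can be sketched roughly as follows. This only shows the bookkeeping on the configuration data; it does not touch MySQL itself, and the role names, the `db-failed` marker, and the `is_reachable` probe are hypothetical stand-ins rather than what our scripts actually used.

```python
import copy


def promote_master_candidate(config, product):
    """Return a new config in which the master candidate for `product`
    is the master and the old master is marked as failed."""
    updated = copy.deepcopy(config)
    servers = [s for s in updated["servers"] if s["product"] == product]
    candidates = [s for s in servers if s["role"] == "db-master-candidate"]
    if not candidates:
        raise RuntimeError(f"no master candidate available for {product}")
    for server in servers:
        if server["role"] == "db-master":
            server["role"] = "db-failed"  # hypothetical marker role
    candidates[0]["role"] = "db-master"
    return updated


def failover_if_needed(config, product, is_reachable):
    """Promote the candidate only when the current master fails the probe.

    `is_reachable` is a caller-supplied function (host -> bool), so the
    health check itself stays outside this sketch.
    """
    masters = [
        s for s in config["servers"]
        if s["product"] == product and s["role"] == "db-master"
    ]
    if not masters:
        raise RuntimeError(f"no master configured for {product}")
    if is_reachable(masters[0]["host"]):
        return config
    return promote_master_candidate(config, product)


cluster = {
    "servers": [
        {"host": "db1.example.internal", "role": "db-master", "product": "heavy.com"},
        {"host": "db2.example.internal", "role": "db-master-candidate", "product": "heavy.com"},
        {"host": "db3.example.internal", "role": "db-slave-candidate", "product": "heavy.com"},
    ]
}

# Simulate db1 going dark: the probe fails for that host only.
after = failover_if_needed(
    cluster, "heavy.com", lambda host: host != "db1.example.internal"
)
print([(s["host"], s["role"]) for s in after["servers"]])
```

Returning a fresh copy instead of editing in place means the old configuration is still on hand if the new one needs to be compared or rolled back before it is published.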
Furthermore, we run a second slave from the master database. This slave has a single role: dump snapshots of the database every 15 minutes and store them on S3. If all of our EC2 instances should ever disappear from EC2, we have recent copies of the database on S3.
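One way to name those snapshots so the most recent one is easy to find is to key each dump by its 15-minute slot. This is only a sketch: the `db-snapshots/` prefix, the file extension, and the timestamp format are assumptions, and the actual dump and upload steps are left out.

```python
from datetime import datetime, timezone

SNAPSHOT_INTERVAL_MINUTES = 15


def snapshot_key(product, when):
    """Build a storage key for the 15-minute slot that contains `when`."""
    slot = when.replace(
        minute=when.minute - when.minute % SNAPSHOT_INTERVAL_MINUTES,
        second=0,
        microsecond=0,
    )
    # Hypothetical key layout; zero-padded timestamps sort chronologically.
    return f"db-snapshots/{product}/{slot.strftime('%Y%m%dT%H%MZ')}.sql.gz"


def latest_snapshot(keys, product):
    """Return the newest snapshot key for `product`, or None if there is none."""
    prefix = f"db-snapshots/{product}/"
    matching = sorted(key for key in keys if key.startswith(prefix))
    return matching[-1] if matching else None


now = datetime.now(timezone.utc)
print(snapshot_key("heavy.com", now))
```

With keys like these, restoring after a total loss of instances comes down to listing the bucket and fetching whatever `latest_snapshot` returns.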
In all, four database instances. We’re being careful.
Our images are designed to support a specific “role” for a machine, such as a database, web server, etc. Once started, we identify “products” that will run on each machine. These might be things like “HuskyMedia.com”, “Heavy.com”, “Video Encoding”, etc. Obviously, each of our products requires its own set of service configurations (Apache, MySQL), users and groups, and cron jobs.
We chose Puppet to roll out these configurations to new machines (hat tip to Justin Shepherd at RackSpace for this suggestion). If you’re not familiar with Puppet, you can create classes, or roles, for each of your servers. In the classes, you define configuration files, cron jobs, packages to install, etc. Finally, you identify the hosts that belong to each class. When a new machine is started up (plain old vanilla), it checks into the Puppet “master” server, and the master sends over the proper configs.
When Puppet works, it totally rocks. It has its drawbacks, however. There is (what I consider) a steep learning curve for its configuration language. It’s also still very much in development. When we upgraded to a new software version, the master server didn’t seem to play well with clients that were still on the old version. We jumped through some hoops to get all of our clients talking to the master server again.
On the upside, however, we must have gone through setting up over 100 machine instances using Puppet. And that would have taken hundreds of hours in server administration to get each machine configured. Additionally, Puppet can do package management, tying into whatever package manager you use in your Linux distribution. If we had started using Puppet earlier, we might have stuck with two baseline machine images, one for 32-bit, and one for 64-bit architectures. Then we could allow Puppet to handle all of the software installations.
Many thanks to the guys at Reductive Labs. Puppet is a very cool piece of software!
We use two pieces of software for monitoring: Munin and Nagios. The Munin server we use is the same that we use for our physical machines. The simple configuration needed for Munin nodes is built into our machine images, along with the properly installed plugins. The control panel we built for EC2 also updates our local Munin server configuration to listen for the new machines. As soon as we start a new EC2 instance, it begins to show up in our reports.
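As a rough idea of what the control panel has to produce, the sketch below renders host entries in the format the Munin master reads from its configuration. The group name, host names, and addresses are made up for the example, and how the generated text gets written into the live config is not shown.

```python
def munin_host_entries(servers, group="ec2.example.internal"):
    """Render one Munin host section per server, sorted by name.

    `servers` is a list of dicts with "name" and "address" keys; the group
    name is a hypothetical default used to keep EC2 hosts together.
    """
    sections = []
    for server in sorted(servers, key=lambda s: s["name"]):
        sections.append(
            f"[{group};{server['name']}]\n"
            f"    address {server['address']}\n"
            f"    use_node_name yes\n"
        )
    return "\n".join(sections)


instances = [
    {"name": "web2", "address": "10.0.0.6"},
    {"name": "web1", "address": "10.0.0.5"},
]
print(munin_host_entries(instances))
```

Regenerating the whole block from the list of running instances, rather than appending to it, also takes care of removing machines that have been shut down.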
Our Nagios configuration is a work in progress. There are two installations, one that we use on our physical machines, and one that lives within EC2. The EC2-based installation is monitored by the one installed on our physical machines. It is not tied as tightly to our control panel, yet, but it seems likely for that to happen soon.
Not to be overlooked, availability zones allow you to distribute your EC2 instances across separate fault-tolerant groups. If one availability zone goes down, machines in other zones should theoretically be insulated from the same issue, i.e. separate power, separate network connectivity, etc.
We built a color-coded indicator in our control panel of the availability zone where each machine is running. This makes it easy for us to make sure that we balance our servers equally throughout all of the zones.
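The same balancing check can be done in code. Here is a small sketch that counts instances per zone and suggests where the next machine should go; the zone names and the instance records are example values, and our control panel did this visually rather than with this logic.

```python
from collections import Counter


def zone_counts(instances, zones):
    """Count running instances per zone, including zones with none."""
    counts = Counter({zone: 0 for zone in zones})
    counts.update(instance["zone"] for instance in instances)
    return counts


def next_zone(instances, zones):
    """Pick the least-populated zone; ties go to the alphabetically first."""
    counts = zone_counts(instances, zones)
    return min(zones, key=lambda zone: (counts[zone], zone))


running = [
    {"id": "i-aaa", "zone": "us-east-1a"},
    {"id": "i-bbb", "zone": "us-east-1a"},
    {"id": "i-ccc", "zone": "us-east-1b"},
]
available = ["us-east-1a", "us-east-1b", "us-east-1c"]
print(next_zone(running, available))
```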
It’s always handy to have a physical backup, especially since EC2 is currently still in “beta”. Since our installation on EC2 uses essentially the same architecture as we use on our physical machines at RackSpace, it would be simple for us to move the entire site back to those servers. In fact, most of the configurations are already in place. We also use Neustar as our DNS provider, so we can keep very low TTLs on our hostnames. When we need to change the location of our origin servers, it’s done in a matter of seconds.
Here are some successes we took away from this project:
- Twenty-minute start up time. Hands down, this is the most impressive for me. We can spin up new machines and put them into production in under 20 minutes. This isn’t SkyNet, but it’s pretty darn cool.
- Loads of scripts and automation. We moved from mostly manual server administration, which we got used to by running only a few physical machines, to a much more automated process. This improves our general workflow for server administration, whether the servers are virtual machines or physical machines.
- Documentation of images. Using image builders based on RightScripts, we have a catalog of what software goes into each new server, cleanly spelled out in Bash. :)
- Fault tolerance. We don’t know what is going to happen with our virtual machines. We’ve seen some unexpected behavior from EC2, and have designed with that in consideration.
- Portable hosting. I didn’t want to build a hosting architecture just for Amazon Web Services. I wanted to build a fairly standard LAMP stack, but one that is redundant. We can take all of these learnings and re-apply them to the physical servers we still run at RackSpace.
While I researched and developed much of this project, I couldn’t have finished it off without the help of a couple of other guys at Heavy, Matt Spinks and Henry Cavillones. Matt led the database effort and all of the scripting involved for our automated failover. Henry took care of our monitoring needs and image maintenance, and helped me iron out some of the issues we were originally seeing with our CentOS configuration for EC2. Thanks, guys!
I also want to mention Scott Penberthy, our CTO, who kept us on track and was an excellent sounding board. Without Scott at the top of this project, it wouldn’t have come together. Thanks!
Finally, thanks for the clever work put together and discussed by the guys at RightScale and SmugMug, and for the countless other blog and forum postings I read during this project, which kept me pointed in the right direction.