Cloud Computing

Notes from AWS Start-Up Tour (NYC)

Posted by Mike Brittain on September 18, 2008

I attended the Amazon Web Services “Start-Up Tour” in New York today.  Though I’ve been using AWS for some time now, I learned a few little bits that I thought I would share.  Some of these might have been in press releases that I missed, but I still thought they were interesting.

1. 400K registered developers. Surely not all of these are active developers, or even doing anything large.  But this seems like a pretty good developer base for a set of services that are mostly still in “beta” and that most people consider bleeding edge.

2. “Muck”. This is the term that Amazon uses for all of that infrastructure that you shouldn’t need to build when you’re starting a company… because someone else has already done it.  Stop reinventing the wheel and get focused on your real business priorities.  I’ve heard this term before, but I love it.

3. Start-ups shouldn’t need an ops team. Using AWS, a start-up can get a real infrastructure set up without having to hire an operations team, go through capacity planning, purchase equipment, rent a rack in a colo, deal with power, bandwidth, security, etc.  Companies like RightScale can ease the implementation process, and at least one of the NYC panelists speaking at the Start-Up Tour made use of RightScale.  It adds a bit to the hourly charges, but keep in mind that you won’t need such a heavy-duty developer to manage your infrastructure.

4. S3 data redundancy. I was aware that data stored in S3 was replicated across multiple nodes, but according to Mike Culver (from AWS): 1.) S3 stripes across multiple nodes, 2.) Rebuilding a node doesn’t reduce performance of the striped system, and …drumroll… 3.) Data is replicated across multiple datacenters.  That’s the good part I didn’t know about.

5. 22 billion served. Well, not really “served”, but as of 2008 Q2, S3 has over 22 billion objects stored.

6. Upcoming products and features. Today, AWS announced that they will be providing a content delivery service for objects stored in S3 — high-speed, low-latency delivery.  Additionally, from what I heard talking with one of the evangelists, AWS is working on a long list of features requested by their customers.  When you think about a high-performance web application, there are a number of moving pieces — front-end servers, app servers, databases, storage, caching, load balancing, DNS, etc.  Lots of “muck” in there that isn’t already provided by AWS, but it sounds like they’re working on a number of these problems.  I’m looking forward to what may come out in the next year.


Amazon Is Launching a CDN

Posted by Mike Brittain on September 18, 2008

Amazon Web Services is about to launch what appears to be a globally distributed delivery network for content stored in S3.  Details were released today about the new service, which looks to work in conjunction with existing S3 content… which is niiiice.  New domain names are handed out through an API, which presumably will CNAME to a node that is geographically close to the end user’s POP.
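If you want to see where such a CNAME actually lands, a couple of DNS lookups are enough.  Here is a quick PHP check; the hostnames are made up, since nothing has launched yet and I’m only guessing at the mechanics:

```php
<?php
// Hypothetical: suppose the new service gives you a distribution hostname and
// you CNAME "media.example.com" to it. This shows what that CNAME currently
// resolves to, and which edge IPs sit behind it.

$hostname = 'media.example.com';   // placeholder CNAME you would control

$cnames = dns_get_record($hostname, DNS_CNAME);
foreach ($cnames as $record) {
    echo $hostname . ' -> ' . $record['target'] . "\n";
}

if (!empty($cnames)) {
    foreach (dns_get_record($cnames[0]['target'], DNS_A) as $a) {
        echo '  edge IP: ' . $a['ip'] . "\n";
    }
}
```

Run the same check from machines in different parts of the world and, if the service behaves the way I’m guessing, you should see different edge addresses come back.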

This should continue to heat up the CDN industry, which has already become quite competitive with a number of small players who have come on the scene over the last 2-3 years to challenge the big fish, Akamai.  I’ve had the pleasure of working with Akamai for delivering Heavy’s content in the past, and can say that they set themselves apart from the rest of the industry by providing a wider range of products than just content delivery.

I’ve used S3 as an origin server to other CDNs in the past, and look forward to seeing how Amazon’s own delivery network compares in speed, latency, and pricing.  On the surface, this looks amazing for start-ups and other small online businesses who may not be able to afford a large contingent of vendors.  My suspicion is that the pricing will be incredibly competitive, and I look forward to seeing actual numbers.


Some Thoughts on Cloud Computing

Posted by Mike Brittain on August 17, 2008

This post was republished by SYS-CON’s Cloud Computing Journal on Aug. 19, 2008.

What I’ve put together below are my thoughts following a recent panel on cloud computing in New York City.  Thanks to Murat Aktihanoglu at Unype for putting together the panel.  While the discussion was varied, and maybe disjointed, I came away with some new ideas.

This post hasn’t been well edited.  I apologize in advance.  These are mostly random notes I took away from the panel and my own experience with web hosting on the cloud.

What is “cloud computing”?

There are a variety of notions of how cloud computing is defined.  I tend to think that what this really boils down to is the ability to procure hardware or services that you wouldn’t normally have access to in a physical sense.  Rather than buying 20 new servers, you can spin them up on-demand, and also dump them whenever you want.  It’s the “utility” or “pay-as-you-go” model.

I don’t see any difference between spinning up one server to run some prototypes, or spinning up 100 to crunch through a huge data set.  People seem to be getting caught up in the notion that unless you are doing some sort of parallel processing with lots of nodes, you aren’t doing “cloud computing”.  I disagree.

I also don’t believe that virtualization is necessarily the same as cloud computing.  To me, virtualization means that you’re essentially splitting up fixed resources you already have into smaller chunks for other people to use.  This is your accounting and human resources departments sharing space on the same machine, but kept logically partitioned.  Providers are now selling virtualization under the cloud label.  But if I have to buy (or rent) 20 physical machines to virtualize into slices, then I’m still committed to 20 machines.  If I need more or fewer resources, I may need to work through a contract or serve out a lease term.  It’s no longer pay-as-you-go; it’s a major expenditure.

Software as a Service

I love the software as a service model.  I like having someone else running a database or mail service so that I don’t have to hire a team or own the plant to support it.  With the service being off-site, I don’t have to worry about local disasters (though be sure to watch out for providers without their own SLA or disaster recovery plans).  Our clients are again becoming thin.  Laptops will have fewer and fewer local applications installed, and simply access various online applications and databases.

Again, pay as you go.

Additionally, fewer staff to manage services in-house.  This means you won’t/can’t strangle your sysadmin when hosted email goes down for six hours.  That can be a good or a bad thing, depending on how you look at it.

The best part about this model, though, is that you focus your own resources on what you’re best at.  Does an online marketing agency need to know how to administer an Exchange server?  Or should that be outsourced to a company that has the expertise to run mail for over a hundred other companies?

Cloud != Scale

This seems like a typical misconception: If I build my application on a cloud computing platform, then it will automatically scale. Environments like EC2 provide the ability to scale your application horizontally.  Your application, however, still needs to be able to benefit from horizontal scaling.  If you can only handle 5 concurrent users per node, then adding more boxes isn’t going to get you to 10,000 users very quickly (at that rate you’d need 2,000 nodes).  This seems obvious, but many people are still missing this point.

I don’t think there are many case studies yet of companies with applications “in the cloud” that have also weathered large amounts of traffic.  And when we do see more of these applications, they will tend to have been built by early adopters who are probably experts in their fields.  These cloud services are not yet open and approachable enough for the average developer to be poking around and building applications that carry the DNA for failure.  Google has done a good job of promoting AppEngine with videos and hack-a-thons.

Decent architecture is always going to be foundational for scale.  Your application has to benefit from the availability of additional nodes.

Redundancy and Planning for Failure

Amazon gets a lot of heat when S3 goes down, just as Google does when Gmail is unavailable.  There is a lot of finger pointing, especially from people who have not started using cloud services (the “I told you so” crowd).  Truth be told, the day after the recent S3 outage, my company had an application that was offline for nearly the same amount of time as S3’s outage.  Are we any better?  No.

It’s incredibly important to have a failover option for your own application.  Before I left Heavy, we designed our storage on S3 so that it could be replicated to physical disks that we have at RackSpace.  When S3 went out, we just flipped over to the physical disks.  Eventually there will be a time when we don’t have enough disk to store what we keep at S3.  That doesn’t mean that it can’t be replicated to another cloud storage service.
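The read side of that flip can be pretty dumb.  Here is a rough sketch, with the bucket name, mirror path, and timeouts all invented, of a fetch that prefers S3 but quietly falls back to the physical mirror:

```php
<?php
// Minimal sketch: try the object on S3 first, with a short timeout, and fall
// back to a local/NFS mirror of the bucket if S3 is slow or unavailable.

function fetch_object($key)
{
    $s3_url     = 'http://s3.amazonaws.com/example-bucket/' . $key; // placeholder bucket
    $mirror_dir = '/mnt/s3-mirror/';                                // placeholder mirror path

    $ch = curl_init($s3_url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);   // give up quickly if S3 is down
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($body !== false && $code == 200) {
        return $body;
    }

    // S3 unavailable (or object missing): fall back to the physical mirror.
    $local = $mirror_dir . $key;
    return is_readable($local) ? file_get_contents($local) : null;
}
```

The harder part is keeping the mirror in sync, and that belongs in a background replication job rather than anywhere near the request path.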

Consider having a backup hosting service in place, either physical or using another cloud provider.  Your physical service could be provided by a managed hosting provider, or on some other dedicated hardware outside of your own office.  You don’t need to own your own servers for a backup solution.

If you don’t have much money to spend on physical machines to host your fully operating site or application, think about how you can reduce the site to a version that can be hosted on a minimal number of servers.  Can you maintain a read-only backup?  Can you host a backup of your most popular content (i.e. the top 5%), and temporarily turn off access to the rest of the site?

Abstraction Layers

Something that I have talked a lot about, but haven’t had enough time to spend building, is a good abstraction layer on top of cloud storage. Everyone seems to have slightly different APIs.  On the other hand, about 85% of the features overlap from provider to provider.  Why not write an abstraction layer to handle the 85% and use multiple services?  This could probably work pretty well for flipping back and forth between (or replicating amongst) various cloud storage services like S3, CloudFS, Nirvanix, and also physical disks.

I don’t know many details about SimpleDB and AppEngine’s datastore, but it seems to me that you may be able to apply this 85% rule to those as well.  You could probably even treat MySQL and PostgreSQL the same way.  You couldn’t use all of the joins and transactions you normally would want to use, but then again, writing an application specifically for cloud computing platforms seems to be a different sort of animal.  We’ve basically been doing the same thing for years with the so-called database abstraction layers.  You can say that you’ve got a layer that allows you to flip from one database engine to another, but chances are, you have some engine-specific code that you’ve been using that doesn’t translate well.
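To make the storage side of this a little more concrete, here is roughly the shape I have in mind: a tiny interface for the overlapping 85%, with one adapter per backend.  None of this is real production code; the interface and class names are invented, and the S3 adapter just shells out to s3cmd as one possible implementation.

```php
<?php
// Sketch of a storage abstraction layer: the application codes against
// ObjectStore and never cares which backend is underneath.

interface ObjectStore
{
    public function put($key, $data);
    public function get($key);      // returns the object body, or null
    public function delete($key);
}

// Backend 1: plain local (or NFS-mounted) disk.
class DiskStore implements ObjectStore
{
    private $root;

    public function __construct($root) { $this->root = rtrim($root, '/'); }

    public function put($key, $data)
    {
        $path = $this->root . '/' . $key;
        @mkdir(dirname($path), 0755, true);
        return file_put_contents($path, $data) !== false;
    }

    public function get($key)
    {
        $path = $this->root . '/' . $key;
        return is_readable($path) ? file_get_contents($path) : null;
    }

    public function delete($key)
    {
        return @unlink($this->root . '/' . $key);
    }
}

// Backend 2: S3, by shelling out to the s3cmd tool (one way among several).
class S3Store implements ObjectStore
{
    private $bucket;

    public function __construct($bucket) { $this->bucket = $bucket; }

    public function put($key, $data)
    {
        $tmp = tempnam(sys_get_temp_dir(), 's3');
        file_put_contents($tmp, $data);
        shell_exec(sprintf('s3cmd put %s s3://%s/%s',
            escapeshellarg($tmp), $this->bucket, escapeshellarg($key)));
        unlink($tmp);
        return true;
    }

    public function get($key)
    {
        $tmp = tempnam(sys_get_temp_dir(), 's3');
        shell_exec(sprintf('s3cmd get --force s3://%s/%s %s',
            $this->bucket, escapeshellarg($key), escapeshellarg($tmp)));
        clearstatcache();
        $data = (filesize($tmp) > 0) ? file_get_contents($tmp) : null;
        unlink($tmp);
        return $data;
    }

    public function delete($key)
    {
        shell_exec(sprintf('s3cmd del s3://%s/%s',
            $this->bucket, escapeshellarg($key)));
        return true;
    }
}

// $store = new S3Store('example-bucket');    // or: new DiskStore('/mnt/store');
// $store->put('images/logo.png', $png);
```

Replicating between providers then becomes a wrapper that calls put() on two stores instead of one, and flipping backends is a one-line change.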

Porting an Application to EC2

I ported an application at Heavy that ran on physical machines we had available at RackSpace onto EC2.  How much effort did it take for the application developers?  Almost none.  We didn’t buy into using SimpleDB; we just ran MySQL on EC2 instances.  We split our team so that a couple of us built tools for managing our EC2 instances, and the other developers went about their business building a web application that could run on a standard LAMP stack.  Additionally, if EC2 ever goes out of commission, we have the code and databases backed up.  They can easily be deployed to physical machines.

It’s worth saying this again… I ported an application from physical machines to the cloud.  This application was not written for a specific cloud service.  We were very concerned about lock-in from the beginning.

Conclusions

What did we gain by hosting our application on EC2?  Initially nothing.  We had the physical machines to run the application.  But as our traffic increases, we can fire up new instances on demand.  If traffic drops off, so does our monthly bill.  It’s variable-cost web hosting.

Does hosting your application on EC2 solve scaling problems?  No.  If you can’t improve performance of your application by adding additional servers, then there are bottlenecks to solve.  Running your service on the cloud doesn’t mean it scales.

Furthermore, the cloud is not self-healing.  In other words, it doesn’t automatically monitor your application and grow your infrastructure.  That doesn’t mean, however, that you can’t build your application to do this.  Read Don MacAskill’s SkyNet posting to get some idea of how that can work.
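To give a flavor of what building that yourself could look like, here is a toy watchdog you might run from cron.  The load metric, thresholds, status URL, and AMI id are all invented, and a real version would also handle scaling back down and bootstrapping the new machine:

```php
<?php
// Toy "grow the pool" check. Assumes each web server exposes its 1-minute
// load average at a private status URL, and that launching one more m1.small
// is an acceptable response to sustained load.

$max_load    = 4.0;   // placeholder threshold
$web_servers = array('web1.example.com', 'web2.example.com');

$overloaded = 0;
foreach ($web_servers as $host) {
    $load = (float) @file_get_contents('http://' . $host . '/internal/loadavg.txt');
    if ($load > $max_load) {
        $overloaded++;
    }
}

// If most of the pool is running hot, ask EC2 for another instance.
if ($overloaded >= count($web_servers) / 2) {
    // ec2-run-instances is Amazon's command line tool; ami-12345678 is a placeholder.
    shell_exec('ec2-run-instances ami-12345678 -n 1 -t m1.small -k my-keypair');
    // ...the usual bootstrap (configs, code push, DNS) still has to follow.
}
```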

I look forward to reading your comments.


How We Built a Web Hosting Infrastructure on EC2

Posted by Mike Brittain on July 19, 2008

In the months prior to leaving Heavy, I led an exciting project to build a hosting platform for our online products on top of Amazon’s Elastic Compute Cloud (EC2).  We eventually launched our newest product at Heavy using EC2 as the primary hosting platform.

I’ve been following a lot of what other people have been doing with EC2 for data processing and handling big encoding or rendering jobs.  This is not one of those projects.

We set out to build a fairly standard LAMP hosting infrastructure where we could easily and quickly add additional capacity.  In fact, we can add new servers to our production pool in under 20 minutes, from the time we call the “run instance” API at EC2, to the time when public traffic begins hitting the new server.  This includes machine startup time, adding custom server config files and cron jobs, rolling out application code, running smoke tests, and adding the machine to public DNS.

What follows is a general outline of how we do this.

Wait!  But first, you should know that this article is pretty old and a lot has changed.  After you’ve read this article, please take a look at my follow up post: EC2 Hosting Architecture, Two Years Later.  — Mike

Architecture Summary

Heavy makes use of a pretty standard LAMP stack.  Administration scripts are written in PHP, Perl, or Bash.  There is a lot of caching in memory (memcached), file caching, and HTTP caching (Akamai).  The new site requires a layer of front-end web servers that double as application servers and a database layer (with replication).  The site is built entirely in PHP, making use of Zend Framework.  The database is MySQL.  We are not using Amazon’s SimpleDB service.

EC2 Hosting Architecture (diagram).

EC2 Images

I chose CentOS for the operating system on our machine images.  All of the machine images we built are designed specifically for their purpose, and there are a handful of them.  For example, web servers run Apache, PHP, and some Perl libraries.  Databases are installed with MySQL and PHP (for administration scripts).  Memcached nodes are built with memcached and barely anything else.

Thorsten von Eicken at RightScale has written a lot of great material about their use of EC2.  I took a lot of ideas from their blog, including the use of RightScripts.  After banging around with some publicly available images, I started building my own by modifying their scripts for 32-bit and 64-bit CentOS images.

It took a little while to get used to the way these scripts build images and to get the right software packages installed.  Eventually it clicked, and what I ended up with was a really simple script for building each type of machine that we would need.  Even better, it was simple to go back, re-configure any of the images, and roll them into new machines.

Many thanks to Thorsten for providing these scripts.

Running Instances

Amazon provides some fantastic command line tools for managing EC2 instances and getting status on the service.  Unfortunately, these tools don’t really help much in terms of documenting what each server is doing. To keep track of this, I went to work building a control panel for our EC2 account that documents the roles for each machine and what products are running on it.  Our plan was not to run a single web site, but multiple sites/products each with their own database and web server clusters.
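As a rough illustration of the kind of thing those scripts wrap, here is the shape of launching an instance and waiting for it to come up, driving Amazon’s ec2-api-tools from PHP.  The AMI id, key pair, and zone are placeholders, and the real control panel obviously does quite a bit more:

```php
<?php
// Launch one instance and wait for it to reach the "running" state.

$out = shell_exec('ec2-run-instances ami-12345678 -n 1 -t m1.small -k my-keypair -z us-east-1a');
if (!preg_match('/\b(i-[0-9a-f]+)\b/', $out, $m)) {
    die("Could not parse an instance id out of:\n" . $out);
}
$instance_id = $m[1];

// Poll until EC2 reports the instance as running (usually a few minutes).
do {
    sleep(15);
    $desc = shell_exec('ec2-describe-instances ' . $instance_id);
} while (strpos($desc, 'running') === false);

// Grab the public hostname so the rest of the pipeline (configs, code push,
// smoke tests, DNS) knows where to go.
preg_match('/\b(ec2-[\w.-]+\.amazonaws\.com)\b/', $desc, $m);
$public_host = isset($m[1]) ? $m[1] : '(pending)';

echo "Instance {$instance_id} is running at {$public_host}\n";
```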

The control panel, by the way, lived on a physical machine of ours (at RackSpace), and not on EC2.

We realized early that all of these machines would need to know how to find each other.  Our control panel manages a global configuration file that lives in our S3 account and documents all of our servers’ roles.  Every server is set up to inspect this file and adjust its own application environment.  For example, when a new web server comes online it grabs the configuration file and figures out which databases belong to the web site running on the instance.  If a database fails and a slave takes over as the new master, web servers can figure that out on their own without anyone manually logging in to change configs or host files.
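The configuration file itself can stay very simple.  Here is a sketch of the “grab the file and adjust” step on a web server; the JSON layout, bucket name, and paths are invented for illustration:

```php
<?php
// Pull the shared server map down from S3 and point this instance at the
// current database master for its product.

$product = 'heavy.com';            // which product this instance serves
$tmp     = '/tmp/servers.json';

shell_exec('s3cmd get --force s3://example-config-bucket/servers.json ' . $tmp);
$config = json_decode(file_get_contents($tmp), true);

// Example layout:
// { "heavy.com": { "db_master": "10.0.1.5",
//                  "db_slaves": ["10.0.1.6", "10.0.1.7"],
//                  "memcached": ["10.0.2.1:11211", "10.0.2.2:11211"] } }

$db_master = $config[$product]['db_master'];

// Rewrite the local application config so Apache/PHP picks up the new value.
// (The path is a placeholder for however your app reads its DB settings.)
file_put_contents('/etc/myapp/db_master.conf', $db_master . "\n");
```

Run that from cron every minute or so and a promoted slave shows up on every web server without anyone touching a host file.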

Load Balancing

Although we had tested the EC2 servers to handle very high loads of traffic, we have the luxury of using Akamai’s Site Accelerator product in front of our web servers.  This allows for easy page caching in front of our web servers, and actually handles about 90% of the hits to the site.  Our web servers serve as the origin for Akamai’s proxy.  Rather than fooling around with additional servers to handle load balancing and configuring proper failover between them, we simply use round-robin DNS.  As it turns out, our load is very evenly distributed amongst the web servers.

Database Replication

Most people I’ve talked to about this setup want to know how we felt about hosting our database on EC2.  The best answer I can give is, “nervous”.  Since EC2 doesn’t yet have persistent storage on machine instances, we were liberal with setting up replication and backup servers.  A single master database is replicated to a slave (master candidate), and that slave replicates to a second slave (slave candidate).  Scripts were written to handle automated failover if the master becomes inaccessible: the master candidate is automatically promoted to master, and our global configuration file is updated so that all of the web servers are aware of the change.
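A stripped-down version of the promotion step looks something like this.  The real scripts have to worry about replication lag and keeping the old master out of rotation, and the hostnames and credentials here are placeholders:

```php
<?php
// Simplified failover check: if the master is unreachable, promote the
// master candidate and let the shared config file tell everyone else.

mysqli_report(MYSQLI_REPORT_OFF);   // we want false on failure, not exceptions

$master    = '10.0.1.5';   // placeholder addresses
$candidate = '10.0.1.6';

// Can we still reach the master?
$link = @mysqli_connect($master, 'monitor', 'secret');
if ($link) {
    mysqli_close($link);
    exit(0);   // master is fine, nothing to do
}

// Master looks dead: promote the candidate.
$db = mysqli_connect($candidate, 'admin', 'secret');
mysqli_query($db, 'STOP SLAVE');                 // stop replicating from the dead master
mysqli_query($db, 'RESET SLAVE');                // forget the old master coordinates
mysqli_query($db, 'SET GLOBAL read_only = 0');   // allow writes on the new master

// Finally, update servers.json in S3 (see the configuration sketch above) so
// that web servers and the remaining slaves pick up the new master.
```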

Furthermore, we run a second slave from the master database. This slave has a single role: dump snapshots of the database every 15 minutes and store them on S3.  If all of our EC2 instances should ever disappear from EC2, we have recent copies of the database on S3.
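The snapshot job is conceptually the simplest piece.  Here is a sketch of what a 15-minute dump script could look like, run from cron; the bucket name and paths are placeholders, and MySQL credentials are assumed to come from the usual .my.cnf:

```php
<?php
// Dump the databases, ship the dump to S3, and tidy up old local copies.

$stamp = date('Ymd-His');
$dump  = "/mnt/backups/db-snapshot-{$stamp}.sql.gz";

// --single-transaction gives a consistent InnoDB snapshot without locking
// tables for the duration of the dump.
shell_exec("mysqldump --single-transaction --all-databases | gzip > {$dump}");

// Ship it off-instance; if every EC2 instance vanished, the newest copy on S3
// would be at most about 15 minutes old.
shell_exec("s3cmd put {$dump} s3://example-db-backups/{$stamp}.sql.gz");

// Keep the local disk from filling up (retention on S3 is a separate job).
shell_exec('find /mnt/backups -name "db-snapshot-*.sql.gz" -mtime +1 -delete');
```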

In all, four database instances.  We’re being careful.

Server Configuration

Our images are designed to support a specific “role” for a machine, such as a database, web server, etc.  Once started, we identify “products” that will run on each machine.  These might be things like “HuskyMedia.com”, “Heavy.com”, “Video Encoding”, etc.  Obviously, each of our products requires its own set of service configurations (Apache, MySQL), users and groups, and cron jobs.

We chose Puppet to roll out these configurations to new machines (hat tip to Justin Shepherd at RackSpace for this suggestion).  If you’re not familiar with Puppet, you can create classes, or roles, for each of your servers.  In the classes, you define configurations files, cron jobs, packages to install, etc.  Finally, you identify the hosts that belong to each class.  When a new machine is started up (plain old vanilla), it checks into the Puppet “master” server, and the master sends over the proper configs.

When Puppet works, it totally rocks.  It has its drawbacks, however.  There is (what I consider) a steep learning curve for its configuration language.  It’s also still very much in development.  When we upgraded to a new software version, the master server didn’t seem to play well with clients that were still on the old version.  We jumped through some hoops to get all of our clients talking to the master server again.

On the upside, however, we must have gone through setting up over 100 machine instances using Puppet.  Doing that by hand would have taken hundreds of hours of server administration.  Additionally, Puppet can do package management, tying into whatever package manager you use in your Linux distribution.  If we had started using Puppet earlier, we might have stuck with two baseline machine images, one for 32-bit and one for 64-bit architectures, and let Puppet handle all of the software installations.

Many thanks to the guys at Reductive Labs.  Puppet is a very cool piece of software!

Monitoring

We use two pieces of software for monitoring: Munin and Nagios.  The Munin server we use is the same one that monitors our physical machines.  The simple configuration needed for Munin nodes is built into our machine images, along with the properly installed plugins.  The control panel we built for EC2 also updates our local Munin server configuration to listen for the new machines.  As soon as we start a new EC2 instance, it begins to show up in our reports.
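For the curious, “updates our local Munin server configuration” does not need to be anything fancier than regenerating a list of host blocks.  A sketch, with the host list hard-coded where the real thing would read from the control panel, and an output path that assumes your munin.conf includes that directory:

```php
<?php
// Regenerate the Munin host list for our EC2 instances.

$instances = array(
    // name to show in Munin      => address the munin-node listens on
    'web1.ec2.example.com'        => '10.0.1.11',
    'web2.ec2.example.com'        => '10.0.1.12',
    'db-master.ec2.example.com'   => '10.0.1.21',
);

$conf = "# Generated by the EC2 control panel -- do not edit by hand\n\n";
foreach ($instances as $name => $address) {
    $conf .= "[{$name}]\n";
    $conf .= "    address {$address}\n";
    $conf .= "    use_node_name yes\n\n";
}

file_put_contents('/etc/munin/munin-conf.d/ec2.conf', $conf);

// Munin picks up new hosts on its next cron run, so a freshly launched
// instance starts appearing in the graphs within a few minutes.
```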

Our Nagios configuration is a work in progress.  There are two installations: one that we use on our physical machines, and one that lives within EC2.  The EC2-based installation is monitored by the one installed on our physical machines.  It is not yet tied as tightly to our control panel, but that seems likely to happen soon.

Availability Zones

Not to be overlooked, availability zones allow you to distribute your EC2 instances across separate fault-tolerant groups.  If one availability zone goes down, machines in other zones should theoretically be insulated from the same issue, i.e. separate power, separate network connectivity, etc.

We built a color-coded indicator in our control panel of the availability zone where each machine is running.  This makes it easy for us to make sure that we balance our servers equally throughout all of the zones.

Failover

It’s always handy to have a physical backup, especially since EC2 is currently still in “beta”.  Since our installation on EC2 uses essentially the same architecture as our physical machines at RackSpace, it would be simple for us to move the entire site back to those servers.  In fact, most of the configurations are already in place.  We also use Neustar as our DNS provider, so we can keep very low TTLs on our hostnames.  When we need to change the location of our origin servers, it’s done in a matter of seconds.

Successes

Here are some successes we took away from this project:

  1. Twenty-minute start-up time. Hands down, this is the most impressive part for me.  We can spin up new machines and put them into production in under 20 minutes.  This isn’t SkyNet, but it’s pretty darn cool.
  2. Loads of scripts and automation. We moved from mostly manual server administration, which we got used to by running only a few physical machines, to a much more automated process.  This improves our general workflow for server administration, whether the servers are virtual machines or physical machines.
  3. Documentation of images. Using image builders based on RightScripts, we have a catalog of what software goes into each new server, cleanly spelled out in Bash. :)
  4. Fault tolerance. We don’t know what is going to happen with our virtual machines.  We’ve seen some unexpected behavior from EC2, and have designed with that in mind.
  5. Portable hosting. I didn’t want to build a hosting architecture just for Amazon Web Services.  I wanted to build a fairly standard LAMP stack, but one that is redundant.  We can take all of these learnings and re-apply them to the physical servers we still run at RackSpace.

Acknowledgments

While I researched and developed much of this project, I couldn’t have finished it off without the help of a couple of other guys at Heavy, Matt Spinks and Henry Cavillones.  Matt led the database effort and all of the scripting involved for our automated failover.  Henry took care of our monitoring needs and image maintenance, and helped me iron out some of the issues we were originally seeing with our CentOS configuration for EC2.  Thanks, guys!

I also want to mention Scott Penberthy, our CTO, who kept us on track and was an excellent sounding board.  Without Scott at the top of this project, it wouldn’t have come together.  Thanks!

Finally, thanks for the clever work put together and discussed by the guys at RightScale and SmugMug, and for the countless other blog and forum postings I read during this project that kept me headed in the right direction.


Diversity Factor at Amazon Web Services

Posted by Mike Brittain on April 25, 2008

Nicholas Carr makes a nice point about “diversity factor” within Amazon’s AWS clients.

We have been looking at a number of services in our own hosting environment that are “spikey”: things like image manipulation, video encoding, marketing email blasts, etc.  By offloading the spikes from our origin servers, we get better efficiency out of those machines, i.e. they handle consistent loads throughout the year.  We push the spikes onto S3 and EC2.  Amazon’s clients have a wide variety of needs, which helps to even out their loads.  I might be driving high load today at EC2, but tomorrow my traffic might be sleepy during someone else’s peak.
