Apache Whirr is also an option for building a Hadoop cluster on EC2. It gives you a more cloud-neutral approach and eases the pain of bringing the cluster in-house later if you need to.
http://whirr.apache.org/

Guy

On Tue, Nov 22, 2011 at 12:47 PM, Mark Grover <mgro...@oanda.com> wrote:

> Here is another article that might be insightful for you:
>
> http://www.cloudera.com/blog/2011/06/migrating-from-elastic-mapreduce-to-a-cloudera%E2%80%99s-distribution-including-apache-hadoop-cluster/?s-distribution-including-apache-hadoop-cluster/
>
> Sam raised some valid points and going with Amazon definitely is a
> (relatively) hassle free way to get started especially when one is
> constrained w.r.t resources related to managing of internal cluster.
>
> Mark
>
> ----- Original Message -----
> From: "Sam Wilson" <swil...@monetate.com>
> To: user@hive.apache.org
> Sent: Tuesday, November 22, 2011 3:38:01 PM
> Subject: Re: Building out Hive in EC2/S3 versus dedicated servers
>
> We recently adopted Hadoop and Hive for doing some significant data
> processing. We went the Amazon route.
>
> My own $.02 is as follows:
>
> If you are already incredibly experienced with Hadoop and Hive and have
> someone on staff who has previously built a cluster at least as big as the
> one you are projecting to require, then simply do some back of the envelope
> calculations and decide if it is cost effective to run on your own system
> given all your other business constraints. If you don't know how to do
> this, then you aren't sufficiently experienced to go this route.
>
> If you are new to Hadoop and Hive, then your best bet is to build your
> application first, using EMR as a prototype cluster. If your data is
> already loaded into S3 or you are already using Amazon, then this is also a
> no brainer way to get started. Hadoop and Hive are not what I would call
> user friendly. Frankly, they are full of bugs, and gotchas and are poorly
> documented. The learning curve is a bit steep. The most important thing is
> to prove out your functionality and build a system that delivers value
> quickly. You don't want your deadline to pass with only a pretty rack of
> servers to show for it. You need functionality.
>
> EMR lets you focus on your application, your code, your requirements,
> without having to deal with the details of the infrastructure. I simply
> cannot stress how nice it has been for us to be able to spin up new
> clusters on-the-fly while we were developing our application. Our ability
> to rapidly prototype has simply blown me away.
>
> Once you've got yourself up and running, your application is doing what
> it's supposed to, and you've built some familiarity with Hadoop and Hive,
> my suggestion is to then build a prototype cluster either hosted or in your
> office. Familiarize yourself with all the network, OS and other low-level
> details. Do some analysis on cost/performance, then decide whether or not
> to move your production system from Amazon to somewhere else.
>
> Everyone's application is going to be very unique to them, so looking at
> someone else's calculations is largely pointless.
>
> In our experience how did this pan out? We rebuilt a major system
> component in 3 months, reducing query times for certain jobs from 16+ days
> to 4 minutes. We did not purchase a single piece of hardware, or install a
> single piece of software we did not write ourselves. We have the ability to
> rapidly redeploy our system in any of 5 different data centers around the
> world at the flip of a few switches. If we wanted to deploy on our own
> hardware or in a colo at this point, we would only have to focus on
> building the cluster.
>
> Our app is already built, serving our customers and making us money.
>
> YMMV.
>
> On Nov 22, 2011, at 3:15 PM, Loren Siebert wrote:
>
> My colleague has a Heroku-based startup and they are just getting started
> with Hadoop and Hive. They’re evaluating running Hive in EC2/S3 versus
> buying a handful of boxes and installing CDH.
>
> One nice (albeit dated) analysis on this question is here, but I’m curious
> if anyone here has a different take on it:
>
> http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-colocation-comparison-for-hadoop-clusters/
>
> What is the sweet spot for when a Hive warehouse in EC2 makes the most
> sense?
>
> I’m asking on this Hive list versus the more general Hadoop lists because
> I think a solution for a Hive cluster could differ quite a bit from a
> solution for a HBase cluster.
>
> - Loren
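
For anyone wondering what the "back of the envelope" rent-versus-own calculation mentioned in the thread might look like, here is a minimal Python sketch. Nothing in it comes from the posters: the function names, the cost model, and every number are hypothetical placeholders, and the model deliberately ignores storage, bandwidth, and staff time. Substitute your own quotes before drawing any conclusion.

```python
# Rough rent-vs-own comparison for a compute cluster.
# All names and numbers are hypothetical; the model ignores storage,
# bandwidth, and staffing costs.

def monthly_rent_cost(nodes, price_per_node_hour, hours_per_month):
    """Cost of renting `nodes` instances for `hours_per_month` hours each."""
    return nodes * price_per_node_hour * hours_per_month


def monthly_own_cost(nodes, hardware_cost_per_node, amortization_months,
                     opex_per_node_month):
    """Amortized hardware cost plus monthly running cost (power, colo, etc.)."""
    per_node = hardware_cost_per_node / amortization_months + opex_per_node_month
    return nodes * per_node


def break_even_hours(price_per_node_hour, hardware_cost_per_node,
                     amortization_months, opex_per_node_month):
    """Hours of use per node per month at which renting costs as much as owning."""
    own_per_node = hardware_cost_per_node / amortization_months + opex_per_node_month
    return own_per_node / price_per_node_hour


if __name__ == "__main__":
    # Placeholder inputs: 10 nodes, $0.50 per node-hour to rent,
    # $3600 per server amortized over 36 months, $100/month to run each server.
    rent = monthly_rent_cost(10, 0.5, 200)
    own = monthly_own_cost(10, 3600, 36, 100)
    hours = break_even_hours(0.5, 3600, 36, 100)
    print(f"rent: ${rent:.2f}/month, own: ${own:.2f}/month")
    print(f"break-even at {hours:.0f} hours per node per month")
```

With these placeholder inputs, a cluster that runs only 200 hours a month is cheaper to rent, and owning starts to pay off only once each node is busy for more than the break-even number of hours. The result is only as good as the inputs, which is consistent with the point made in the thread that someone else's calculations rarely transfer to your own application.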