Apache Whirr is also an option for building a Hadoop cluster on EC2. It
gives you a more cloud-neutral approach and eases the pain of bringing the
cluster in-house later if you need to.

http://whirr.apache.org/
Guy

On Tue, Nov 22, 2011 at 12:47 PM, Mark Grover <mgro...@oanda.com> wrote:

> Here is another article that might be insightful for you:
>
> http://www.cloudera.com/blog/2011/06/migrating-from-elastic-mapreduce-to-a-cloudera%E2%80%99s-distribution-including-apache-hadoop-cluster/?s-distribution-including-apache-hadoop-cluster/
>
> Sam raised some valid points, and going with Amazon is definitely a
> (relatively) hassle-free way to get started, especially when one has
> limited resources for managing an internal cluster.
>
> Mark
>
> ----- Original Message -----
> From: "Sam Wilson" <swil...@monetate.com>
> To: user@hive.apache.org
> Sent: Tuesday, November 22, 2011 3:38:01 PM
> Subject: Re: Building out Hive in EC2/S3 versus dedicated servers
>
> We recently adopted Hadoop and Hive for doing some significant data
> processing. We went the Amazon route.
>
>
> My own $.02 is as follows:
>
>
> If you are already incredibly experienced with Hadoop and Hive and have
> someone on staff who has previously built a cluster at least as big as the
> one you are projecting to require, then simply do some back-of-the-envelope
> calculations and decide if it is cost effective to run on your own system
> given all your other business constraints. If you don't know how to do
> this, then you aren't sufficiently experienced to go this route.
>
>
> If you are new to Hadoop and Hive, then your best bet is to build your
> application first, using EMR as a prototype cluster. If your data is
> already loaded into S3 or you are already using Amazon, then this is also a
> no-brainer way to get started. Hadoop and Hive are not what I would call
> user friendly. Frankly, they are full of bugs and gotchas, and are poorly
> documented. The learning curve is a bit steep. The most important thing is
> to prove out your functionality and build a system that delivers value
> quickly. You don't want your deadline to pass with only a pretty rack of
> servers to show for it. You need functionality.
>
>
> EMR lets you focus on your application, your code, your requirements,
> without having to deal with the details of the infrastructure. I simply
> cannot stress how nice it has been for us to be able to spin up new
> clusters on-the-fly while we were developing our application. Our ability
> to rapidly prototype has simply blown me away.
>
>
> Once you've got yourself up and running, your application is doing what
> it's supposed to, and you've built some familiarity with Hadoop and Hive,
> my suggestion is to then build a prototype cluster either hosted or in your
> office. Familiarize yourself with all the network, OS and other low-level
> details. Do some analysis on cost/performance, then decide whether or not
> to move your production system from Amazon to somewhere else.
>
>
> Everyone's application is going to be unique to them, so looking at
> someone else's calculations is largely pointless.
>
>
> In our experience, how did this pan out? We rebuilt a major system
> component in 3 months, reducing query times for certain jobs from 16+ days
> to 4 minutes. We did not purchase a single piece of hardware, or install a
> single piece of software we did not write ourselves. We have the ability to
> rapidly redeploy our system in any of 5 different data centers around the
> world at the flip of a few switches. If we wanted to deploy on our own
> hardware or in a colo at this point, we would only have to focus on
> building the cluster.
>
>
> Our app is already built, serving our customers and making us money.
>
>
> YMMV.
>
>
>
>
>
> On Nov 22, 2011, at 3:15 PM, Loren Siebert wrote:
>
>
>
>
> My colleague has a Heroku-based startup and they are just getting started
> with Hadoop and Hive. They’re evaluating running Hive in EC2/S3 versus
> buying a handful of boxes and installing CDH.
>
>
> One nice (albeit dated) analysis on this question is here, but I’m curious
> if anyone here has a different take on it:
>
> http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-colocation-comparison-for-hadoop-clusters/
>
>
> What is the sweet spot for when a Hive warehouse in EC2 makes the most
> sense?
>
>
> I’m asking on this Hive list versus the more general Hadoop lists because
> I think a solution for a Hive cluster could differ quite a bit from a
> solution for an HBase cluster.
>
>
> - Loren
>
