Here is another article that might be insightful for you: http://www.cloudera.com/blog/2011/06/migrating-from-elastic-mapreduce-to-a-cloudera%E2%80%99s-distribution-including-apache-hadoop-cluster/
Sam raised some valid points. Going with Amazon definitely is a (relatively) hassle-free way to get started, especially when one is constrained with respect to the resources needed to manage an internal cluster.

Mark

----- Original Message -----
From: "Sam Wilson" <swil...@monetate.com>
To: user@hive.apache.org
Sent: Tuesday, November 22, 2011 3:38:01 PM
Subject: Re: Building out Hive in EC2/S3 versus dedicated servers

We recently adopted Hadoop and Hive for doing some significant data processing, and we went the Amazon route. My own $.02 is as follows:

If you are already incredibly experienced with Hadoop and Hive, and have someone on staff who has previously built a cluster at least as big as the one you project to require, then simply do some back-of-the-envelope calculations and decide whether it is cost-effective to run on your own hardware given all your other business constraints. If you don't know how to do this, then you aren't sufficiently experienced to go this route.

If you are new to Hadoop and Hive, then your best bet is to build your application first, using EMR as a prototype cluster. If your data is already loaded into S3, or you are already using Amazon, then this is also a no-brainer way to get started.

Hadoop and Hive are not what I would call user-friendly. Frankly, they are full of bugs and gotchas, and they are poorly documented. The learning curve is a bit steep. The most important thing is to prove out your functionality and build a system that delivers value quickly. You don't want your deadline to pass with only a pretty rack of servers to show for it; you need functionality. EMR lets you focus on your application, your code, and your requirements, without having to deal with the details of the infrastructure. I simply cannot stress how nice it has been for us to be able to spin up new clusters on the fly while we were developing our application. Our ability to rapidly prototype has simply blown me away.
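Sam's "back of the envelope" advice can be sketched as a simple monthly-cost comparison. This is only an illustrative skeleton: every number below (hourly rate, server price, amortization period, per-node operating cost, utilization) is an invented placeholder, not a real EC2 or hardware price, and real estimates must also account for engineering time, data-transfer fees, and failure/replacement costs.

```python
# Back-of-the-envelope rent-vs-own sketch. All prices are hypothetical
# placeholders -- substitute current EC2 on-demand rates and your own
# hardware quotes before drawing any conclusions.

def monthly_ec2_cost(nodes, hourly_rate, hours_per_month=730):
    """Cost of renting a cluster that runs around the clock."""
    return nodes * hourly_rate * hours_per_month

def monthly_owned_cost(nodes, server_price, amortize_months, per_node_opex):
    """Hardware amortized over its useful life, plus power/colo/admin."""
    return nodes * (server_price / amortize_months + per_node_opex)

if __name__ == "__main__":
    nodes = 10
    rent = monthly_ec2_cost(nodes, hourly_rate=0.50)       # assumed $/hr
    own = monthly_owned_cost(nodes, server_price=4000,     # assumed $
                             amortize_months=36, per_node_opex=150)
    print(f"rent: ${rent:,.0f}/mo   own: ${own:,.0f}/mo")

    # The rent-vs-own break-even hinges on utilization: an on-demand
    # cluster that is shut down when idle scales the bill down, while
    # owned hardware costs the same whether it is busy or not.
    utilization = 0.25
    print(f"rent at {utilization:.0%} utilization: "
          f"${rent * utilization:,.0f}/mo")
```

With these made-up numbers, a fully utilized cluster favors owning, but at 25% utilization renting wins, which matches the thread's point that the answer is specific to each application's workload.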
Once you've got yourself up and running, your application is doing what it's supposed to, and you've built some familiarity with Hadoop and Hive, my suggestion is to then build a prototype cluster, either hosted or in your office. Familiarize yourself with all the network, OS, and other low-level details. Do some analysis on cost/performance, then decide whether or not to move your production system from Amazon to somewhere else. Everyone's application is unique to them, so looking at someone else's calculations is largely pointless.

How did this pan out in our experience? We rebuilt a major system component in 3 months, reducing query times for certain jobs from 16+ days to 4 minutes. We did not purchase a single piece of hardware or install a single piece of software we did not write ourselves. We have the ability to rapidly redeploy our system in any of 5 different data centers around the world at the flip of a few switches. If we wanted to deploy on our own hardware or in a colo at this point, we would only have to focus on building the cluster. Our app is already built, serving our customers, and making us money. YMMV.

On Nov 22, 2011, at 3:15 PM, Loren Siebert wrote:

My colleague has a Heroku-based startup and they are just getting started with Hadoop and Hive. They're evaluating running Hive in EC2/S3 versus buying a handful of boxes and installing CDH. One nice (albeit dated) analysis of this question is here, but I'm curious whether anyone here has a different take on it: http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-colocation-comparison-for-hadoop-clusters/

What is the sweet spot where a Hive warehouse in EC2 makes the most sense? I'm asking on this Hive list rather than the more general Hadoop lists because I think a solution for a Hive cluster could differ quite a bit from a solution for an HBase cluster.

- Loren