Here is another article that might be insightful for you: http://www.cloudera.com/blog/2011/06/migrating-from-elastic-mapreduce-to-a-cloudera%E2%80%99s-distribution-including-apache-hadoop-cluster/
Sam raised some valid points. Going with Amazon definitely is a (relatively) hassle-free way to get started, especially when one is constrained with respect to the resources needed to manage an internal cluster.

Mark

----- Original Message -----
From: "Sam Wilson" <swil...@monetate.com>
To: user@hive.apache.org
Sent: Tuesday, November 22, 2011 3:38:01 PM
Subject: Re: Building out Hive in EC2/S3 versus dedicated servers

We recently adopted Hadoop and Hive for doing some significant data processing, and we went the Amazon route. My own $.02 is as follows:

If you are already incredibly experienced with Hadoop and Hive, and have someone on staff who has previously built a cluster at least as big as the one you project to require, then simply do some back-of-the-envelope calculations and decide whether it is cost-effective to run on your own hardware given all your other business constraints. If you don't know how to do this, then you aren't sufficiently experienced to go this route.

If you are new to Hadoop and Hive, then your best bet is to build your application first, using EMR as a prototype cluster. If your data is already loaded into S3, or you are already using Amazon, then this is also a no-brainer way to get started.

Hadoop and Hive are not what I would call user-friendly. Frankly, they are full of bugs and gotchas, and they are poorly documented. The learning curve is a bit steep. The most important thing is to prove out your functionality and build a system that delivers value quickly. You don't want your deadline to pass with only a pretty rack of servers to show for it; you need functionality. EMR lets you focus on your application, your code, and your requirements, without having to deal with the details of the infrastructure. I simply cannot stress how nice it has been for us to be able to spin up new clusters on the fly while we were developing our application. Our ability to rapidly prototype has simply blown me away.
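Sam's "back of the envelope" advice can be sketched as a simple monthly-cost comparison. This is only an illustrative skeleton: every number below (hourly rate, server price, amortization period, per-node operating cost, utilization) is an invented placeholder, not a real EC2 or hardware price, and real estimates must also account for engineering time, data-transfer fees, and failure/replacement costs.

```python
# Back-of-the-envelope rent-vs-own sketch. All prices are hypothetical
# placeholders -- substitute current EC2 on-demand rates and your own
# hardware quotes before drawing any conclusions.

def monthly_ec2_cost(nodes, hourly_rate, hours_per_month=730):
    """Cost of renting a cluster that runs around the clock."""
    return nodes * hourly_rate * hours_per_month

def monthly_owned_cost(nodes, server_price, amortize_months, per_node_opex):
    """Hardware amortized over its useful life, plus power/colo/admin."""
    return nodes * (server_price / amortize_months + per_node_opex)

if __name__ == "__main__":
    nodes = 10
    rent = monthly_ec2_cost(nodes, hourly_rate=0.50)       # assumed $/hr
    own = monthly_owned_cost(nodes, server_price=4000,     # assumed $
                             amortize_months=36, per_node_opex=150)
    print(f"rent: ${rent:,.0f}/mo   own: ${own:,.0f}/mo")

    # The rent-vs-own break-even hinges on utilization: an on-demand
    # cluster that is shut down when idle scales the bill down, while
    # owned hardware costs the same whether it is busy or not.
    utilization = 0.25
    print(f"rent at {utilization:.0%} utilization: "
          f"${rent * utilization:,.0f}/mo")
```

With these made-up numbers, a fully utilized cluster favors owning, but at 25% utilization renting wins, which matches the thread's point that the answer is specific to each application's workload.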
Once you've got yourself up and running, your application is doing what it's supposed to, and you've built some familiarity with Hadoop and Hive, my suggestion is to then build a prototype cluster, either hosted or in your office. Familiarize yourself with all the network, OS, and other low-level details. Do some analysis on cost/performance, then decide whether or not to move your production system from Amazon to somewhere else. Everyone's application is unique to them, so looking at someone else's calculations is largely pointless.

How did this pan out in our experience? We rebuilt a major system component in 3 months, reducing query times for certain jobs from 16+ days to 4 minutes. We did not purchase a single piece of hardware or install a single piece of software we did not write ourselves. We have the ability to rapidly redeploy our system in any of 5 different data centers around the world at the flip of a few switches. If we wanted to deploy on our own hardware or in a colo at this point, we would only have to focus on building the cluster. Our app is already built, serving our customers, and making us money. YMMV.

On Nov 22, 2011, at 3:15 PM, Loren Siebert wrote:

My colleague has a Heroku-based startup and they are just getting started with Hadoop and Hive. They're evaluating running Hive in EC2/S3 versus buying a handful of boxes and installing CDH. One nice (albeit dated) analysis of this question is here, but I'm curious whether anyone here has a different take on it: http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-colocation-comparison-for-hadoop-clusters/

What is the sweet spot where a Hive warehouse in EC2 makes the most sense? I'm asking on this Hive list rather than the more general Hadoop lists because I think a solution for a Hive cluster could differ quite a bit from a solution for an HBase cluster.

- Loren