A colleague of mine runs a Heroku-based startup that is just getting started with Hadoop and Hive. They’re deciding between running Hive on EC2/S3 and buying a handful of boxes and installing CDH.
One nice (albeit dated) analysis of this question is here, but I’m curious whether anyone on this list has a different take on it:

http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-colocation-comparison-for-hadoop-clusters/

What is the sweet spot at which a Hive warehouse in EC2 makes the most sense? I’m asking on this Hive list rather than the more general Hadoop lists because I think a solution for a Hive cluster could differ quite a bit from a solution for an HBase cluster.

- Loren