We also do #4. Initially we had lots of conversations about all the other options and whether we should do this or that... ultimately we focused on just going live as quickly as possible and getting more involved in the setup later.
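(For the curious, a minimal sketch of what one of those "create external table" statements can look like; the table name, columns, and S3 path here are purely illustrative:)

    CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
      user_id   STRING,
      url       STRING,
      referrer  STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3://my-bucket/warehouse/page_views/';

    -- One statement per table; run the whole file as the first step on a new
    -- cluster, e.g. `hive -f create_tables.hql`. For partitioned tables you
    -- would typically follow it with EMR's `ALTER TABLE page_views RECOVER
    -- PARTITIONS;` (or MSCK REPAIR TABLE on stock Hive) so existing S3
    -- partitions get picked up.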
Since then the only thing we've needed to do is hack a few of the baseline scripts EMR uses to launch Hive so that it gets more heap. We definitely have a few pain points around partition recovery, but those are inherent to Hive, not EMR. I should note that we don't trust our EMR cluster to stick around, so we design for it to just die: you can't treat it like a regular Hadoop cluster. We made launching a new one an easy process and have decoupled Hive from the UX so that it's fully asynchronous. So far, big wins and no complaints.

Sent from my iPhone

On Mar 6, 2012, at 10:02 PM, Jeff Sternberg <jsternb...@spcapitaliq.com> wrote:

> Mark,
>
> We do 4), basically. We have a simple Hive script that does all the "create
> external table" statements, and we run that script as step 1 of the EMR jobs
> we spin up. Then our "real" processing takes over in step 2 and beyond. We're
> only working with about 50 tables, so it's pretty manageable. A side benefit
> is that we can put this create-table script under source control to track our
> schema changes over time.
>
> Jeff Sternberg
> S&P Capital IQ
> www.spcapitaliq.com
>
> -----Original Message-----
> From: Mark Grover [mailto:mgro...@oanda.com]
> Sent: Tuesday, March 06, 2012 9:54 PM
> To: user@hive.apache.org
> Cc: Baiju Devani; Denys Berestyuk
> Subject: Amazon EMR Best Practices for Hive metastore
>
> Hi all,
> I am trying to get an idea of what people do for setting up a Hive metastore
> when using Amazon EMR.
>
> For those of you using Amazon EMR:
>
> 1) Do you have a dedicated RDS instance external to your EMR Hive+Hadoop
> cluster that you use as a persistent metastore for all your cluster
> instantiations?
>
> 2) Do you use the MySQL DB that comes pre-installed on the master node and
> export its data (on cluster tear-down) to something like S3 and import it
> from S3 during cluster bring-up?
>
> 3) Do you use a local installation of Hive (instead of the one on EMR) so
> that you can make use of an in-house dedicated metastore while utilizing the
> Hadoop cluster on EMR? (i.e. local Hive + EMR Hadoop)
>
> 4) Do you do something really simple and naive like scripting up all your
> "create external table" commands and running them every time you bring up a
> cluster?
>
> Or do you do something else not mentioned above? :-)
>
> Thank you in advance for sharing!
>
> Mark
>
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com    www: fxtrade.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.