Mark,

We do 4), basically. We have a simple Hive script that contains all the "create external table" statements, and we run that script as step 1 of the EMR jobs we spin up. Then our "real" processing takes over in step 2 and beyond. We're only working with about 50 tables, so it's pretty manageable. A side benefit is that we can put this create-table script under source control to track our schema changes over time.
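
A minimal sketch of what such a create-table script might look like. The file name, table name, columns, storage format, and S3 bucket below are all hypothetical and only illustrate the approach; they are not taken from the setup described above. IF NOT EXISTS keeps the script safe to re-run, and because the tables are external, dropping them (or losing the metastore at cluster teardown) leaves the underlying S3 data untouched.

```sql
-- create_tables.q (hypothetical file name): run as step 1 on every new cluster.

-- Hypothetical table; columns, delimiter, and location are illustrative only.
CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
  view_time STRING,
  user_id   BIGINT,
  url       STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://example-bucket/page_views/';

-- Partition entries also live in the metastore, so on a fresh cluster
-- they have to be registered again before queries can see the data.
ALTER TABLE page_views ADD IF NOT EXISTS
  PARTITION (dt = '2012-03-06')
  LOCATION 's3://example-bucket/page_views/dt=2012-03-06/';
```

With roughly 50 tables, the script would simply repeat this pattern once per table, and each schema change shows up as a diff in source control.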

Jeff Sternberg
S&P Capital IQ
www.spcapitaliq.com

-----Original Message-----
From: Mark Grover [mailto:mgro...@oanda.com]
Sent: Tuesday, March 06, 2012 9:54 PM
To: user@hive.apache.org
Cc: Baiju Devani; Denys Berestyuk
Subject: Amazon EMR Best Practices for Hive metastore

Hi all,

I am trying to get an idea of what people do for setting up Hive metastore when using Amazon EMR. For those of you using Amazon EMR:

1) Do you have a dedicated RDS instance external to your EMR Hive+Hadoop cluster that you use as a persistent metastore for all your cluster instantiations?

2) Do you use the MySQL DB that comes pre-installed on the master node and export its data (on cluster tear down) to something like S3 and import it from S3 during cluster bring up?

3) Do you use a local installation of Hive (instead of that on EMR) so that you could make use of an in-house dedicated metastore while utilizing Hadoop cluster on EMR? (i.e. local Hive + EMR Hadoop)

4) Do you do something really simple and naive like scripting up all your "create external table" commands and running them every time you bring up a cluster?

Or, do you do something else not mentioned above? :-)

Thank you in advance for sharing!

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com
www: fxtrade.com
"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.