Hi Ravi,

The idea of using EMR is that you don't have to keep a Hadoop cluster running all the time: put all your data in S3, spin up an EMR cluster, do your computation, and store the results back in S3. In the ideal case the data in S3 is never moved around, and Hive will always read from S3 as long as you have defined an S3 LOCATION and the table is external.
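As a rough illustration of what Richin means by an external table with an S3 location (the table name, columns, and bucket path below are made up for the example, not taken from an actual schema):

```sql
-- Hypothetical external table whose data lives (and stays) in S3.
-- Dropping an EXTERNAL table removes only the Hive metadata;
-- the files in S3 are left untouched.
CREATE EXTERNAL TABLE result (
  adid   STRING,
  clicks INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/result/';  -- bucket/path is an assumption
```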
If you have some tables that you access frequently, make them managed tables; Hive stores the data for a managed table in HDFS. So you might create a managed table (without the EXTERNAL keyword) result_managed, with fields similar to the result table, and do something like:

INSERT OVERWRITE TABLE result_managed SELECT * FROM result;

Basically you are copying the data from the external table into a managed table, nothing else.

Another thing to note when you are using Hive with S3 is SET hive.optimize.s3.query=true; - Amazon has done some optimizations of their own for Hive to work with S3.

Hope this helps.

Thanks,
Richin

From: ext Ravi Shetye [mailto:ravi.she...@vizury.com]
Sent: Monday, August 27, 2012 8:58 AM
To: user@hive.apache.org
Subject: Re: Hive on EMR on S3 : Beginner

Thanks to all your help I have moved ahead with my project. I create the table as

CREATE TABLE test (...)
PARTITIONED BY (adid STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://logs/'

then do ALTER TABLE results RECOVER PARTITIONS; and start querying. Now the issue is that it fetches data from S3 to HDFS for every single query, so if I remove the S3 buckets the results change. How can I remove this dependency, i.e. store the data on HDFS and then query it repeatedly? Am I even trying a valid use case, or am I doing something fundamentally wrong?
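The managed-table copy Richin suggests above can be sketched like this; the column names are assumed for illustration, and the statements follow his description rather than a tested setup:

```sql
-- Managed table: no EXTERNAL keyword and no S3 LOCATION,
-- so Hive keeps its data in HDFS on the cluster.
CREATE TABLE result_managed (
  adid   STRING,
  clicks INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Copy the data from the S3-backed external table into HDFS once;
-- subsequent queries against result_managed read only from HDFS
-- and no longer depend on the S3 bucket being present.
INSERT OVERWRITE TABLE result_managed
SELECT * FROM result;
```

Note that this copy is a one-time snapshot: if new data lands in S3, the INSERT OVERWRITE has to be rerun to refresh the managed table.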