Hi Ravi,

The idea of using EMR is that you don't have to keep a Hadoop cluster running all the time: put all your data in S3, spin up an EMR cluster, do your computation, and store the results back in S3. In the ideal case the data in S3 is never moved around, and Hive will always read from S3 as long as you have defined an S3 LOCATION and the table is external.
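As a rough illustration of what Richin means by an external table with an S3 location (the table name, columns, and bucket path below are made up for the example, not taken from an actual schema):

```sql
-- Hypothetical external table whose data lives (and stays) in S3.
-- Dropping an EXTERNAL table removes only the Hive metadata;
-- the files in S3 are left untouched.
CREATE EXTERNAL TABLE result (
  adid   STRING,
  clicks INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/result/';  -- bucket/path is an assumption
```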
If you have some tables that you access frequently, make them managed tables; Hive stores the data for a managed table in HDFS. So you might create a managed table (without the EXTERNAL keyword) result_managed, with fields similar to the result table, and do something like:

INSERT OVERWRITE TABLE result_managed SELECT * FROM result;

Basically you are copying the data from the external table into a managed table, nothing else.

Another thing to note when you are using Hive with S3 is SET hive.optimize.s3.query=true; - Amazon has done some optimizations of their own for Hive to work with S3.

Hope this helps.

Thanks,
Richin

From: ext Ravi Shetye [mailto:ravi.she...@vizury.com]
Sent: Monday, August 27, 2012 8:58 AM
To: user@hive.apache.org
Subject: Re: Hive on EMR on S3 : Beginner

Thanks to all your help I have moved ahead with my project. I create the table as

CREATE TABLE test (...)
PARTITIONED BY (adid STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://logs/'

then do ALTER TABLE results RECOVER PARTITIONS; and start querying. Now the issue is that it fetches data from S3 to HDFS for every single query, so if I remove the S3 buckets the results change. How can I remove this dependency, i.e. store the data on HDFS and then query it repeatedly? Am I even trying a valid use case, or am I doing something fundamentally wrong?
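The managed-table copy Richin suggests above can be sketched like this; the column names are assumed for illustration, and the statements follow his description rather than a tested setup:

```sql
-- Managed table: no EXTERNAL keyword and no S3 LOCATION,
-- so Hive keeps its data in HDFS on the cluster.
CREATE TABLE result_managed (
  adid   STRING,
  clicks INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Copy the data from the S3-backed external table into HDFS once;
-- subsequent queries against result_managed read only from HDFS
-- and no longer depend on the S3 bucket being present.
INSERT OVERWRITE TABLE result_managed
SELECT * FROM result;
```

Note that this copy is a one-time snapshot: if new data lands in S3, the INSERT OVERWRITE has to be rerun to refresh the managed table.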