RE: loading datafiles in s3

Christopher, Pat Tue, 28 Jun 2011 11:46:09 -0700

Hello,
1. I don't know. I generate my EMR scripts from a separate script, so generating
a stack of 'alter table...' queries is easy for me.
2. event_b will have a null value in column 4.
2b. (You didn't ask.) What happens with this row:


  event_c user_id  france 500 afifthcolumn

The extra field, afifthcolumn, will be truncated, and the row will contain only
event_c through 500.
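
For point 1, the "stack of 'alter table...' queries" could be generated along these lines. This is only a minimal sketch, not the actual script used here: the table name `events` and the partition columns `dt` and `hr` are hypothetical, and it assumes the table was created with a matching PARTITIONED BY clause. The base path is the one from the question.

```python
from datetime import datetime, timedelta


def add_partition_statements(table, base_uri, start, end):
    """Yield one ALTER TABLE ... ADD PARTITION statement per hour in [start, end)."""
    current = start
    while current < end:
        # Map each hour onto the existing YYYY/MM/DD/HH directory layout.
        location = f"{base_uri}/{current:%Y/%m/%d/%H}/"
        yield (
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (dt='{current:%Y-%m-%d}', hr='{current:%H}') "
            f"LOCATION '{location}';"
        )
        current += timedelta(hours=1)


# Hypothetical table name and time range, for illustration only.
statements = list(
    add_partition_statements(
        "events",
        "s3://bucketname/dir",
        datetime(2011, 6, 27, 0),
        datetime(2011, 6, 27, 3),
    )
)
for statement in statements:
    print(statement)
```

Each generated statement points a partition at an existing directory through LOCATION, so the directories themselves would not need a "key=" prefix.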

Pat

From: Kennon Lee [mailto:ken...@tinyco.com]
Sent: Monday, June 27, 2011 5:50 PM
To: user@hive.apache.org
Subject: loading datafiles in s3

Hello,
We're using hive on amazon elastic mapreduce to process logs on s3, and I had a 
couple basic questions. Apologies if they've been answered already-- I gathered 
most info from the hive tutorial on amazon 
(http://aws.amazon.com/articles/2855), as well as from skimming the hive wiki 
pages, but I'm still very new to all of this. So, questions:

1) Is it possible to partition on directories that do not have the "key=" 
prefix? Our logs are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2 and 
so ideally we could partition on that structure instead of adding "dt=" to 
every directory name. I found an old thread discussing this 
(http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)
but couldn't find the actual syntax.

2) How does hive handle tab-delimited files where rows sometimes have different 
column counts? For instance, if we are parsing an event log that contains 
multiple events, some of which have more columns associated with them:

event_a        user_id        apple          300
event_b        user_id        cat

If I define my hive table to have 4 columns, how will hive react to the event_b 
row?
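
The behaviour described in the reply at the top of this thread (missing trailing columns become null, extra columns are dropped) can be modelled with a few lines of Python. This is only an emulation of that description, not Hive code; the function name `parse_row` is made up for illustration, and type conversion is ignored.

```python
def parse_row(line, num_columns):
    """Split a tab-delimited line into exactly num_columns fields.

    Models the behaviour described in the reply: short rows are padded
    with None (null), and fields beyond num_columns are dropped.
    """
    fields = line.rstrip("\n").split("\t")
    fields = fields[:num_columns]                    # drop extra columns
    fields += [None] * (num_columns - len(fields))   # pad missing columns
    return fields


rows = [
    "event_a\tuser_id\tapple\t300",
    "event_b\tuser_id\tcat",
    "event_c\tuser_id\tfrance\t500\tafifthcolumn",
]
parsed = [parse_row(row, 4) for row in rows]
for fields in parsed:
    print(fields)
```

Under this model the event_b row comes back with None in column 4, and the event_c row loses its fifth field.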

Thanks!
