I think the answer to 1 is no, but you can confirm on the AWS EMR forum. The problem I've been having is that if your S3 path has x=foo in its prefix, EMR will try to use it as part of your partitioning key even if you don't want it. Say the path is x=foo/y=bar/data and you want to partition on y only: EMR Hive can get confused. Sometimes it works; other times it complains that x is not part of your INSERT .. PARTITION(y) clause. I haven't quite figured out when and why.
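
For what it's worth, the "generate a stack of alter table queries" approach Pat mentions below could look roughly like the following. This is only a sketch: the table name `logs` and the partition columns `dt` and `hr` are made up for illustration, and the base path is the one from the original question. It prints one ALTER TABLE ... ADD PARTITION ... LOCATION statement per hourly directory, which maps each partition explicitly to a path that has no "key=" prefix.

```python
from datetime import datetime, timedelta


def alter_statements(table, base, start, hours):
    """Yield one ALTER TABLE statement per hourly directory.

    `table`, and the partition columns `dt` and `hr`, are hypothetical
    names; adjust them to match your own table definition.
    """
    for i in range(hours):
        t = start + timedelta(hours=i)
        # Directory layout from the question: base/YYYY/MM/DD/HH/
        location = f"{base}/{t:%Y/%m/%d/%H}/"
        yield (
            f"ALTER TABLE {table} ADD PARTITION "
            f"(dt='{t:%Y-%m-%d}', hr='{t:%H}') "
            f"LOCATION '{location}';"
        )


if __name__ == "__main__":
    # Emit statements for the first three hours of 2011-06-27.
    for stmt in alter_statements(
        "logs", "s3://bucketname/dir", datetime(2011, 6, 27, 0), 3
    ):
        print(stmt)
```

The output can be written to a file and run as a Hive script before your queries. The table would need to be declared with PARTITIONED BY (dt STRING, hr STRING), or whatever columns you pick.
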

On Tue, Jun 28, 2011 at 11:42 AM, Christopher, Pat <patrick.christop...@hp.com> wrote:
> allo,
>
> 1 dunno. I generate my EMR scripts in a separate script so generating a
> stack of ‘alter table…’ queries is easy for me
>
> 2 event_b will have a null value in column 4.
>
> 2 b ( you didn’t ask) what happens with this row:
>
> event_c user_id france 500 afifthcolumn
>
> afifthcolumn will be truncated and you’ll have only event_c through 500 in
> the row
>
> Pat
>
> From: Kennon Lee [mailto:ken...@tinyco.com]
> Sent: Monday, June 27, 2011 5:50 PM
> To: user@hive.apache.org
> Subject: loading datafiles in s3
>
> Hello,
>
> We're using hive on amazon elastic mapreduce to process logs on s3, and I
> had a couple basic questions. Apologies if they've been answered already-- I
> gathered most info from the hive tutorial on amazon
> (http://aws.amazon.com/articles/2855), as well as from skimming the hive
> wiki pages, but I'm still very new to all of this. So, questions:
>
> 1) Is it possible to partition on directories that do not have the "key="
> prefix? Our logs are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2
> and so ideally we could partition on that structure instead of adding "dt="
> to every directory name. I found an old thread discussing this
> (http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)
> but couldnt find the actual syntax.
>
> 2) How does hive handle tab-delimited files where rows sometimes have
> different column counts? For instance, if we are parsing an event log that
> contains multiple events, some of which have more columns associated with
> them:
>
> event_a user_id apple 300
> event_b user_id cat
>
> If i define my hive table to have 4 columns, how will hive react to the
> event_b row?
>
> Thanks!