Hi,

On 24 Aug 2012, at 13:26, Ravi Shetye wrote:
> I have the data in the S3 bucket in the following manner:
>
> s3://logs/ad1date1.log.gz
> s3://logs/ad1date2.log.gz
> s3://logs/ad1date3.log.gz
> s3://logs/ad1date4.log.gz
> s3://logs/ad2date1.log.gz
> s3://logs/ad2date2.log.gz
> s3://logs/ad2date3.log.gz
> s3://logs/ad2date4.log.gz

If you do

CREATE EXTERNAL TABLE analyze_files_tab (
    cookie STRING,
    d2 STRING,
    url STRING,
    d4 STRING,
    d5 STRING,
    d6 STRING,
    adv_id_dummy STRING,
    timestp STRING,
    ip STRING,
    userAgent STRING,
    stage STRING,
    d12 STRING,
    d13 STRING)
LOCATION 's3n://logs/';

you'll have all of it in one table. If you then want the results partitioned, you can do

CREATE EXTERNAL TABLE results (
    cookie STRING,
    d2 STRING,
    url STRING,
    d4 STRING,
    d5 STRING,
    d6 STRING,
    adv_id_dummy STRING,
    timestp STRING,
    ip STRING,
    userAgent STRING,
    stage STRING,
    d12 STRING,
    d13 STRING)
PARTITIONED BY (adv_id STRING, date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://somewhere-outside-the-logs-tree';

You can then

INSERT OVERWRITE TABLE results PARTITION (adv_id, date)
<your query>

Note that to use dynamic partitions you first have to run

SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;

Cheers,

Pedro

Pedro Figueiredo
Skype: pfig.89clouds
http://89clouds.com/ - Big Data Consulting
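PS: In case it helps, here's a rough sketch of what that dynamic-partition insert could look like. It assumes the advertiser id and date can be pulled out of the file name with Hive's INPUT__FILE__NAME virtual column (available from Hive 0.8), and the regexp patterns are guesses based on your adXdateY.log.gz naming, so adjust as needed:

INSERT OVERWRITE TABLE results PARTITION (adv_id, date)
SELECT cookie, d2, url, d4, d5, d6, adv_id_dummy, timestp, ip,
       userAgent, stage, d12, d13,
       -- dynamic partition columns must come last, in partition order
       regexp_extract(INPUT__FILE__NAME, 'ad(\\d+)date', 1)     AS adv_id,
       regexp_extract(INPUT__FILE__NAME, 'date(\\d+)\\.log', 1) AS date
FROM analyze_files_tab;

Since both partition columns here are dynamic, you need the nonstrict mode setting above.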