|
thanks for the reply
Let concentrate on the second case*
*|*|CREATE EXTERNAL TABLE results (cookie STRING,
d2 STRING,
url STRING,
d4 STRING,
d5 STRING,
d6 STRING,
adv_id_dummy STRING,
timestp STRING,
ip STRING,
userAgent STRING,
stage STRING,
d12 STRING,
d13 STRING)
PARTITION BY (adv_id,date) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://somewhere-outside-the-logs-tree';
<s3://logs/joined_analyze_files_hive/%27>|*
*|
|*
*|You can then|*
*|
|*
|*INSERT OVERWRITE TABLE results PARTITION (adv_id, date)
Is this all I need to do to load the data?
*how will the system know what data will go into what partition?
As I understand the partition columns should be psedo columns and not part of
the actual data.
Also if I have to load just 2 of the files say|
|s3://logs/ad1date1.log.gz and|
|s3://logs/ad2date4.log.gz how do I specify it.|
On 08/24/2012 06:16 PM, Pedro Figueiredo wrote:
Hi,
On 24 Aug 2012, at 13:26, Ravi Shetye wrote:
I have the data in s3 bucket in the following manner
|s3://logs/ad1date1.log.gz
s3://logs/ad1date2.log.gz
s3://logs/ad1date3.log.gz
s3://logs/ad1date4.log.gz
s3://logs/ad2date1.log.gz
s3://logs/ad2date2.log.gz
s3://logs/ad2date3.log.gz
s3://logs/ad2date4.log.gz
|
If you do
|CREATE EXTERNAL TABLE analyze_files_tab (cookie STRING,
d2 STRING,
url STRING,
d4 STRING,
d5 STRING,
d6 STRING,
adv_id_dummy STRING,
timestp STRING,
ip STRING,
userAgent STRING,
stage STRING,
d12 STRING,
d13 STRING)|
LOCATION 's3n://logs/' <s3n://logs/%27>;
you'll have all of it in a table. If you then want the results
partitioned, you can do
|CREATE EXTERNAL TABLE results (cookie STRING,
d2 STRING,
url STRING,
d4 STRING,
d5 STRING,
d6 STRING,
adv_id_dummy STRING,
timestp STRING,
ip STRING,
userAgent STRING,
stage STRING,
d12 STRING,
d13 STRING)
PARTITION BY (adv_id,date) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://somewhere-outside-the-logs-tree';
<s3://logs/joined_analyze_files_hive/%27>|
|
|
|You can then|
|
|
|INSERT OVERWRITE TABLE results PARTITION (adv_id, date)|
|<your query>|
|
|
|Note that to use dynamic partitions you have to first run|
|SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;|
Cheers,
Pedro
Pedro Figueiredo
Skype: pfig.89clouds
http://89clouds.com/ - Big Data Consulting