Hi Kiwon Lee,

There isn't anything specific you need to do in Hive DDL or DML to parse .gz files. You only need to ensure that 'org.apache.hadoop.io.compress.GzipCodec' is available in the 'io.compression.codecs' property within core-site.xml.
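
As a quick check from the Hive CLI, running SET with just a property name prints the value the session currently sees. This is only a sketch: it assumes your Hive session picks up the cluster's core-site.xml, and the exact codec list shown will depend on your installation.

```sql
-- Print the codec list visible to this Hive session.
-- GzipCodec should appear in the comma-separated value.
SET io.compression.codecs;
```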

To parse log files you can use RegexSerDe. A sample DDL for loading Apache log files can be found at
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData

You can create a partitioned table by using the 'PARTITIONED BY' clause when creating the table. A sample DDL is below:

    CREATE TABLE page_view(viewTime INT, userid BIGINT,
        page_url STRING, referrer_url STRING,
        ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
    STORED AS SEQUENCEFILE;

If your data is already partitioned in HDFS, you can create a partitioned table and add partitions to it by specifying the directory corresponding to each partition with an 'ALTER TABLE ADD PARTITION' statement. If the data is not partitioned in HDFS but you would like it to be partitioned in Hive, take a look at Dynamic Partition Insert.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Kiwon Lee <kiwoni....@gmail.com>
Date: Sat, 18 Aug 2012 00:29:20
To: <user@hive.apache.org>
Reply-To: user@hive.apache.org
Subject: how to handling complex log file(compressed, 200G)

Hi,

I have complex log files (compressed ".gz", 200G) on HDFS.

+ log file format:
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"

I think the DDL would be something like:

    CREATE TABLE log_tb (ip STRING, dt STRING, kv Map<STRING, STRING>)
    ROW FORMAT SERDE "??"
    STORED AS SEQUENCEFILE;

I want results like the following:

    SELECT kv['b'] FROM log_tb LIMIT 10;

1) How do I parse a complex log file (compressed ".gz", 200G)?
2) If I have to use a SerDe, which SerDe should I use?
3) Is there an existing SerDe (input/output) written as a user-defined class?
4) If I partition the log files, what DDL and DML should I use? Please provide sample SQL (DDL, DML).

Thanks.
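
Putting the suggestions from the reply together for the log format in the question, a minimal sketch might look like the following. Several things here are assumptions rather than anything stated in the thread: the jar path, the HDFS locations, the table name log_raw, the column name kv_raw, and the partition column log_date are all hypothetical. It uses the contrib RegexSerDe, which only supports STRING columns, so the key/value field is read as a plain string and turned into a map at query time with str_to_map. The table is declared as TEXTFILE because the .gz files are compressed text, not sequence files.

```sql
-- Hypothetical path: point this at the hive-contrib jar of your installation.
ADD JAR /path/to/hive-contrib.jar;

-- External table over the existing .gz text files.
-- The regex captures three groups: ip, the bracketed date, the quoted key/value string.
CREATE EXTERNAL TABLE log_raw (ip STRING, dt STRING, kv_raw STRING)
PARTITIONED BY (log_date STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) \\[([^\\]]*)\\] "([^"]*)"'
)
STORED AS TEXTFILE
LOCATION '/logs/raw';

-- Register a directory that already holds one day's files (hypothetical layout).
ALTER TABLE log_raw ADD PARTITION (log_date = '2012-08-08')
LOCATION '/logs/raw/2012-08-08';

-- Split "a=abc&b=adf&c=aadfad" into a map and read key 'b'.
SELECT str_to_map(kv_raw, '&', '=')['b']
FROM log_raw
WHERE log_date = '2012-08-08'
LIMIT 10;
```

Lines that do not match the regex come back as NULL columns, so it is worth testing the pattern on a small sample before pointing it at the full 200G.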