Thanks, Bejoy KS. I know that the output of RegexSerDe is a string array, so the RegexSerDe sample (https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData) uses "STORED AS TEXTFILE".
But I am trying to use compressed files, and "STORED AS SEQUENCEFILE" supports compressed files. Do I have to use "STORED AS SEQUENCEFILE" with RegexSerDe?

CREATE TABLE log_tb (ip STRING, dt STRING, kv MAP<STRING, STRING>)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "~~~~~",
  "output.format.string" = "%1$s %2$s"
)
*STORED AS SEQUENCEFILE;*

---------------------------------------------
When I use "STORED AS TEXTFILE", the file blocks are not split on newline (\n) boundaries.

*So, is the combination of RegexSerDe and "STORED AS SEQUENCEFILE" right?*

Below are the actual split files. I have complex log files (compressed ".gz", 200G) on HDFS.

*+ split #1 (64M)*
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
.....
.....
127.0.0.1 [2012Avg

*+ split #2 (64M)*
08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
.....
.....
127.0.0.1 [2012Avg08] "a=

*+ split #3 (64M)*
abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
.....
.....
127.0

*+ split #4 (64M)*
.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
.....
.....
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"

2012/8/18 Bejoy KS <bejoy...@yahoo.com>

> Hi Kiwon Lee
>
> There isn't anything specific you need to do in Hive DDL or DML to parse
> gz files. You need to ensure that 'org.apache.hadoop.io.compress.GzipCodec'
> is available in the 'io.compression.codecs' property within core-site.xml.
>
> To parse log files you can use RegexSerDe.
> A sample DDL for loading Apache log files can be found at
> https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData
>
> You can create a partitioned table by using the 'PARTITIONED BY' clause
> while creating a table. A sample DDL below:
>
> CREATE TABLE page_view(viewTime INT, userid BIGINT,
>     page_url STRING, referrer_url STRING,
>     ip STRING COMMENT 'IP Address of the User')
> COMMENT 'This is the page view table'
> PARTITIONED BY(dt STRING, country STRING)
> ROW FORMAT DELIMITED
>     FIELDS TERMINATED BY '1'
> STORED AS SEQUENCEFILE;
>
> If your data is already partitioned in HDFS then you can create a
> partitioned table and add partitions to it by specifying the dir
> corresponding to each partition using an 'ALTER TABLE ADD PARTITION' statement.
>
> If the data is not partitioned in HDFS but you would like it to be
> partitioned in Hive, then you can take a look at Dynamic Partition Insert.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From:* Kiwon Lee <kiwoni....@gmail.com>
> *Date:* Sat, 18 Aug 2012 00:29:20 +0900
> *To:* <user@hive.apache.org>
> *Reply-To:* user@hive.apache.org
> *Subject:* how to handling complex log file(compressed, 200G)
>
> Hi,
>
> I have complex log files (compressed ".gz", 200G) on HDFS.
>
> + log file format:
> 127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
>
> I think the DDL would be:
> CREATE TABLE log_tb (ip STRING, dt STRING, kv MAP<STRING, STRING>)
> ROW FORMAT SERDE "??"
> STORED AS SEQUENCEFILE;
>
> I want the results below:
> SELECT kv['b']
> FROM log_tb
> LIMIT 10;
>
> 1) How do I parse the complex log files (compressed ".gz", 200G)?
>
> 2) If I have to use a SerDe, which SerDe should I use?
>
> 3) Can a SerDe (input/output) be implemented as a user-defined class?
>
> 4) If I want to partition the log files, how do I write the DDL and DML?
> Please share sample SQL (DDL, DML).
>
> Thanks.
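P.S. For anyone following the thread, here is a small Python sketch of what RegexSerDe effectively does per row for a log format like mine. The regex below is only an illustration (the real "input.regex" in my DDL is elided as "~~~~~"), and parse_line/LOG_RE are hypothetical names, not part of Hive:

```python
import re

# Assumed pattern for the sample log line from this thread:
#   127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
LOG_RE = re.compile(r'^(\S+) \[([^\]]+)\] "([^"]*)"$')

def parse_line(line):
    """Split one log line into (ip, dt, kv), mirroring how RegexSerDe
    would populate the three columns of log_tb."""
    m = LOG_RE.match(line)
    if m is None:
        # Rows that don't match the regex come back as NULL columns in Hive
        return None
    ip, dt, query = m.groups()
    # Build the Map<STRING, STRING> column from the key=value query string
    kv = dict(pair.split("=", 1) for pair in query.split("&") if "=" in pair)
    return ip, dt, kv

ip, dt, kv = parse_line('127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"')
print(kv["b"])  # the value that SELECT kv['b'] would return for this row
```

Note that this per-line splitting only works if each record arrives whole, which is why the mid-line block splits shown above are the real concern.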
--
____________
From. Kiwon Lee (Ethan Lee)
kiwoni....@samsung.com
kiwoni....@gmail.com