Thanks, Bejoy KS

I know that the output of RegexSerDe is a string array,
so the RegexSerDe sample (
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData)
uses "STORED AS TEXTFILE".

But I am trying to use compressed files, and "STORED AS SEQUENCEFILE" supports
compressed files.
Do I have to use "STORED AS SEQUENCEFILE" with RegexSerDe?

CREATE TABLE log_tb (ip STRING, dt STRING, kv Map<STRING, STRING>)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
   "input.regex" = "~~~~~"
   "output.format.string" = "%1$s %2$s"
)
*STORED AS SEQUENCEFILE;*
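
For reference, here is a rough sketch of what I am thinking of. The regex, the
table name and the str_to_map() step are only my assumptions (the contrib
RegexSerDe emits strings only, so the kv map would have to be built at query
time), and "STORED AS TEXTFILE" here just mirrors the wiki sample:

CREATE TABLE log_raw (ip STRING, dt STRING, kv STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
   "input.regex" = "([^ ]*) \\[([^\\]]*)\\] \"(.*)\"",
   "output.format.string" = "%1$s %2$s %3$s"
)
STORED AS TEXTFILE;

SELECT str_to_map(kv, '&', '=')['b']
FROM log_raw
LIMIT 10;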

---------------------------------------------

I use a "STORED AS TEXTFILE", do not splited with "carriage return(\n)"

*So, is it right to use RegexSerDe with "STORED AS SEQUENCEFILE"?*

Below are the actual split files.
I have complex log files (compressed ".gz", 200G) on HDFS.

*+ split #1 (64M)*
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
.....
.....
127.0.0.1 [2012Avg


*+ split #2 (64M)*
08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
.....
.....
127.0.0.1 [2012Avg08] "a=


*+ split #3 (64M)*
abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
.....
.....
127.0


*+ split #4 (64M)*
.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
.....
.....
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
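
What I am considering (just a sketch; the table names are placeholders) is to
load the raw .gz files into a plain TEXTFILE staging table first and then
re-write them into a block-compressed SequenceFile table from Hive:

CREATE TABLE log_text_staging (line STRING)
STORED AS TEXTFILE;

CREATE TABLE log_seq (line STRING)
STORED AS SEQUENCEFILE;

SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

INSERT OVERWRITE TABLE log_seq
SELECT line FROM log_text_staging;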



2012/8/18 Bejoy KS <bejoy...@yahoo.com>

> Hi Kiwon Lee
>
> There isn't anything specific you need to do in Hive DDL or DML to parse
> gz files. You need to ensure that 'org.apache.hadoop.io.compress.GzipCodec'
> is available in the 'io.compression.codecs' property in core-site.xml.
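>
> A quick sanity check, assuming the property is picked up from core-site.xml,
> is to print the registered codec list from the Hive CLI:
>
> SET io.compression.codecs;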
>
> To parse log files you can use RegexSerDe. A sample DDL for loading Apache
> log files can be found at
>
> https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData
>
>
> You can create a partitioned table by using the 'PARTITIONED BY' clause
> while creating a table. A sample DDL is below:
>
> CREATE TABLE page_view(viewTime INT, userid BIGINT,
> page_url STRING, referrer_url STRING,
> ip STRING COMMENT 'IP Address of the User')
> COMMENT 'This is the page view table'
> PARTITIONED BY(dt STRING, country STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\001'
> STORED AS SEQUENCEFILE;
>
> If your data is already partitioned in hdfs then you can create a
> partitioned table and add partitions to the table by specifying the dir
> corresponding to each partition using the 'ALTER TABLE ADD PARTITION' statement.
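>
> For example, a minimal sketch (the partition spec and the hdfs path here are
> just placeholders):
>
> ALTER TABLE page_view ADD PARTITION (dt='2012-08-08', country='US')
> LOCATION '/user/kiwon/logs/dt=2012-08-08/country=US';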
>
> If the data is not partitioned in hdfs but you would like it partitioned in
> hive, then you can take a look at Dynamic Partition Insert.
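>
> A minimal sketch of such an insert (table and column names are illustrative
> and assume a staging table that already holds the unpartitioned rows):
>
> SET hive.exec.dynamic.partition=true;
> SET hive.exec.dynamic.partition.mode=nonstrict;
>
> INSERT OVERWRITE TABLE page_view PARTITION (dt, country)
> SELECT viewTime, userid, page_url, referrer_url, ip, dt, country
> FROM page_view_staging;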
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * Kiwon Lee <kiwoni....@gmail.com>
> *Date: *Sat, 18 Aug 2012 00:29:20 +0900
> *To: *<user@hive.apache.org>
> *ReplyTo: * user@hive.apache.org
> *Subject: *how to handling complex log file(compressed, 200G)
>
> Hi,
>
> I have complex log files (compressed ".gz", 200G) on HDFS.
>
> + log file format :
> 127.0.0.1 [2012Avg08] "a=abc&b=adf&c=aadfad"
>
> I think the DDL would be something like:
> CREATE TABLE log_tb (ip STRING, dt STRING, kv Map<STRING, STRING>)
> ROW FORMAT SERDE "??"
> STORED AS SEQUENCEFILE;
>
> I want the results below.
> SELECT kv['b']
> FROM log_tb
> LIMIT 10;
>
>
> 1) How do I parse the complex log files (compressed ".gz", 200G)?
>
> 2) If I have to use a SerDe, which SerDe should I use?
>
> 3) Is there an existing SerDe (input/output) for this, or do I need to write
> a user-defined class?
>
> 4) If I want to partition the log files, what would the DDL and DML look
> like? A sample SQL (DDL, DML) would be appreciated.
>
>
> Thanks.
>



-- 
____________
*From: Kiwon Lee (Ethan Lee)*
         kiwoni....@samsung.com
         kiwoni....@gmail.com
