Hi all,

Does Hive support using a custom MapReduce script to load data into a
table with static, hybrid, or dynamic partitions?

I'm working on a log analysis program; the input is, for instance, a Tomcat
log. First, I import the raw log file into Hive; then I do some analysis
(e.g. parsing URIs to detect vulnerabilities) and store the results in
another Hive table, adding several columns in the process. I've achieved
all of the above with a custom MapReduce script written in Perl.

What disappoints me is that the queries are quite slow, so I decided to
speed them up with Hive (dynamic) partitions. Unfortunately, I didn't
succeed, which made me wonder whether Hive supports using a custom
MapReduce script to load data into a partitioned table at all.
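
As far as I understand the wiki, a fully dynamic partition insert also
needs session settings along these lines (please correct me if I have them
wrong):

SET hive.exec.dynamic.partition=true;
-- nonstrict, because the only partition column, partition_date, is dynamic
SET hive.exec.dynamic.partition.mode=nonstrict;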

Here is the related HiveQL script:
------------------------------------------------------------------------------------------
------------------------------ original table, tomcat0
CREATE EXTERNAL TABLE IF NOT EXISTS tomcat0
(full_time STRING, client_ip STRING, hyphen STRING, server_ip STRING,
method STRING, uri STRING, protocol STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
WITH SERDEPROPERTIES (
'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
'quote.delim'='("|\\[|\\])',
'field.delim'=' ',
'serialization.null.format'='-')
STORED AS TEXTFILE
LOCATION '/var/hive_data/logdata_external';

LOAD DATA LOCAL INPATH '/home/hduser/tomcat_log' OVERWRITE INTO TABLE
tomcat0;

------------------------------ target table, tomcat1
CREATE TABLE IF NOT EXISTS tomcat1
(time TIMESTAMP, time_zone STRING, client_ip STRING, hyphen STRING,
server_ip STRING, method STRING, uri STRING, protocol STRING,
uri_attack_type INT)
PARTITIONED BY (partition_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

ADD FILE /home/hduser/Hive.pl/Tomcat.pl;

INSERT OVERWRITE TABLE tomcat1 PARTITION (partition_date)
SELECT
TRANSFORM (full_time, client_ip, hyphen, server_ip, method, uri, protocol)
USING '/usr/bin/perl Tomcat.pl'
AS (time, time_zone, client_ip, hyphen, server_ip, method, uri, protocol,
uri_attack_type, partition_date)
FROM tomcat0
DISTRIBUTE BY partition_date LIMIT 2;
------------------------------------------------------------------------------------------
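
(One detail I'm unsure about: I put partition_date last in the AS list
because, as I understand it, the dynamic partition column has to be the
last one handed back to Hive.)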

A short explanation: I'm trying to partition dynamically by date (a plain
string, not a Hive TIMESTAMP), and the partition column is generated by the
custom MapReduce script itself. I don't know whether Hive accepts that.
I also tried adding the static partition in advance:
 ALTER TABLE tomcat1 ADD PARTITION (partition_date='2013-08-31');
but still no luck. :-(
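
If I go fully static instead, my understanding is that the partition value
moves into the PARTITION clause and the SELECT must then supply only the
nine non-partition columns; roughly like this (an untested sketch, which
assumes Tomcat.pl is changed to drop the trailing partition_date column):

INSERT OVERWRITE TABLE tomcat1 PARTITION (partition_date='2012-08-31')
SELECT
TRANSFORM (full_time, client_ip, hyphen, server_ip, method, uri, protocol)
USING '/usr/bin/perl Tomcat.pl'
AS (time, time_zone, client_ip, hyphen, server_ip, method, uri, protocol,
uri_attack_type)
FROM tomcat0;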

The typical input/output of the custom MapReduce script, Tomcat.pl (the
columns are tab-separated, which is how TRANSFORM streams rows to and from
the script by default):
INPUT (7 columns):
31/Aug/2012:00:14:20+0800  202.104.70.250  -  192.168.202.245  GET  /wxcs/mains.html?1=1'  HTTP/1.1
OUTPUT (10 columns):
2012-08-31 00:14:20  +0800  202.104.70.250  -  192.168.202.245  GET  /wxcs/mains.html?1=1'  HTTP/1.1  1  2012-08-31
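
To take the INSERT out of the equation, I suppose the transform step can
be tested on its own with something like:

SELECT
TRANSFORM (full_time, client_ip, hyphen, server_ip, method, uri, protocol)
USING '/usr/bin/perl Tomcat.pl'
AS (time, time_zone, client_ip, hyphen, server_ip, method, uri, protocol,
uri_attack_type, partition_date)
FROM tomcat0 LIMIT 5;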

Error messages are typically like this:
------------------------------------------------------------------------------------------
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20001]:
An error occurred while reading or writing to your custom script. It may
have crashed with an error.
        at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:410)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
        ... 9 more
------------------------------------------------------------------------------------------

I don't think the problem is actually in my Perl script, even though Hive's
error message points at it.
Sorry for such a long e-mail. :-(
Any input is appreciated.  :-)
