Hi all, I'd like to know whether Hive supports using a custom MapReduce script to load data into a table with static/hybrid/dynamic partitions.
I'm working on a log analysis program for Tomcat access logs. First, I import the raw/original log file into Hive; then I do some analysis (e.g. parsing URIs to detect vulnerabilities) and store the results in another Hive table (probably adding several columns). I have achieved all of this with a custom MapReduce script written in Perl, but what disappoints me is that the queries are quite slow, so I decided to speed them up using Hive (dynamic) partitions. Unfortunately, I didn't succeed, and I began to wonder whether Hive supports using a custom MapReduce script to load data into a partitioned table at all.

Here is the related HiveQL script:

------------------------------------------------------------------------------------------
------------------------------ original table, tomcat0
CREATE EXTERNAL TABLE IF NOT EXISTS tomcat0 (
  full_time STRING, client_ip STRING, hyphen STRING, server_ip STRING,
  method STRING, uri STRING, protocol STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
WITH SERDEPROPERTIES (
  'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
  'quote.delim'='("|\\[|\\])',
  'field.delim'=' ',
  'serialization.null.format'='-')
STORED AS TEXTFILE
LOCATION '/var/hive_data/logdata_external';

LOAD DATA LOCAL INPATH '/home/hduser/tomcat_log' OVERWRITE INTO TABLE tomcat0;

------------------------------ target table, tomcat1
CREATE TABLE IF NOT EXISTS tomcat1 (
  time TIMESTAMP, time_zone STRING, client_ip STRING, hyphen STRING,
  server_ip STRING, method STRING, uri STRING, protocol STRING,
  uri_attack_type INT)
PARTITIONED BY (partition_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

ADD FILE /home/hduser/Hive.pl/Tomcat.pl;

INSERT OVERWRITE TABLE tomcat1 PARTITION (partition_date)
SELECT TRANSFORM (full_time, client_ip, hyphen, server_ip, method, uri, protocol)
USING '/usr/bin/perl Tomcat.pl'
AS (time, time_zone, client_ip, hyphen, server_ip, method, uri, protocol, uri_attack_type, partition_date)
FROM tomcat0
DISTRIBUTE BY partition_date LIMIT 2;
------------------------------------------------------------------------------------------

A short explanation: I'm trying to use dynamic partitioning by date (not a Hive TIMESTAMP), and the partition column is generated by the custom MapReduce script. I don't know whether that is acceptable in Hive. I also tried to add a static partition in advance:

ALTER TABLE tomcat1 ADD PARTITION (partition_time='2013-08-31');

but still no luck. :-(

The typical input/output of the custom MapReduce script, Tomcat.pl:

INPUT (7 columns):
31/Aug/2012:00:14:20+0800 202.104.70.250 - 192.168.202.245 GET /wxcs/mains.html?1=1' HTTP/1.1

OUTPUT (10 columns):
2012-08-31 00:14:20 +0800 202.104.70.250 - 192.168.202.245 GET /wxcs/mains.html?1=1' HTTP/1.1 1 2012-08-31
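For reference, below is a minimal sketch of what Tomcat.pl roughly does (simplified; the month mapping and the "attack" rule here are only illustrative, the real script does more URI checks). It reads the 7 TRANSFORM columns as tab-separated fields on STDIN and writes the 10 output columns tab-separated to STDOUT, with the dynamic partition column (partition_date) last:

#!/usr/bin/perl
# Simplified sketch of Tomcat.pl (illustrative only).
# Input : 7 tab-separated columns fed by Hive's TRANSFORM.
# Output: 10 tab-separated columns, partition_date last.
use strict;
use warnings;

my %mon = (Jan=>'01', Feb=>'02', Mar=>'03', Apr=>'04', May=>'05', Jun=>'06',
           Jul=>'07', Aug=>'08', Sep=>'09', Oct=>'10', Nov=>'11', Dec=>'12');

while (my $line = <STDIN>) {
    chomp $line;
    my ($full_time, $client_ip, $hyphen, $server_ip, $method, $uri, $protocol)
        = split /\t/, $line;

    # "31/Aug/2012:00:14:20+0800" -> "2012-08-31 00:14:20" and "+0800"
    my ($d, $m, $y, $hms, $tz) =
        $full_time =~ m{^(\d{2})/(\w{3})/(\d{4}):(\d{2}:\d{2}:\d{2})([+-]\d{4})$};
    next unless defined $tz;    # skip lines the pattern doesn't recognize

    my $date = "$y-$mon{$m}-$d";
    my $time = "$date $hms";

    # Naive placeholder for the real URI analysis: flag URIs containing a quote.
    my $uri_attack_type = ($uri =~ /'/) ? 1 : 0;

    print join("\t", $time, $tz, $client_ip, $hyphen, $server_ip, $method,
               $uri, $protocol, $uri_attack_type, $date), "\n";
}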
Error messages are typically like this:

------------------------------------------------------------------------------------------
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20001]: An error occurred while reading or writing to your custom script. It may have crashed with an error.
    at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:410)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
    ... 9 more
------------------------------------------------------------------------------------------

I don't think the problem is in the Perl script itself, even though that is what Hive complains about.

Sorry for such a long e-mail. :-( Any input is appreciated. :-)