Hi,

You could use HCatalog for reading and writing data in Hive metastore
tables. HCatalog provides the HCatLoader/HCatStorer interfaces for Pig
and the HCatInputFormat/HCatOutputFormat interfaces for MR jobs to
perform table reads/writes, including support for static/dynamic
partitions.
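If you would rather keep the TRANSFORM approach from your mail below,
Hive itself can also load a partitioned table from a custom script, as
long as dynamic partitioning is enabled and the partition column is the
last column the script emits. A minimal, untested sketch using the
table and script names from your mail:

-- Enable dynamic partitioning; "nonstrict" allows every partition
-- column to be dynamic (strict mode demands at least one static one).
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

add file /home/hduser/Hive.pl/Tomcat.pl;

-- The dynamic partition column (partition_date) must come last in
-- the AS (...) list, matching the PARTITION clause.
INSERT OVERWRITE TABLE tomcat1 PARTITION (partition_date)
SELECT TRANSFORM (full_time, client_ip, hyphen, server_ip, method, uri, protocol)
  USING '/usr/bin/perl Tomcat.pl'
  AS (time, time_zone, client_ip, hyphen, server_ip, method, uri,
      protocol, uri_attack_type, partition_date)
FROM tomcat0;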
Thanks,
Satish

On Thu, Dec 26, 2013 at 1:59 PM, zeekvfu <wzmind...@gmail.com> wrote:
> Hi all,
>
> I want to know whether Hive supports using a custom MapReduce script
> to load data into a table with static/hybrid/dynamic partitions.
>
> I'm working on a log analysis program, for instance on the Tomcat
> log. First, I import the raw/original log file into Hive; then I do
> some analysis (e.g. parse URIs to detect vulnerabilities) and store
> the results in another Hive table (probably adding several columns).
> I've achieved what I've described above using a custom MapReduce
> script written in Perl.
>
> What disappoints me is that the query is quite slow, so I decided to
> speed it up using Hive (dynamic) partitions. Unfortunately, I didn't
> succeed. Then I began to wonder whether Hive supports using a custom
> MapReduce script to load data into a partitioned table at all.
>
> Here is the related HiveQL script:
>
> --------------------------------------------------------------------
> ------------------------------ original table, tomcat0
> CREATE EXTERNAL TABLE IF NOT EXISTS tomcat0
>   (full_time STRING, client_ip STRING, hyphen STRING,
>    server_ip STRING, method STRING, uri STRING, protocol STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
> WITH SERDEPROPERTIES (
>   'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
>   'quote.delim'='("|\\[|\\])',
>   'field.delim'=' ',
>   'serialization.null.format'='-')
> STORED AS TEXTFILE
> LOCATION '/var/hive_data/logdata_external';
>
> LOAD DATA LOCAL INPATH '/home/hduser/tomcat_log'
> OVERWRITE INTO TABLE tomcat0;
>
> ------------------------------ target table, tomcat1
> CREATE TABLE IF NOT EXISTS tomcat1
>   (time TIMESTAMP, time_zone STRING, client_ip STRING, hyphen STRING,
>    server_ip STRING, method STRING, uri STRING, protocol STRING,
>    uri_attack_type INT)
> PARTITIONED BY (partition_date STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
>
> ADD FILE /home/hduser/Hive.pl/Tomcat.pl;
>
> INSERT OVERWRITE TABLE tomcat1 PARTITION (partition_date)
> SELECT
>   TRANSFORM (full_time, client_ip, hyphen, server_ip, method, uri, protocol)
>   USING '/usr/bin/perl Tomcat.pl'
>   AS (time, time_zone, client_ip, hyphen, server_ip, method, uri,
>       protocol, uri_attack_type, partition_date)
> FROM tomcat0
> DISTRIBUTE BY partition_date LIMIT 2;
> --------------------------------------------------------------------
>
> A short explanation: I'm trying to use a dynamic partition keyed by
> date (not a Hive timestamp), and the partition column is generated
> by the custom MapReduce script. I don't know whether that's
> acceptable in Hive.
> I also tried to add a static partition in advance:
>   alter table tomcat1 add partition (partition_time='2013-08-31');
> but still no luck. :-(
>
> The typical input/output of the custom MapReduce script, Tomcat.pl:
> INPUT (7 columns):
>   31/Aug/2012:00:14:20+0800 202.104.70.250 - 192.168.202.245 GET
>   /wxcs/mains.html?1=1' HTTP/1.1
> OUTPUT (10 columns):
>   2012-08-31 00:14:20 +0800 202.104.70.250 - 192.168.202.245 GET
>   /wxcs/mains.html?1=1' HTTP/1.1 1 2012-08-31
>
> Error messages are typically like this:
>
> --------------------------------------------------------------------
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error
> 20001]: An error occurred while reading or writing to your custom
> script. It may have crashed with an error.
>     at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:410)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>     at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>     at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>     at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
>     ... 9 more
> --------------------------------------------------------------------
>
> I don't think the problem is in my Perl script itself, though that
> is what Hive complains about.
> Sorry for such a long e-mail. :-(
> Any input is appreciated. :-)
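One more thing to check on the [Error 20001] above: that error from
ScriptOperator usually just means the child script exited abnormally or
its output did not line up with what Hive expected. TRANSFORM output
must be tab-delimited, one row per line, with exactly the columns named
in AS (...). A quick way to isolate it is to run the transform without
the partitioned insert (same script and column list as in your mail):

add file /home/hduser/Hive.pl/Tomcat.pl;

-- Inspect a few transformed rows directly; if this query fails or the
-- column count is wrong, the INSERT into tomcat1 will fail the same way.
SELECT TRANSFORM (full_time, client_ip, hyphen, server_ip, method, uri, protocol)
  USING '/usr/bin/perl Tomcat.pl'
  AS (time, time_zone, client_ip, hyphen, server_ip, method, uri,
      protocol, uri_attack_type, partition_date)
FROM tomcat0 LIMIT 5;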