[ 
https://issues.apache.org/jira/browse/HIVE-4765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14081584#comment-14081584
 ] 

Sushanth Sowmyan commented on HIVE-4765:
----------------------------------------

[~navis], this patch is an exciting one for me, because I've long wanted to 
work on introducing OutputCommitter semantics into hive. And given that we've 
wanted to revamp the hbase bulk load as well for a while, this is a double-win 
for me.

That said, I do have a few thoughts on the introduction of the 
HiveOutputCommitter.

a) I like that you added a completed() along with the commit(), which allows 
signalling the end of the commit process. This is a good addition. I would 
also have liked a failed() or equivalent, so that we can signal that 
something on our end failed, say while moving files.
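
To make (a) concrete, here is a minimal sketch of the callback contract I 
have in mind. Everything in it is hypothetical: SketchCommitter, CommitDriver 
and RecordingCommitter are illustrative names, not the interface or the 
signatures from the patch. It only shows the idea that exactly one of 
completed() or failed() gets signalled after commit().

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape for illustration; NOT the interface from the patch.
interface SketchCommitter {
    void commit() throws Exception;   // does the work, e.g. moving files
    void completed();                 // signalled after commit() succeeded
    void failed(Exception cause);     // proposed: signalled when commit() threw
}

// Hypothetical driver: runs commit() and signals exactly one outcome.
final class CommitDriver {
    private CommitDriver() {}

    static boolean run(SketchCommitter committer) {
        try {
            committer.commit();
        } catch (Exception e) {
            committer.failed(e);      // the committer gets a chance to clean up
            return false;
        }
        committer.completed();
        return true;
    }
}

// Toy committer that records the callbacks it receives, in order.
final class RecordingCommitter implements SketchCommitter {
    final List<String> events = new ArrayList<>();
    private final boolean failOnCommit;

    RecordingCommitter(boolean failOnCommit) {
        this.failOnCommit = failOnCommit;
    }

    @Override
    public void commit() throws Exception {
        events.add("commit");
        if (failOnCommit) {
            throw new IOException("simulated move failure");
        }
    }

    @Override
    public void completed() {
        events.add("completed");
    }

    @Override
    public void failed(Exception cause) {
        events.add("failed: " + cause.getMessage());
    }
}
```

With something like this, a committer whose file move fails sees commit() 
followed by failed(), and never sees completed().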

b) One of my pet peeves with HiveOutputFormat in general is the impedance 
mismatches in RecordWriter vs. HiveRecordWriter, and the lack of an 
OutputCommitter has meant that generic OutputFormats would need to be ported 
over to Hive, or developed completely within hive, rather than being usable 
as-is. Thus, one of my major goals for introducing an OutputCommitter semantic 
would be to reduce that mismatch, and move hive towards being able to consume a 
generic M/R IF / OF with no additional work. To this end, I'm a little wary of 
introducing a HiveOutputCommitter that will have a similar mismatch, one that 
needs to be "fixed" in the way that the HiveRecordWriter needs to be. If 
people implement the interface currently being introduced, we will then have 
to worry about breaking them when we clean up the interface.

c) I would prefer HiveOutputFormat to have a method to create/return an output 
committer (with a default impl returning null), rather than extend 
HiveOutputCommitter. This matches the M/R form more closely and will make it 
easier to bridge that gap, I think.
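
A rough sketch of what (c) could look like, under stated assumptions: the 
types below (SketchOutputFormat, SketchOutputCommitter, JobFinisher and the 
two toy formats) are hypothetical stand-ins, not Hive or Hadoop classes, and 
the Java 8 default method is used only for brevity (an abstract base class 
would give the same "default impl returning null"). The point is that the 
output format hands out a committer, much as the M/R OutputFormat does via 
getOutputCommitter, instead of being one.

```java
// Hypothetical stand-in for a committer; NOT a Hive or Hadoop type.
interface SketchOutputCommitter {
    void commit();
}

// Hypothetical stand-in for HiveOutputFormat with the proposed method.
interface SketchOutputFormat {
    // Default: no committer, so existing formats need no changes.
    default SketchOutputCommitter getOutputCommitter() {
        return null;
    }
}

// An existing format that knows nothing about committers.
final class PlainFormat implements SketchOutputFormat {
}

// A format that needs a commit step, e.g. a bulk-load style output.
final class BulkLoadFormat implements SketchOutputFormat {
    int commits = 0;

    @Override
    public SketchOutputCommitter getOutputCommitter() {
        return () -> commits++;   // toy commit: just counts invocations
    }
}

// Hypothetical caller: commits only when the format supplies a committer.
final class JobFinisher {
    private JobFinisher() {}

    static boolean finish(SketchOutputFormat format) {
        SketchOutputCommitter committer = format.getOutputCommitter();
        if (committer == null) {
            return false;         // nothing to commit for this format
        }
        committer.commit();
        return true;
    }
}
```

In this shape, formats that do not care about committing stay untouched, and 
the caller only has to handle the null case.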

Also, if there was any particular reason you intentionally avoided the M/R 
Committer idiom, I'd be happy to hear that as well, and we can think on how to 
create a generic M/R storage handler to wrap generic M/R IF/OFs if need be.

> Improve HBase bulk loading facility
> -----------------------------------
>
>                 Key: HIVE-4765
>                 URL: https://issues.apache.org/jira/browse/HIVE-4765
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>         Attachments: HIVE-4765.2.patch.txt, HIVE-4765.3.patch.txt, 
> HIVE-4765.D11463.1.patch
>
>
> With some patches, bulk loading process for HBase could be simplified a lot.
> {noformat}
> CREATE EXTERNAL TABLE hbase_export(rowkey STRING, col1 STRING, col2 STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseExportSerDe'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:key,cf2:value")
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileExporter'
> LOCATION '/tmp/export';
> SET mapred.reduce.tasks=4;
> set hive.optimize.sampling.orderby=true;
> INSERT OVERWRITE TABLE hbase_export
> SELECT * from (SELECT union_kv(key,key,value,":key,cf1:key,cf2:value") as (rowkey,union) FROM src) A ORDER BY rowkey,union;
> hive> !hadoop fs -lsr /tmp/export;
> drwxr-xr-x   - navis supergroup          0 2013-06-20 11:05 /tmp/export/cf1
> -rw-r--r--   1 navis supergroup       4317 2013-06-20 11:05 /tmp/export/cf1/384abe795e1a471cac6d3770ee38e835
> -rw-r--r--   1 navis supergroup       5868 2013-06-20 11:05 /tmp/export/cf1/b8b6d746c48f4d12a4cf1a2077a28a2d
> -rw-r--r--   1 navis supergroup       5214 2013-06-20 11:05 /tmp/export/cf1/c8be8117a1734bd68a74338dfc4180f8
> -rw-r--r--   1 navis supergroup       4290 2013-06-20 11:05 /tmp/export/cf1/ce41f5b1cfdc4722be25207fc59a9f10
> drwxr-xr-x   - navis supergroup          0 2013-06-20 11:05 /tmp/export/cf2
> -rw-r--r--   1 navis supergroup       6744 2013-06-20 11:05 /tmp/export/cf2/409673b517d94e16920e445d07710f52
> -rw-r--r--   1 navis supergroup       4975 2013-06-20 11:05 /tmp/export/cf2/96af002a6b9f4ebd976ecd83c99c8d7e
> -rw-r--r--   1 navis supergroup       6096 2013-06-20 11:05 /tmp/export/cf2/c4f696587c5e42ee9341d476876a3db4
> -rw-r--r--   1 navis supergroup       4890 2013-06-20 11:05 /tmp/export/cf2/fd9adc9e982f4fe38c8d62f9a44854ba
> hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/export test
> {noformat}



