RE: Bulk-load to HBase

innowireless TaeYun Kim Fri, 19 Sep 2014 05:28:52 -0700

In fact, it seems that Put can be used by HFileOutputFormat, so the Put object 
itself may not be the problem.

The difference is that TableOutputFormat uses the Put object in the normal way 
(it goes through the normal write path), while HFileOutputFormat uses it to 
build the HFile directly.
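
For reference, here is a minimal sketch of what the bulk-load path could look 
like from an RDD. It is untested and rests on several assumptions: the HBase 
0.98-era classes (HTable, HFileOutputFormat, LoadIncrementalHFiles; newer 
releases ship HFileOutputFormat2 with different signatures), a Spark version 
that has RDD.sortBy, a single column family, and the hypothetical names 
BulkLoadSketch, bulkLoad, family and stagingDir. As far as I can tell, the 
record writer of HFileOutputFormat takes KeyValue rather than Put (in a 
MapReduce job a sort reducer converts Put to KeyValue first), and it expects 
cells in sorted order, so the sketch sorts by row key and qualifier before 
writing.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x
import org.apache.spark.rdd.RDD

// Hypothetical helper object; all names here are illustrative only.
object BulkLoadSketch {

  // Unsigned lexicographic byte order, the order HBase uses for row keys
  // and qualifiers. Array[Byte] has no built-in Ordering in Scala.
  implicit val byteArrayOrdering: Ordering[Array[Byte]] =
    new Ordering[Array[Byte]] {
      def compare(a: Array[Byte], b: Array[Byte]): Int = Bytes.compareTo(a, b)
    }

  // rows: (row key, column qualifier, value), as in the example quoted below.
  // stagingDir: assumed to be a not-yet-existing directory for the HFiles.
  def bulkLoad(rows: RDD[(Array[Byte], Array[Byte], Array[Byte])],
               tableName: String,
               family: Array[Byte],
               stagingDir: String): Unit = {
    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, tableName)
    try {
      // Let HBase fill in the output settings (compression and so on)
      // from the table definition.
      val job = Job.getInstance(conf)
      job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
      job.setMapOutputValueClass(classOf[KeyValue])
      HFileOutputFormat.configureIncrementalLoad(job, table)

      // Sort on the plain byte arrays first, then build the HBase objects.
      // ImmutableBytesWritable and KeyValue are not java.io.Serializable,
      // so they are created only after the shuffle.
      val cells = rows
        .sortBy(t => (t._1, t._2))
        .map { case (rowKey, qualifier, value) =>
          (new ImmutableBytesWritable(rowKey),
           new KeyValue(rowKey, family, qualifier, value))
        }

      // Step 1: write HFiles directly, bypassing the normal write path.
      cells.saveAsNewAPIHadoopFile(
        stagingDir,
        classOf[ImmutableBytesWritable],
        classOf[KeyValue],
        classOf[HFileOutputFormat],
        job.getConfiguration)

      // Step 2: hand the finished HFiles over to the region servers.
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(stagingDir), table)
    } finally {
      table.close()
    }
  }
}
```

A call would look like BulkLoadSketch.bulkLoad(rddToSave, tableName, 
COLUMN_FAMILY_RAW_DATA_BYTES, "/tmp/hfiles"), where the staging path is only a 
placeholder and should be on the same HDFS that HBase uses.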

From: innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr] 
Sent: Friday, September 19, 2014 9:20 PM
To: user@spark.apache.org
Subject: RE: Bulk-load to HBase

Thank you for the example code.

Currently I use foreachPartition() + Put(), but your example code can be used 
to clean up my code.

BTW, since the data uploaded by Put() goes through the normal HBase write 
path, it can be slow.

So it would be nice if bulk-load could be used, since it bypasses the write 
path.

Thanks.

From: Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com] 
Sent: Friday, September 19, 2014 9:01 PM
To: innowireless TaeYun Kim
Cc: user
Subject: Re: Bulk-load to HBase

I have been using saveAsNewAPIHadoopDataset but I use TableOutputFormat instead 
of HFileOutputFormat. But, hopefully this should help you:

val hbaseZookeeperQuorum = s"$zookeeperHost:$zookeeperPort:$zookeeperHbasePath"

val conf = HBaseConfiguration.create()

conf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum)

conf.set(TableOutputFormat.QUORUM_ADDRESS, hbaseZookeeperQuorum)

conf.set(TableOutputFormat.QUORUM_PORT, zookeeperPort.toString)

conf.setClass("mapreduce.outputformat.class", 
classOf[TableOutputFormat[Object]], classOf[OutputFormat[Object, Writable]])

conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

val rddToSave: RDD[(Array[Byte], Array[Byte], Array[Byte])] = ... // Some RDD 
// that contains row key, column qualifier and data

val putRDD = rddToSave.map(tuple => {

    val (rowKey, column, data) = tuple

    val put: Put = new Put(rowKey)

    put.add(COLUMN_FAMILY_RAW_DATA_BYTES, column, data)

    (new ImmutableBytesWritable(rowKey), put)

})

putRDD.saveAsNewAPIHadoopDataset(conf)

On 19 September 2014 16:52, innowireless TaeYun Kim 
<taeyun....@innowireless.co.kr> wrote:

Hi,

Sorry, I just found saveAsNewAPIHadoopDataset.

Then, can I use HFileOutputFormat with saveAsNewAPIHadoopDataset? Is there any 
example code for that?

Thanks.

From: innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr] 
Sent: Friday, September 19, 2014 8:18 PM
To: user@spark.apache.org
Subject: RE: Bulk-load to HBase

Hi,

After reading several documents, it seems that saveAsHadoopDataset cannot use 
HFileOutputFormat.

That is because the saveAsHadoopDataset method takes a JobConf, so it belongs 
to the old Hadoop API, while HFileOutputFormat is a member of the mapreduce 
package, which is for the new Hadoop API.

Am I right?

If so, is there another method to bulk-load to HBase from RDD?

Thanks.

From: innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr] 
Sent: Friday, September 19, 2014 7:17 PM
To: user@spark.apache.org
Subject: Bulk-load to HBase

Hi,

Is there a way to bulk-load to HBase from RDD?

HBase offers the HFileOutputFormat class for bulk loading from a MapReduce 
job, but I cannot figure out how to use it with saveAsHadoopDataset.

Thanks.
