Please see http://hbase.apache.org/book.html#completebulkload

LoadIncrementalHFiles has a main() method.
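For reference, a minimal sketch of invoking that bulk-load step programmatically instead of through main() (this assumes the 0.98-era HBase client API; the table name and HFile directory are hypothetical):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

    // Hypothetical table name and HFile directory; the HFiles are assumed
    // to have been written already by a previous job.
    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, "my_table")
    val loader = new LoadIncrementalHFiles(conf)
    loader.doBulkLoad(new Path("/user/me/hfile-output"), table)
    table.close()

doBulkLoad moves (or, where necessary, splits) the generated HFiles into the table's regions, which is the same thing the completebulkload tool's main() does.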
On Fri, Sep 19, 2014 at 5:41 AM, Aniket Bhatnagar <aniket.bhatna...@gmail.com> wrote:

> Agreed that the bulk import would be faster. In my case, I wasn't
> expecting a lot of data to be uploaded to HBase, and I also didn't want
> the pain of importing the generated HFiles into HBase. Is there a way to
> invoke the HBase HFile import batch script programmatically?
>
> On 19 September 2014 17:58, innowireless TaeYun Kim
> <taeyun....@innowireless.co.kr> wrote:
>
>> In fact, it seems that a Put can be used by HFileOutputFormat, so the
>> Put object itself may not be the problem.
>>
>> The problem is that TableOutputFormat uses the Put object in the normal
>> way (which goes through the normal write path), while HFileOutputFormat
>> uses it to build the HFile directly.
>>
>> From: innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
>> Sent: Friday, September 19, 2014 9:20 PM
>> To: user@spark.apache.org
>> Subject: RE: Bulk-load to HBase
>>
>> Thank you for the example code.
>>
>> Currently I use foreachPartition() + Put() [see the sketch at the end
>> of this thread], but your example code can be used to clean up my code.
>>
>> BTW, since data uploaded with Put() goes through the normal HBase write
>> path, it can be slow. So it would be nice if bulk-load could be used,
>> since it bypasses the write path.
>>
>> Thanks.
>>
>> From: Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com]
>> Sent: Friday, September 19, 2014 9:01 PM
>> To: innowireless TaeYun Kim
>> Cc: user
>> Subject: Re: Bulk-load to HBase
>>
>> I have been using saveAsNewAPIHadoopDataset, but with TableOutputFormat
>> instead of HFileOutputFormat. Still, hopefully this helps:
>>
>> val hbaseZookeeperQuorum =
>>   s"$zookeeperHost:$zookeeperPort:$zookeeperHbasePath"
>> val conf = HBaseConfiguration.create()
>> conf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum)
>> conf.set(TableOutputFormat.QUORUM_ADDRESS, hbaseZookeeperQuorum)
>> conf.set(TableOutputFormat.QUORUM_PORT, zookeeperPort.toString)
>> conf.setClass("mapreduce.outputformat.class",
>>   classOf[TableOutputFormat[Object]],
>>   classOf[OutputFormat[Object, Writable]])
>> conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
>>
>> // Some RDD that contains row key, column qualifier and data
>> val rddToSave: RDD[(Array[Byte], Array[Byte], Array[Byte])] = ...
>>
>> val putRDD = rddToSave.map(tuple => {
>>   val (rowKey, column, data) = tuple
>>   val put: Put = new Put(rowKey)
>>   put.add(COLUMN_FAMILY_RAW_DATA_BYTES, column, data)
>>   (new ImmutableBytesWritable(rowKey), put)
>> })
>>
>> putRDD.saveAsNewAPIHadoopDataset(conf)
>>
>> On 19 September 2014 16:52, innowireless TaeYun Kim
>> <taeyun....@innowireless.co.kr> wrote:
>>
>> Hi,
>>
>> Sorry, I just found saveAsNewAPIHadoopDataset.
>> Then, can I use HFileOutputFormat with saveAsNewAPIHadoopDataset? Is
>> there any example code for that?
>>
>> Thanks.
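A minimal sketch of that combination, writing HFiles from an RDD with saveAsNewAPIHadoopFile and then bulk-loading them (this assumes the 0.98-era HBase API and a SparkContext named sc; the table name, column family, and output path are hypothetical):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat, LoadIncrementalHFiles}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job

    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, "my_table")  // hypothetical table name

    // Copies the table's compression/bloom/block settings into the job
    // config. (For MapReduce it also sets up a total-order partitioner
    // and reducer; here Spark does the sorting instead.)
    val job = Job.getInstance(conf)
    job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setMapOutputValueClass(classOf[KeyValue])
    HFileOutputFormat.configureIncrementalLoad(job, table)

    val family = Bytes.toBytes("cf")  // hypothetical column family
    // Same (row key, qualifier, value) shape as rddToSave above.
    val rddToSave = sc.parallelize(Seq(
      (Bytes.toBytes("row1"), Bytes.toBytes("q1"), Bytes.toBytes("v1")),
      (Bytes.toBytes("row2"), Bytes.toBytes("q1"), Bytes.toBytes("v2"))))

    // HFileOutputFormat requires cells in row-key order, so sort first.
    implicit val bytesOrdering = new Ordering[Array[Byte]] {
      def compare(a: Array[Byte], b: Array[Byte]) = Bytes.compareTo(a, b)
    }
    val hfileRDD = rddToSave
      .sortBy { case (rowKey, _, _) => rowKey }
      .map { case (rowKey, qualifier, value) =>
        (new ImmutableBytesWritable(rowKey),
         new KeyValue(rowKey, family, qualifier, value))
      }

    hfileRDD.saveAsNewAPIHadoopFile("/tmp/hfiles",
      classOf[ImmutableBytesWritable], classOf[KeyValue],
      classOf[HFileOutputFormat], job.getConfiguration)

    // Same effect as running the completebulkload tool on /tmp/hfiles.
    new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table)

One caveat: LoadIncrementalHFiles splits any HFile that spans a region boundary, so this works even without region-aligned partitioning, just more slowly on a pre-split table.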
>> From: innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
>> Sent: Friday, September 19, 2014 8:18 PM
>> To: user@spark.apache.org
>> Subject: RE: Bulk-load to HBase
>>
>> Hi,
>>
>> After reading several documents, it seems that saveAsHadoopDataset
>> cannot use HFileOutputFormat. That's because saveAsHadoopDataset takes a
>> JobConf and so belongs to the old Hadoop API, while HFileOutputFormat is
>> part of the mapreduce package, which is for the new Hadoop API.
>>
>> Am I right? If so, is there another way to bulk-load to HBase from an
>> RDD?
>>
>> Thanks.
>>
>> From: innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
>> Sent: Friday, September 19, 2014 7:17 PM
>> To: user@spark.apache.org
>> Subject: Bulk-load to HBase
>>
>> Hi,
>>
>> Is there a way to bulk-load to HBase from an RDD?
>> HBase offers the HFileOutputFormat class for bulk loading from a
>> MapReduce job, but I cannot figure out how to use it with
>> saveAsHadoopDataset.
>>
>> Thanks.
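And for completeness, a sketch of the foreachPartition() + Put() approach mentioned earlier in the thread, i.e. the non-bulk path that goes through the normal write pipeline (again assuming the 0.98-era API, the same rddToSave shape as above, and hypothetical table/family names):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    rddToSave.foreachPartition { partition =>
      // The configuration and table are created inside the closure, once
      // per partition, since they are not serializable.
      val conf = HBaseConfiguration.create()
      val table = new HTable(conf, "my_table")  // hypothetical table name
      table.setAutoFlush(false, false)          // buffer puts client-side
      val family = Bytes.toBytes("cf")          // hypothetical column family
      partition.foreach { case (rowKey, qualifier, value) =>
        val put = new Put(rowKey)
        put.add(family, qualifier, value)
        table.put(put)
      }
      table.flushCommits()  // send any puts still in the client buffer
      table.close()
    }

This is simpler than generating HFiles, but as noted above every Put goes through the region servers' normal write path (WAL + memstore), so it is the slower option for large volumes.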