Re: Splitting up an HBase Table into partitions

Michael Segel Wed, 18 Mar 2015 05:29:53 -0700

> On Mar 18, 2015, at 1:52 AM, Gokul Balakrishnan <[email protected]> wrote:
> 
> 
> 
> @Sean this was exactly what I was looking for. Based on the region
> boundaries, I should be able to create virtual groups of rows which can
> then be retrieved from the table (e.g. through a scan) on demand.
>


Huh? 

You don’t need to do this. 

It's already done for you by the existing APIs.

A scan will allow you to do either a full table scan (no range limits provided) 
or a range scan where you provide the boundaries. 
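
To make the two modes concrete, here is a minimal sketch in plain Java. It 
does not touch the HBase client at all: a sorted map stands in for a table 
(HBase also keeps rows sorted by row key), and the names ScanSketch, scan and 
sampleTable are made up for illustration. It assumes the usual scan 
convention of an inclusive start row and an exclusive stop row, with an empty 
bound meaning "no limit on that side".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for a table: rows kept sorted by row key.
// String keys are used for readability; for plain ASCII keys their order
// matches the byte-wise order HBase uses for row keys.
final class ScanSketch {

    // Returns the row keys a scan would visit, in key order.
    // startRow is inclusive, stopRow is exclusive; "" means unbounded.
    static List<String> scan(NavigableMap<String, String> table,
                             String startRow, String stopRow) {
        // An empty or inverted range matches nothing.
        if (!startRow.isEmpty() && !stopRow.isEmpty()
                && startRow.compareTo(stopRow) >= 0) {
            return Collections.emptyList();
        }
        NavigableMap<String, String> view = table;
        if (!startRow.isEmpty()) {
            view = view.tailMap(startRow, true);   // keep keys >= startRow
        }
        if (!stopRow.isEmpty()) {
            view = view.headMap(stopRow, false);   // keep keys < stopRow
        }
        return new ArrayList<>(view.keySet());
    }

    // Five sample rows, row1 through row5.
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 1; i <= 5; i++) {
            table.put("row" + i, "value" + i);
        }
        return table;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = sampleTable();
        // Full table scan: no range limits provided.
        System.out.println(scan(table, "", ""));          // [row1, row2, row3, row4, row5]
        // Range scan: boundaries provided.
        System.out.println(scan(table, "row2", "row4"));  // [row2, row3]
    }
}
```

With the real client you would set the start and stop rows on a Scan object 
instead of slicing a map, but the inclusive/exclusive behaviour is the same 
idea.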

So if you’re using a client connection to HBase, it's done for you.

If you’re writing an M/R job, you already get one mapper task assigned per 
region, so your parallelism is already done for you.

It's possible that the input format is smart enough to pre-check the regions 
against the scan boundaries and to skip generating a mapper task for any 
region that falls entirely outside them.
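
A minimal sketch of what such a pre-check could look like, in plain Java. 
This is not HBase's own code; RegionSplitSketch and its methods are 
hypothetical names. It assumes each region covers [startKey, endKey), that an 
empty start or end key means the region is unbounded on that side, and that 
an empty scan boundary means "no limit".

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decides which regions need a split (one mapper each)
// for a scan over [scanStart, scanStop).
final class RegionSplitSketch {

    // Convenience for building keys from ASCII strings.
    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Lexicographic comparison that treats each byte as unsigned.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // True if the region [regionStart, regionEnd) can hold rows of the scan.
    static boolean overlaps(byte[] regionStart, byte[] regionEnd,
                            byte[] scanStart, byte[] scanStop) {
        boolean scanStartsBeforeRegionEnds = scanStart.length == 0
                || regionEnd.length == 0
                || compare(scanStart, regionEnd) < 0;
        boolean scanStopsAfterRegionStarts = scanStop.length == 0
                || compare(scanStop, regionStart) > 0;
        return scanStartsBeforeRegionEnds && scanStopsAfterRegionStarts;
    }

    // Returns the indexes of the regions that should get a split.
    static List<Integer> regionsToScan(byte[][] startKeys, byte[][] endKeys,
                                       byte[] scanStart, byte[] scanStop) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < startKeys.length && i < endKeys.length; i++) {
            if (overlaps(startKeys[i], endKeys[i], scanStart, scanStop)) {
                selected.add(i);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Three regions: ["", "g"), ["g", "p"), ["p", "").
        byte[][] starts = { key(""), key("g"), key("p") };
        byte[][] ends = { key("g"), key("p"), key("") };
        // Full scan: every region gets a split.
        System.out.println(regionsToScan(starts, ends, key(""), key("")));   // [0, 1, 2]
        // Scan of ["h", "k") only touches the middle region.
        System.out.println(regionsToScan(starts, ends, key("h"), key("k"))); // [1]
    }
}
```

The startKeys and endKeys arrays have the same shape as what 
RegionLocator.getStartEndKeys() returns in Sean's snippet quoted below, so 
the same check can be applied to real region boundaries.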

HTH

-Mike

> Thanks everyone for your help.
> 
> On 18 March 2015 at 00:57, Sean Busbey <[email protected]> wrote:
> 
>> You should ask for a RegionLocator if you want to know the boundaries of
>> all the regions in a table
>> 
>> 
>> final Connection connection = ConnectionFactory.createConnection(config);
>> try {
>>   // The locator exposes region boundaries without scanning any rows.
>>   final RegionLocator locator =
>>       connection.getRegionLocator(TableName.valueOf("myTable"));
>>   final Pair<byte[][], byte[][]> startEndKeys = locator.getStartEndKeys();
>>   final byte[][] startKeys = startEndKeys.getFirst();
>>   final byte[][] endKeys = startEndKeys.getSecond();
>>   // Entry i of each array describes region i.
>>   for (int i = 0; i < startKeys.length && i < endKeys.length; i++) {
>>     System.out.println("Region " + i + " starts at '" +
>>         Bytes.toStringBinary(startKeys[i]) + "' and ends at '" +
>>         Bytes.toStringBinary(endKeys[i]) + "'");
>>   }
>> } finally {
>>   connection.close();
>> }
>> 
>> 
>> There are other methods in RegionLocator if you need other details.
>> 
>> On Tue, Mar 17, 2015 at 2:09 PM, Gokul Balakrishnan <[email protected]>
>> wrote:
>> 
>>> Hi Michael,
>>> 
>>> Thanks for the reply. Yes, I do realise that HBase has regions, perhaps my
>>> usage of the term partitions was misleading. What I'm looking for is
>>> exactly what you've mentioned - a means of creating splits based on
>>> regions, without having to iterate over all rows in the table through the
>>> client API. Do you have any idea how I might achieve this?
>>> 
>>> Thanks,
>>> 
>>> On Tuesday, March 17, 2015, Michael Segel <[email protected]>
>>> wrote:
>>> 
>>>> HBase doesn't have partitions.  It has regions.
>>>> 
>>>> The split occurs against the regions so that if you have n regions, you
>>>> have n splits.
>>>> 
>>>> Please don't confuse partitions and regions because they are not the same
>>>> or synonymous.
>>>> 
>>>>> On Mar 17, 2015, at 7:30 AM, Gokul Balakrishnan <[email protected]> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> My requirement is to partition an HBase Table and return a group of
>>>>> records (i.e. rows having a specific format) without having to iterate
>>>>> over all of its rows. These partitions (which should ideally be along
>>>>> regions) will eventually be sent to Spark but rather than use the HBase
>>>>> or Hadoop RDDs directly, I'll be using a custom RDD which recognizes
>>>>> partitions as the aforementioned group of records.
>>>>> 
>>>>> I was looking at achieving this through creating InputSplits through
>>>>> TableInputFormat.getSplits(), as being done in the HBase RDD [1] but I
>>>>> can't figure out a way to do this without having access to the mapred
>>>>> context etc.
>>>>> 
>>>>> Would greatly appreciate if someone could point me in the right
>>>>> direction.
>>>>> 
>>>>> [1] https://github.com/tmalaska/SparkOnHBase/blob/master/src/main/scala/com/cloudera/spark/hbase/HBaseScanRDD.scala
>>>>> 
>>>>> Thanks,
>>>>> Gokul
>>>> 
>>>> The opinions expressed here are mine, while they may reflect a cognitive
>>>> thought, that is purely accidental.
>>>> Use at your own risk.
>>>> Michael Segel
>>>> michael_segel (AT) hotmail.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Sean
>> 

