@Mikhail I wanted to split the table into groups of rows, but did not want
to initialize a scan and go over all rows and group them into batches in
the client code. In other words, I'm looking for a way to divide the rows
in the table and merely maintain the boundary information of each division
rather than actually populate them at the time of creation.

@Shahab yes, the row key ranges for the splits are not known in advance,
which is why I was looking at retrieving the region information of the
table and creating the groupings that way.

@Sean this was exactly what I was looking for. Based on the region
boundaries, I should be able to create virtual groups of rows which can
then be retrieved from the table (e.g. through a scan) on demand.
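
To make the idea concrete, here is a minimal sketch of such virtual groups in plain Java, assuming the start and end keys have already been fetched (for example via the RegionLocator call quoted below). The names RowGroups, RowGroup, fromRegionKeys, indexOf and key are hypothetical helpers of my own, not part of the HBase API. It relies on two HBase conventions: row keys sort as unsigned bytes, and an empty key marks the open start or end of the table.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of HBase: keeps only region boundaries. */
final class RowGroups {

    /** One virtual group: the half-open key range [startKey, endKey). */
    static final class RowGroup {
        final byte[] startKey; // inclusive; empty means "from the start of the table"
        final byte[] endKey;   // exclusive; empty means "to the end of the table"

        RowGroup(byte[] startKey, byte[] endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }

        /** True if rowKey sorts inside this range, comparing bytes as unsigned. */
        boolean contains(byte[] rowKey) {
            boolean afterStart = startKey.length == 0
                    || Arrays.compareUnsigned(rowKey, startKey) >= 0;
            boolean beforeEnd = endKey.length == 0
                    || Arrays.compareUnsigned(rowKey, endKey) < 0;
            return afterStart && beforeEnd;
        }
    }

    /** Pairs up the parallel start/end key arrays, one group per region. */
    static List<RowGroup> fromRegionKeys(byte[][] startKeys, byte[][] endKeys) {
        if (startKeys.length != endKeys.length) {
            throw new IllegalArgumentException("start/end key arrays differ in length");
        }
        List<RowGroup> groups = new ArrayList<>();
        for (int i = 0; i < startKeys.length; i++) {
            groups.add(new RowGroup(startKeys[i], endKeys[i]));
        }
        return groups;
    }

    /** Index of the group containing rowKey, or -1 if no group matches. */
    static int indexOf(List<RowGroup> groups, byte[] rowKey) {
        for (int i = 0; i < groups.size(); i++) {
            if (groups.get(i).contains(rowKey)) {
                return i;
            }
        }
        return -1;
    }

    /** Convenience for the example: UTF-8 bytes of a string key. */
    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Made-up boundaries for a three-region table: [, g), [g, p), [p, ).
        byte[][] starts = { key(""), key("g"), key("p") };
        byte[][] ends = { key("g"), key("p"), key("") };
        List<RowGroup> groups = fromRegionKeys(starts, ends);

        System.out.println(indexOf(groups, key("apple"))); // 0
        System.out.println(indexOf(groups, key("g")));     // 1
        System.out.println(indexOf(groups, key("zebra"))); // 2
    }
}
```

No rows are read when the groups are created. Each group's start and end key can later be used to bound a scan, so the rows are only fetched when that group is actually needed.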

Thanks everyone for your help.

On 18 March 2015 at 00:57, Sean Busbey <[email protected]> wrote:

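> You should ask for a RegionLocator if you want to know the boundaries of
> all the regions in a table:
>
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.client.RegionLocator;
> import org.apache.hadoop.hbase.util.Bytes;
> import org.apache.hadoop.hbase.util.Pair;
>
> // config is an existing Configuration, e.g. from HBaseConfiguration.create()
> final Connection connection = ConnectionFactory.createConnection(config);
> try {
>   final RegionLocator locator =
>       connection.getRegionLocator(TableName.valueOf("myTable"));
>   // Parallel arrays: startKeys[i] and endKeys[i] bound region i
>   final Pair<byte[][], byte[][]> startEndKeys = locator.getStartEndKeys();
>   final byte[][] startKeys = startEndKeys.getFirst();
>   final byte[][] endKeys = startEndKeys.getSecond();
>   for (int i = 0; i < startKeys.length && i < endKeys.length; i++) {
>     System.out.println("Region " + i + " starts at '" +
>         Bytes.toStringBinary(startKeys[i]) + "' and ends at '" +
>         Bytes.toStringBinary(endKeys[i]) + "'");
>   }
> } finally {
>   // Closing the connection releases its underlying resources
>   connection.close();
> }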
>
>
> There are other methods in RegionLocator if you need other details.
>
> On Tue, Mar 17, 2015 at 2:09 PM, Gokul Balakrishnan <[email protected]>
> wrote:
>
> > Hi Michael,
> >
> > Thanks for the reply. Yes, I do realise that HBase has regions, perhaps my
> > usage of the term partitions was misleading. What I'm looking for is
> > exactly what you've mentioned - a means of creating splits based on
> > regions, without having to iterate over all rows in the table through the
> > client API. Do you have any idea how I might achieve this?
> >
> > Thanks,
> >
> > On Tuesday, March 17, 2015, Michael Segel <[email protected]>
> > wrote:
> >
> > > HBase doesn't have partitions.  It has regions.
> > >
> > > The split occurs against the regions so that if you have n regions, you
> > > have n splits.
> > >
> > > Please don't confuse partitions and regions because they are not the same
> > > or synonymous.
> > >
> > > > On Mar 17, 2015, at 7:30 AM, Gokul Balakrishnan <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > My requirement is to partition an HBase Table and return a group of records
> > > > (i.e. rows having a specific format) without having to iterate over all of
> > > > its rows. These partitions (which should ideally be along regions) will
> > > > eventually be sent to Spark but rather than use the HBase or Hadoop RDDs
> > > > directly, I'll be using a custom RDD which recognizes partitions as the
> > > > aforementioned group of records.
> > > >
> > > > I was looking at achieving this through creating InputSplits through
> > > > TableInputFormat.getSplits(), as being done in the HBase RDD [1] but I
> > > > can't figure out a way to do this without having access to the mapred
> > > > context etc.
> > > >
> > > > Would greatly appreciate if someone could point me in the right direction.
> > > >
> > > > [1] https://github.com/tmalaska/SparkOnHBase/blob/master/src/main/scala/com/cloudera/spark/hbase/HBaseScanRDD.scala
> > > >
> > > > Thanks,
> > > > Gokul
> > >
> > > The opinions expressed here are mine, while they may reflect a cognitive
> > > thought, that is purely accidental.
> > > Use at your own risk.
> > > Michael Segel
> > > michael_segel (AT) hotmail.com
>
> --
> Sean
>
