Your needs (and use case?) look a lot like the local secondary index work
happening around Phoenix.

On Wed, Apr 8, 2015 at 11:50 AM, Anoop John <[email protected]> wrote:

> bq.while the region can surely split when more data added-on, but can HBase
> keep the new regions still on the same regionServer according to the
> predefined boundary?
>
> You need a custom LB for that. If there is one, it is possible to restrict
>
> -Anoop-
>
>
> On Thu, Apr 9, 2015 at 12:09 AM, Demai Ni <[email protected]> wrote:
>
> > hi, Guys,
> >
> > many thanks for your quick response.
> >
> > First, let me share what I am looking at, which may help to clarify the
> > intention and answer a few of the questions. I am working on a POC to bring
> > MPP-style OLAP to Hadoop, and checking whether it is feasible to have
> > HBase as the datastore. With HBase, I'd like to take advantage of 1) OLTP
> > capability; 2) many filters; 3) in-cluster replicas and between-cluster
> > replication. I am currently using the TPC-H schema for this POC, and am also
> > considering a star schema. Since it is a POC, I can pretty much define my
> > rules and set limitations as it fits. :-)
> >
> > Why doesn't this(presplit) work for you?
> >
> >  The reason is that presplit won't guarantee the regions stay on the
> > pre-assigned regionServer. Let's say I have a very large table and a very
> > small table with different data distributions, even with the same presplit
> > values. HBase won't ensure the same range of data is located on the same
> > physical node, unless we have the custom LB mentioned by @Anoop and
> > @esteban. Is my understanding correct? BTW, I will look into HBASE-10576 to
> > see whether it fits my needs.
> >
> > Is your table static?
> > >
> > While I can make it static for POC purposes, and I will use this
> > limitation, I'd like HBase for its OLTP feature. So besides the 'static'
> > HFiles, I need the HLogs on the same local node too. But again, I will only
> > worry about the 'static' HFiles for now.
> >
> > However as you add data to the table, those regions will eventually
> split.
> >
> >  While a region can surely split as more data is added, can HBase
> > keep the new regions on the same regionServer according to the
> > predefined boundary? I will worry about the hotspot issue later. That is the
> > beauty of doing a POC instead of production. :-)
> >
> > What you’re suggesting is that as you do a region scan, you’re going to
> the
> > > other table and then try to fetch a row if it exists.
> > >
> > Yes, something like that. I am currently using the client API: scan() with
> > start and end keys. Since I know my start and end keys, and with the
> > local-read feature, the scan should be a local read. With some
> > statistics (such as which table is larger) and a hash join
> > operation (which I need to implement), the join will work with not-too-bad
> > performance. Again, it is a POC, so I won't worry about the situation where
> > a regionServer hosts too much data (hotspot). But surely, an LB should be
> > used before putting this into production, if that ever happens.
> >
> > either the second table should be part of the first table in the same CF
> or
> > > as a separate CF
> > >
> > I am not sure whether it will work for the case of a large table vs a
> > small table. The data of the small table has to be duplicated in many
> > places, and an update of the small table can be costly.
> >
> > Demai
> >
> >
> > On Wed, Apr 8, 2015 at 10:24 AM, Esteban Gutierrez <[email protected]
> >
> > wrote:
> >
> > > +1 Anoop.
> > >
> > > That's pretty much the only way right now if you need custom balancing.
> > > This balancer doesn't have to live in the HMaster and can be invoked
> > > externally (there are caveats to doing that when an RS dies, but it works
> > > OK so far). A long-term solution for the problem you are trying to solve
> > > is HBASE-10576, by tweaking it a little.
> > >
> > > cheers,
> > > esteban.
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Cloudera, Inc.
> > >
> > >
> > > On Wed, Apr 8, 2015 at 4:41 AM, Michael Segel <
> [email protected]
> > >
> > > wrote:
> > >
> > > > Is your table static?
> > > >
> > > > If you know your data and your ranges, you can do it. However, as you
> > > > add data to the table, those regions will eventually split.
> > > >
> > > > The other issue that you brought up is that you want to do ‘local’ joins.
> > > >
> > > > Simple single word response… don’t.
> > > >
> > > > Longer response..
> > > >
> > > > You’re suggesting that the tables in question share the row key in
> > > > common.  Ok… why? Are they part of the same record?
> > > > How is the data normally being used?
> > > >
> > > > Have you looked at column families?
> > > >
> > > > The issue is that joins are expensive. What you’re suggesting is that as
> > > > you do a region scan, you’re going to the other table and then try to
> > > > fetch a row if it exists.
> > > > So it's essentially, for each row in the scan, try a get(), which will
> > > > almost double the cost of your fetch. Then you have to decide how to do
> > > > it locally. Are you really going to write a coprocessor for this? (Hint:
> > > > If this is a common thing, then either the second table should be part
> > > > of the first table in the same CF or as a separate CF. You need to
> > > > rethink your schema.)
> > > >
> > > > Does this make sense?
> > > >
> > > > > On Apr 7, 2015, at 7:05 PM, Demai Ni <[email protected]> wrote:
> > > > >
> > > > > hi, folks,
> > > > >
> > > > > I have a question about region assignment and would like to clarify
> > > > > some thoughts.
> > > > >
> > > > > Let's say I have a table with rowkeys "row00000 ~ row30000" on a
> > > > > 4-node HBase cluster. Is there a way to keep data partitioned by range
> > > > > on each node? For example:
> > > > >
> > > > > node1:  <=row10000
> > > > > node2:  row10001~row20000
> > > > > node3:  row20001~row30000
> > > > > node4:  >row30000
> > > > >
> > > > > And even when one of the nodes becomes a hotspot, the boundary won't
> > > > > be crossed unless a load balancing is done manually?
> > > > >
> > > > > I looked at presplit: { SPLITS => ['row100','row200','row300'] } ,
> > but
> > > > > don't think it serves this purpose.
> > > > >
> > > > > BTW, a bit of background. I am thinking of doing a local join between
> > > > > two tables if both have the same rowkey and are partitioned by range
> > > > > (or the same hash algorithm). If I can keep the join key on the same
> > > > > node (aka regionServer), the join can be handled locally instead of
> > > > > being broadcast to all other nodes.
> > > > >
> > > > > Thanks for your input. A couple of pointers to blogs/presentations
> > > > > would be appreciated.
> > > > >
> > > > > Demai
> > > >
> > > > The opinions expressed here are mine, while they may reflect a
> > cognitive
> > > > thought, that is purely accidental.
> > > > Use at your own risk.
> > > > Michael Segel
> > > > michael_segel (AT) hotmail.com
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
>
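
For readers following the thread, the co-partitioned "local join" idea discussed above can be sketched roughly as below. This is only an illustration in plain Python, not HBase code: it assumes fixed-width rowkeys (so string comparison matches the intended order), unique rowkeys in the smaller table, and the node boundaries from the original question. The names BOUNDS, NODES, node_for, partition, local_hash_join and copartitioned_join are all hypothetical, and nothing here shows how HBase would actually pin regions to servers (that is what the custom LB is for).

```python
from bisect import bisect_left

# Hypothetical inclusive upper bounds for node1..node3, taken from the
# example in the original question; node4 holds everything above row30000.
BOUNDS = ["row10000", "row20000", "row30000"]
NODES = ["node1", "node2", "node3", "node4"]


def node_for(rowkey):
    """Return the node whose key range contains rowkey.

    Assumes fixed-width keys so lexicographic order equals numeric order.
    """
    return NODES[bisect_left(BOUNDS, rowkey)]


def partition(rows):
    """Group (rowkey, value) pairs by the node that owns the rowkey."""
    parts = {node: [] for node in NODES}
    for key, value in rows:
        parts[node_for(key)].append((key, value))
    return parts


def local_hash_join(small, large):
    """Join two row lists that live on the same node.

    Builds a hash table on the smaller side and probes it with the larger
    side. Assumes rowkeys are unique within the smaller side.
    """
    build = dict(small)
    return [(key, build[key], value) for key, value in large if key in build]


def copartitioned_join(small_rows, large_rows):
    """Join two tables node by node; no row is compared across nodes."""
    small_parts = partition(small_rows)
    large_parts = partition(large_rows)
    joined = []
    for node in NODES:
        joined.extend(local_hash_join(small_parts[node], large_parts[node]))
    return joined
```

Because both tables use the same boundaries, each node only needs its own slice of either table, which is the property the thread is asking HBase to preserve across splits and rebalancing.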
