Thank you for your response. However, I'm still having a hard time understanding you. Apologies for this.
So, this is where I think I'm getting confused. Let's talk about the
original rowkey, before anything has been prepended to it; call this
original_rowkey. Say your first original_rowkey is 1000 and your second
original_rowkey is 1001. Say you have a hashing function called f(), and
you have 20 regions. Does a monotonically increasing original_rowkey
guarantee a monotonically increasing return value from f()? I did not
think that was the case. To my knowledge, f(1001) % 20 is not guaranteed
to be larger than f(1000) % 20.

Now, let's talk about the rowkey I'm actually going to use when I insert
the row into HBase: the original_rowkey with f(x) % 20 prepended to it.
Call this ultimate_rowkey. Since ultimate_rowkey is just original_rowkey
with f(x) % 20 prepended, and f(x) % 20 does not increase monotonically,
why would I be seeing the behavior you describe?

--Jeremy

On Wed, May 6, 2015 at 10:03 PM, Michael Segel <[email protected]> wrote:

> Jeremy,
>
> I think you have to be careful in how you say things.
> While over time you're going to get an even distribution, the hash isn't
> random. It's consistent: hash(x) = y, and it will always be the same.
> You're taking the modulus to create 1 to n buckets.
>
> In each bucket, your new key is n_rowkey, where rowkey is the original
> row key.
>
> Remember that the rowkey is growing sequentially:
> rowkey(n) < rowkey(n+1) < ... < rowkey(n+k)
>
> So if you hash, take its modulus, and prepend it, you will still have
> X_rowkey(n), X_rowkey(n+k), ...
>
> All you have is N sequential lists. And again, with a sequential list
> you're adding to the right, so when you split, the top section is never
> going to get new rows.
>
> I think you need to create a list and try this with 3 or 4 buckets and
> you'll start to see what happens.
>
> The last region fills, but after it splits, the top half is static. The
> new rows are added to the bottom half only.
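[Editor's note: the two claims above can both be checked with a short
sketch. This is toy Python, not HBase client code; md5 stands in for the
thread's unspecified f(), and all names here are illustrative.]

```python
import hashlib
from collections import defaultdict

N_REGIONS = 20

def f(original_rowkey: int) -> int:
    # stand-in for the thread's hashing function f(); md5 chosen arbitrarily
    return int(hashlib.md5(str(original_rowkey).encode()).hexdigest(), 16)

def ultimate_rowkey(k: int) -> str:
    # f(x) % 20 prepended as a fixed-width prefix, e.g. "07_1000"
    return f"{f(k) % N_REGIONS:02d}_{k}"

# Jeremy's point: the bucket sequence is NOT monotonic in the original key.
buckets = [f(k) % N_REGIONS for k in range(1000, 1020)]
print(buckets)

# Michael's point: group the keys by bucket, and each bucket is still a
# sequential list -- every new key is appended at the tail of its bucket.
groups = defaultdict(list)
for k in range(1000, 2000):
    groups[f(k) % N_REGIONS].append(ultimate_rowkey(k))
for ks in groups.values():
    assert ks == sorted(ks)  # insertion order == sorted order within a bucket
```

Both observations hold at once: the prefix order is scrambled across
buckets, yet inside each bucket every insert still lands at the tail of
that bucket's key range.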
> This is a problem with sequential keys that you have to learn to live
> with. It's not a killer issue, but something you need to be aware of...
>
> > On May 6, 2015, at 4:00 PM, jeremy p <[email protected]> wrote:
> >
> > Thank you for the explanation, but I'm a little confused. The key will
> > be monotonically increasing, but the hash of that key will not be.
> >
> > So, even though your original keys may look like: 1_foobar, 2_foobar,
> > 3_foobar
> > After the hashing, they'd look more like: 349000_1_foobar,
> > 999999_2_foobar, 000001_3_foobar
> >
> > With five regions, the original key ranges for your regions would look
> > something like: 000000-199999, 200000-399999, 400000-599999,
> > 600000-799999, 800000-999999
> >
> > So let's say you add another row, and it causes a split. Now your
> > regions look like: 000000-199999, 200000-399999, 400000-599999,
> > 600000-799999, 800000-899999, 900000-999999
> >
> > Since the value you are prepending to your keys is essentially random,
> > I don't see why your regions would only fill halfway. A new, hashed
> > key would be just as likely to fall within 800000-899999 as it would
> > be to fall within 900000-999999.
> >
> > Are we working from different assumptions?
> >
> > On Tue, May 5, 2015 at 4:46 PM, Michael Segel <[email protected]> wrote:
> >
> >> Yes, what you described, mod(hash(rowkey), n) where n is the number
> >> of regions, will remove the hotspotting issue.
> >>
> >> However, if your key is sequential, you will only have regions half
> >> full post region split.
> >>
> >> Look at it this way...
> >>
> >> If I have a key that is a sequential count 1, 2, 3, 4, 5 ... I am
> >> always adding a new row to the last region, and it's always being
> >> added to the right (reading left to right). Always at the end of the
> >> line...
> >>
> >> So if I have 10,000 rows and I split the region... region 1 has 0 to
> >> 4,999 and region 2 has 5,000 to 10,000.
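[Editor's note: the half-full-region effect is easy to see in a toy
simulation. Assumed model: a region splits at its midpoint once it holds
a fixed number of rows; real HBase splits on store-file size, but the
shape of the result is the same.]

```python
SPLIT_AT = 10          # split a region once it holds this many rows

regions = [[]]         # each region is a list of keys, kept in key order
for key in range(100): # monotonically increasing rowkeys
    regions[-1].append(key)           # a new key always lands in the tail region
    if len(regions[-1]) >= SPLIT_AT:
        full = regions.pop()
        mid = len(full) // 2
        regions.append(full[:mid])    # lower half: never receives a row again
        regions.append(full[mid:])    # upper half: takes all future inserts

sizes = [len(r) for r in regions]
print(sizes)  # every region sits at SPLIT_AT // 2 rows -- half the threshold
```

With a sequential key, every split strands its lower half at half the
split threshold forever; only the tail region ever grows.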
> >> Now my next row is 10001, the following is 10002... so they will be
> >> added at the tail end of region 2 until it splits. (And so on, and so
> >> on...)
> >>
> >> If you take a modulus of the hash, you create n buckets. Again, for
> >> each bucket... I will still be adding a new, larger number, so it
> >> will be added to the right-hand side, or tail, of the list.
> >>
> >> Once a region is split... that's it.
> >>
> >> Bucketing will solve the hot spotting issue by creating n lists of
> >> rows, but you're still always adding to the end of the list.
> >>
> >> Does that make sense?
> >>
> >>> On May 5, 2015, at 10:04 AM, jeremy p <[email protected]> wrote:
> >>>
> >>> Thank you for your response!
> >>>
> >>> So I guess 'salt' is a bit of a misnomer. What I used to do is this:
> >>>
> >>> 1) Say that my key value is something like '1234foobar'
> >>> 2) I obtain the hash of '1234foobar'. Let's say that's '54824923'
> >>> 3) I mod the hash by my number of regions. Let's say I have 2000
> >>> regions. 54824923 % 2000 = 923
> >>> 4) I prepend that value to my original key value, so my new key is
> >>> '923_1234foobar'
> >>>
> >>> Is this the same thing you were talking about?
> >>>
> >>> A couple of questions:
> >>>
> >>> * Why would my regions only be 1/2 full?
> >>> * Why would I only use this for sequential keys? I would think this
> >>> would give better performance in any situation where I don't need
> >>> range scans. For example, let's say my key value is a person's last
> >>> name. That will naturally cluster around certain letters, giving me
> >>> an uneven distribution.
> >>>
> >>> --Jeremy
> >>>
> >>> On Sun, May 3, 2015 at 11:46 AM, Michael Segel <[email protected]> wrote:
> >>>
> >>>> Yes, don't use a salt. Salt implies that your seed is orthogonal
> >>>> (read: random) to the base table row key.
> >>>> You're better off using a truncated hash (md5 is fastest) so that
> >>>> at least you can use a single get().
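[Editor's note: Jeremy's four steps can be sketched directly. Toy Python,
not HBase client code; the thread never names the hash function, so md5
is an assumption here, and the exact prefix below will differ from the
'923' in the example.]

```python
import hashlib

N_REGIONS = 2000

def salted_key(original: str) -> str:
    h = int(hashlib.md5(original.encode()).hexdigest(), 16)  # step 2: hash it
    bucket = h % N_REGIONS                                   # step 3: mod by region count
    return f"{bucket:04d}_{original}"                        # step 4: prepend, zero-padded

key = salted_key("1234foobar")   # step 1: the original key value
print(key)                       # the exact prefix depends on the hash chosen
```

Zero-padding the prefix matters: without a fixed width, '9_x' would sort
after '100_x' lexicographically and the bucket boundaries would be wrong.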
> >>>> Common?
> >>>>
> >>>> Only if your row key is mostly sequential.
> >>>>
> >>>> Note that even with bucketing, you will still end up with regions
> >>>> only 1/2 full, the only exception being the last region.
> >>>>
> >>>>> On May 1, 2015, at 11:09 AM, jeremy p <[email protected]> wrote:
> >>>>>
> >>>>> Hello all,
> >>>>>
> >>>>> I've been out of the HBase world for a while, and I'm just now
> >>>>> jumping back in.
> >>>>>
> >>>>> As of HBase .94, it was still common to take a hash of your RowKey
> >>>>> and use that to "salt" the beginning of your RowKey to obtain an
> >>>>> even distribution among your region servers. Is this still a
> >>>>> common practice, or is there a better way to do this in HBase 1.0?
> >>>>>
> >>>>> --Jeremy
> >>>>
> >>>> The opinions expressed here are mine; while they may reflect a
> >>>> cognitive thought, that is purely accidental.
> >>>> Use at your own risk.
> >>>> Michael Segel
> >>>> michael_segel (AT) hotmail.com
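[Editor's note: Michael's salt-vs-hash distinction can be sketched as
follows. Plain Python rather than the HBase client API; the dict below
is a stand-in for the table, and all names are illustrative.]

```python
import hashlib
import random

def hash_prefixed(key: str) -> str:
    # truncated md5 prefix: recomputable from the key alone
    return hashlib.md5(key.encode()).hexdigest()[:2] + "_" + key

def random_salted(key: str, buckets: int = 16) -> str:
    # a true salt: orthogonal (random) with respect to the key
    return f"{random.randrange(buckets):02d}_{key}"

table = {}                                # stand-in for the HBase table
table[hash_prefixed("row-1000")] = "v"    # write path

# read path with a hash prefix: recompute it and issue ONE get()
assert table[hash_prefixed("row-1000")] == "v"

# read path with a random salt: the prefix cannot be recomputed from the
# key, so a reader would have to probe all `buckets` candidate rows.
```

That recomputability is why Michael prefers the truncated hash: both
schemes spread writes, but only the hash prefix preserves cheap point
reads.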
