Thank you for your response. However, I'm still having a hard time understanding you. Apologies for this.
So, this is where I think I'm getting confused. Let's talk about the
original rowkey, before anything has been prepended to it; call this
original_rowkey. Say your first original_rowkey is 1000 and your second
original_rowkey is 1001. Say you have a hashing function called f(), and
you have 20 regions. Does a monotonically increasing original_rowkey
guarantee a monotonically increasing return value from f()? I did not
think that was the case. To my knowledge, f(1001) % 20 is not guaranteed
to be larger than f(1000) % 20.

Now, let's talk about the rowkey I'm actually going to use when I insert
the row into HBase: the original_rowkey with f(x) % 20 prepended to it.
Call this ultimate_rowkey. Since ultimate_rowkey is just original_rowkey
with f(x) % 20 prepended, and f(x) % 20 does not increase monotonically,
why would I be seeing the behavior you describe?

--Jeremy

On Wed, May 6, 2015 at 10:03 PM, Michael Segel <[email protected]> wrote:

> Jeremy,
>
> I think you have to be careful in how you say things.
> While over time you're going to get an even distribution, the hash isn't
> random. It's consistent: hash(x) = y, and it will always be the same.
> You're taking the modulus to create 1 to n buckets.
>
> In each bucket, your new key is n_rowkey, where rowkey is the original
> row key.
>
> Remember that the rowkey is growing sequentially:
> rowkey(n) < rowkey(n+1) < ... < rowkey(n+k)
>
> So if you hash, take its modulus, and prepend it, you will still have
> X_rowkey(n), X_rowkey(n+k), ...
>
> All you have is N sequential lists. And again, with a sequential list
> you're adding to the right, so when you split, the top section is never
> going to get new rows.
>
> I think you need to create a list and try this with 3 or 4 buckets and
> you'll start to see what happens.
>
> The last region fills, but after it splits, the top half is static. The
> new rows are added to the bottom half only.
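[Editor's note: the two claims above can both be checked with a short
sketch. This is toy Python, not HBase client code; md5 stands in for the
thread's unspecified f(), and all names here are illustrative.]

```python
import hashlib
from collections import defaultdict

N_REGIONS = 20

def f(original_rowkey: int) -> int:
    # stand-in for the thread's hashing function f(); md5 chosen arbitrarily
    return int(hashlib.md5(str(original_rowkey).encode()).hexdigest(), 16)

def ultimate_rowkey(k: int) -> str:
    # f(x) % 20 prepended as a fixed-width prefix, e.g. "07_1000"
    return f"{f(k) % N_REGIONS:02d}_{k}"

# Jeremy's point: the bucket sequence is NOT monotonic in the original key.
buckets = [f(k) % N_REGIONS for k in range(1000, 1020)]
print(buckets)

# Michael's point: group the keys by bucket, and each bucket is still a
# sequential list -- every new key is appended at the tail of its bucket.
groups = defaultdict(list)
for k in range(1000, 2000):
    groups[f(k) % N_REGIONS].append(ultimate_rowkey(k))
for ks in groups.values():
    assert ks == sorted(ks)  # insertion order == sorted order within a bucket
```

Both observations hold at once: the prefix order is scrambled across
buckets, yet inside each bucket every insert still lands at the tail of
that bucket's key range.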
> This is a problem with sequential keys that you have to learn to live
> with. It's not a killer issue, but something you need to be aware of...
>
> > On May 6, 2015, at 4:00 PM, jeremy p <[email protected]> wrote:
> >
> > Thank you for the explanation, but I'm a little confused. The key will
> > be monotonically increasing, but the hash of that key will not be.
> >
> > So, even though your original keys may look like: 1_foobar, 2_foobar,
> > 3_foobar
> > After the hashing, they'd look more like: 349000_1_foobar,
> > 999999_2_foobar, 000001_3_foobar
> >
> > With five regions, the original key ranges for your regions would look
> > something like: 000000-199999, 200000-399999, 400000-599999,
> > 600000-799999, 800000-999999
> >
> > So let's say you add another row, and it causes a split. Now your
> > regions look like: 000000-199999, 200000-399999, 400000-599999,
> > 600000-799999, 800000-899999, 900000-999999
> >
> > Since the value you are prepending to your keys is essentially random,
> > I don't see why your regions would only fill halfway. A new, hashed
> > key would be just as likely to fall within 800000-899999 as it would
> > be to fall within 900000-999999.
> >
> > Are we working from different assumptions?
> >
> > On Tue, May 5, 2015 at 4:46 PM, Michael Segel <[email protected]> wrote:
> >
> >> Yes, what you described, mod(hash(rowkey), n) where n is the number
> >> of regions, will remove the hotspotting issue.
> >>
> >> However, if your key is sequential, you will only have regions half
> >> full post region split.
> >>
> >> Look at it this way...
> >>
> >> If I have a key that is a sequential count 1, 2, 3, 4, 5 ... I am
> >> always adding a new row to the last region, and it's always being
> >> added to the right (reading left to right). Always at the end of the
> >> line...
> >>
> >> So if I have 10,000 rows and I split the region... region 1 has 0 to
> >> 4,999 and region 2 has 5,000 to 10,000.
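[Editor's note: the half-full-region effect is easy to see in a toy
simulation. Assumed model: a region splits at its midpoint once it holds
a fixed number of rows; real HBase splits on store-file size, but the
shape of the result is the same.]

```python
SPLIT_AT = 10          # split a region once it holds this many rows

regions = [[]]         # each region is a list of keys, kept in key order
for key in range(100): # monotonically increasing rowkeys
    regions[-1].append(key)           # a new key always lands in the tail region
    if len(regions[-1]) >= SPLIT_AT:
        full = regions.pop()
        mid = len(full) // 2
        regions.append(full[:mid])    # lower half: never receives a row again
        regions.append(full[mid:])    # upper half: takes all future inserts

sizes = [len(r) for r in regions]
print(sizes)  # every region sits at SPLIT_AT // 2 rows -- half the threshold
```

With a sequential key, every split strands its lower half at half the
split threshold forever; only the tail region ever grows.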
> >> Now my next row is 10001, the following is 10002... so they will be
> >> added at the tail end of region 2 until it splits. (And so on, and so
> >> on...)
> >>
> >> If you take a modulus of the hash, you create n buckets. Again, for
> >> each bucket... I will still be adding a new, larger number, so it
> >> will be added to the right-hand side, or tail, of the list.
> >>
> >> Once a region is split... that's it.
> >>
> >> Bucketing will solve the hot spotting issue by creating n lists of
> >> rows, but you're still always adding to the end of the list.
> >>
> >> Does that make sense?
> >>
> >>> On May 5, 2015, at 10:04 AM, jeremy p <[email protected]> wrote:
> >>>
> >>> Thank you for your response!
> >>>
> >>> So I guess 'salt' is a bit of a misnomer. What I used to do is this:
> >>>
> >>> 1) Say that my key value is something like '1234foobar'
> >>> 2) I obtain the hash of '1234foobar'. Let's say that's '54824923'
> >>> 3) I mod the hash by my number of regions. Let's say I have 2000
> >>> regions. 54824923 % 2000 = 923
> >>> 4) I prepend that value to my original key value, so my new key is
> >>> '923_1234foobar'
> >>>
> >>> Is this the same thing you were talking about?
> >>>
> >>> A couple of questions:
> >>>
> >>> * Why would my regions only be 1/2 full?
> >>> * Why would I only use this for sequential keys? I would think this
> >>> would give better performance in any situation where I don't need
> >>> range scans. For example, let's say my key value is a person's last
> >>> name. That will naturally cluster around certain letters, giving me
> >>> an uneven distribution.
> >>>
> >>> --Jeremy
> >>>
> >>> On Sun, May 3, 2015 at 11:46 AM, Michael Segel <[email protected]> wrote:
> >>>
> >>>> Yes, don't use a salt. Salt implies that your seed is orthogonal
> >>>> (read: random) to the base table row key.
> >>>> You're better off using a truncated hash (md5 is fastest) so that
> >>>> at least you can use a single get().
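[Editor's note: Jeremy's four steps can be sketched directly. Toy Python,
not HBase client code; the thread never names the hash function, so md5
is an assumption here, and the exact prefix below will differ from the
'923' in the example.]

```python
import hashlib

N_REGIONS = 2000

def salted_key(original: str) -> str:
    h = int(hashlib.md5(original.encode()).hexdigest(), 16)  # step 2: hash it
    bucket = h % N_REGIONS                                   # step 3: mod by region count
    return f"{bucket:04d}_{original}"                        # step 4: prepend, zero-padded

key = salted_key("1234foobar")   # step 1: the original key value
print(key)                       # the exact prefix depends on the hash chosen
```

Zero-padding the prefix matters: without a fixed width, '9_x' would sort
after '100_x' lexicographically and the bucket boundaries would be wrong.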
> >>>> Common?
> >>>>
> >>>> Only if your row key is mostly sequential.
> >>>>
> >>>> Note that even with bucketing, you will still end up with regions
> >>>> only 1/2 full, the only exception being the last region.
> >>>>
> >>>>> On May 1, 2015, at 11:09 AM, jeremy p <[email protected]> wrote:
> >>>>>
> >>>>> Hello all,
> >>>>>
> >>>>> I've been out of the HBase world for a while, and I'm just now
> >>>>> jumping back in.
> >>>>>
> >>>>> As of HBase .94, it was still common to take a hash of your RowKey
> >>>>> and use that to "salt" the beginning of your RowKey to obtain an
> >>>>> even distribution among your region servers. Is this still a
> >>>>> common practice, or is there a better way to do this in HBase 1.0?
> >>>>>
> >>>>> --Jeremy
> >>>>
> >>>> The opinions expressed here are mine; while they may reflect a
> >>>> cognitive thought, that is purely accidental.
> >>>> Use at your own risk.
> >>>> Michael Segel
> >>>> michael_segel (AT) hotmail.com
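[Editor's note: Michael's salt-vs-hash distinction can be sketched as
follows. Plain Python rather than the HBase client API; the dict below
is a stand-in for the table, and all names are illustrative.]

```python
import hashlib
import random

def hash_prefixed(key: str) -> str:
    # truncated md5 prefix: recomputable from the key alone
    return hashlib.md5(key.encode()).hexdigest()[:2] + "_" + key

def random_salted(key: str, buckets: int = 16) -> str:
    # a true salt: orthogonal (random) with respect to the key
    return f"{random.randrange(buckets):02d}_{key}"

table = {}                                # stand-in for the HBase table
table[hash_prefixed("row-1000")] = "v"    # write path

# read path with a hash prefix: recompute it and issue ONE get()
assert table[hash_prefixed("row-1000")] == "v"

# read path with a random salt: the prefix cannot be recomputed from the
# key, so a reader would have to probe all `buckets` candidate rows.
```

That recomputability is why Michael prefers the truncated hash: both
schemes spread writes, but only the hash prefix preserves cheap point
reads.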
