Re: HBase Questions

Michael Segel Sun, 03 May 2015 08:45:05 -0700

For #1, 

You really don’t want to do what is suggested by the HBase book. 
Yes, you can do it, but just because you can do something doesn’t 
mean you should. It’s really bad advice.

HBase is IRT not CRUD.  
(IRT == Insert, Read, Tombstone) 

If there is a temporal component to your data, store the values in different cells 
where time becomes part of your column descriptor. 
Of the use cases so far, Splice Machine’s relational model seems to make the 
most of the versioning. They can control the depth and timeouts when they roll 
back transactions… this is where tombstones come into play. (Although 
isolation levels and RDBMS RLL also come into play.) [Note RLL in HBase != RDBMS 
RLL]
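
To illustrate the point about making time part of the column descriptor, here is a minimal sketch. The helper name temporal_qualifier, the "name:" layout, and the 8-byte big-endian encoding are assumptions made for the example, not anything HBase prescribes. It only builds the qualifier bytes and does not talk to a cluster.

```python
import struct


def temporal_qualifier(name: str, epoch_millis: int) -> bytes:
    """Build a column qualifier of the form b'<name>:' + 8-byte big-endian time.

    Hypothetical layout: a fixed-width big-endian timestamp makes the
    qualifiers sort chronologically when compared as raw bytes.
    """
    return name.encode("utf-8") + b":" + struct.pack(">Q", epoch_millis)


# Each reading gets its own cell instead of a new version of one cell.
readings = {
    temporal_qualifier("temp", 1430672400000): b"22.1",
    temporal_qualifier("temp", 1430668800000): b"21.5",
}

# Sorting by raw bytes yields the readings in time order.
for qualifier in sorted(readings):
    print(qualifier, readings[qualifier])
```

With this kind of layout each point in time is an ordinary cell, so nothing depends on how many versions the column family is configured to keep.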

For #2,

Why use SHA1+document ID? 

While SHA1 may have collisions, I can’t recall ever seeing one, although it’s 
possible with a large enough data set. 
SHA1 and SHA2 are slower than MD5.  

If you want a somewhat even distribution, you could use the 
MD5 hash, which is faster, truncate it, and prepend it to the document ID. 
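
A minimal sketch of the truncated-MD5 prefix, using Python's standard hashlib. The function name salted_rowkey and the 4-byte prefix length are assumptions picked for the example; choose a length that suits your region count.

```python
import hashlib

PREFIX_BYTES = 4  # assumption: how much of the MD5 digest to keep


def salted_rowkey(document_id: str) -> bytes:
    """Return truncated MD5(document_id) followed by the document ID itself."""
    doc = document_id.encode("utf-8")
    prefix = hashlib.md5(doc).digest()[:PREFIX_BYTES]
    return prefix + doc


key = salted_rowkey("doc-000123")
print(key.hex())
```

Because the full document ID still follows the prefix, two documents whose truncated hashes happen to match still get distinct rowkeys.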

If the Document IDs are not being inserted in sequence, you shouldn’t have to 
worry about hot spotting. 

If you use the hash, you lose the ability to do range scans, so you have 
to know your document ID in order to generate the hash and get your document. 
That’s your only access method besides a full table scan or secondary 
indexes. 
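
The trade-off can be seen in a small sketch that stands in for the table with a plain dict. Everything here (salted_rowkey, get_document, the 4-byte prefix, the sample IDs) is hypothetical and only models rowkey ordering; it is not HBase client code.

```python
import hashlib


def salted_rowkey(document_id: str) -> bytes:
    """Hypothetical rowkey: first 4 bytes of MD5(document_id) + document_id."""
    doc = document_id.encode("utf-8")
    return hashlib.md5(doc).digest()[:4] + doc


# Stand-in for the table: rowkey -> value.
doc_ids = [f"doc-{i:04d}" for i in range(100)]
table = {salted_rowkey(d): f"contents of {d}" for d in doc_ids}


def get_document(document_id: str) -> str:
    """Point lookup: recompute the rowkey from the known document ID."""
    return table[salted_rowkey(document_id)]


# Rows ordered by rowkey bytes, which is the order a scan would return them in.
scan_order = [key[4:].decode("utf-8") for key in sorted(table)]

print(get_document("doc-0042"))
print(scan_order[:5])  # neighbouring document IDs are no longer adjacent
```

A point lookup works as long as you hold the document ID, but a scan from doc-0010 to doc-0020 has no contiguous key range to cover.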

> On May 3, 2015, at 9:37 AM, Ted Yu <[email protected]> wrote:
> 
> For #1, see http://hbase.apache.org/book.html#versions and
> http://hbase.apache.org/book.html#schema.versions
> 
> Cheers
> 
> On Fri, May 1, 2015 at 9:17 PM, Arun Patel <[email protected]> wrote:
> 
>> 1) Are there any problems having many versions for a column family?  What's
>> the recommended limit?
>> 
>> 2) We have created a table for storing documents related data.  All
>> applications in our company are storing their documents data in same table
>> with rowkey as SHA1+Document ID.  Table is growing pretty rapidly.  I am
>> not seeing any issues as of now.  But, what kind of problems can be
>> expected with this approach in future?  First of all, Is this approach
>> correct?
>> 
>> Thanks,
>> Arun
>> 

The opinions expressed here are mine, while they may reflect a cognitive 
thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com
