For #1, you really don’t want to do what is suggested by the HBase book. Yes, you can do it, but just because you can do something doesn’t mean you should. It’s really bad advice.
HBase is IRT, not CRUD. (IRT == Insert, Read, Tombstone) If there is a temporal component to your data, store the values in different cells, where time becomes part of your column qualifier. Of the use cases so far, Splice Machine’s relational model seems to make the most of versioning. They can control the depth and timeouts when they roll back transactions… this is where tombstones come into play. (Although isolation levels and RDBMS RLL come into play as well.) [Note: RLL in HBase != RDBMS RLL]

For #2, why use SHA1+document ID? While SHA1 may have collisions, I can’t recall ever seeing one, although it’s feasible with a large enough data set. SHA1 and SHA2 are slower than MD5. If you want a somewhat even distribution, you could use the MD5 hash, which is faster, truncate it, and prepend it to the document ID.

If the document IDs are not being inserted in sequence, you shouldn’t have to worry about hot spotting.

If you use the hash, you lose the ability to do range scans, so you have to know your document ID in order to generate the hash and get your document. That’s your only access method besides a full table scan or secondary indexes.

> On May 3, 2015, at 9:37 AM, Ted Yu <[email protected]> wrote:
>
> For #1, see http://hbase.apache.org/book.html#versions and
> http://hbase.apache.org/book.html#schema.versions
>
> Cheers
>
> On Fri, May 1, 2015 at 9:17 PM, Arun Patel <[email protected]> wrote:
>
>> 1) Are there any problems having many versions for a column family? What's
>> the recommended limit?
>>
>> 2) We have created a table for storing documents related data. All
>> applications in our company are storing their documents data in same table
>> with rowkey as SHA1+Document ID. Table is growing pretty rapidly. I am
>> not seeing any issues as of now. But, what kind of problems can be
>> expected with this approach in future? First of all, Is this approach
>> correct?
>>
>> Thanks,
>> Arun
>>

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com
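
P.S. For what it's worth, here is a rough sketch of the row key suggested for #2: a truncated MD5 prefix prepended to the document ID. It is plain Python and does not touch HBase itself. The function names, the 4-character prefix length, and the "-" separator are all assumptions made up for illustration, so adjust them to your own table.

```python
import hashlib

# Assumption: 4 hex characters (16 bits) of the MD5 digest are enough to
# spread rows across regions. This value is illustrative, not a recommendation.
PREFIX_HEX_CHARS = 4


def make_row_key(document_id: str, prefix_len: int = PREFIX_HEX_CHARS) -> bytes:
    """Build a row key of the form <truncated md5 hex>-<document id>."""
    digest = hashlib.md5(document_id.encode("utf-8")).hexdigest()
    return (digest[:prefix_len] + "-" + document_id).encode("utf-8")


def document_id_from_row_key(row_key: bytes, prefix_len: int = PREFIX_HEX_CHARS) -> str:
    """Recover the document ID by dropping the hash prefix and the separator."""
    return row_key.decode("utf-8")[prefix_len + 1:]


if __name__ == "__main__":
    key = make_row_key("invoice-000123")
    print(key)
    print(document_id_from_row_key(key))
```

Because the document ID stays in the key, nothing is lost if two IDs happen to share a prefix. The trade-off is the one noted above: a point lookup has to recompute the prefix from the document ID, and range scans over document IDs no longer work.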
