Hi,

What Nick suggests below about using a Compaction Coprocessor sounds potentially very
useful for us. Q below.
On Wed, Jan 7, 2015 at 8:21 PM, Nick Dimiduk <[email protected]> wrote:

> Not to dig too deep into ancient history, but Tsuna's comments are mostly
> still relevant today, except for...
>
> > You also generally end up with fewer, bigger regions, which is almost
> > always better. This entails that your RS are writing more data to fewer
> > WALs, which leads to more sequential writes across the board. You'll end
> > up with fewer HLogs, which is also a good thing.
>
> HBase is one WAL per region server and has been for as long as I've paid
> attention. Unless I've missed something, the number of tables doesn't
> change this fixed number.
>
> > If you use HBase's client (which is most likely the case as the only
> > other alternative is asynchbase), beware that you need to create one
> > HTable instance per table per thread in your application code.
>
> You can still write your client application this way, but the preferred
> idiom is to use a single Connection instance from which all these
> resources are shared across HTable instances. This pattern is reinforced
> in the new client API introduced in 1.0.
>
> FYI, I think you can write a Compaction coprocessor that implements your
> data expiration policy through normal compaction operations, thereby
> removing the necessity of the (expensive?) scan + write delete pattern
> entirely.

We actually do 2 types of full scans:

1) scan everything and delete rows older than N days, where N can be
different for different users
2) scan everything and merge multiple rows into 1 row via HBaseHUT -
https://github.com/sematext/HBaseHUT

2) is more expensive than 1). I'm wondering if we could use a Compaction
Coprocessor for 2)? HBaseHUT needs to be able to grab N rows and merge them
into 1, delete those N rows, and just write that 1 new row. This N could be
several thousand rows. Could a Compaction Coprocessor really be used for
that? Also, would that come into play during minor or major compactions, or
both?
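To make the question more concrete, below is a rough, untested sketch of what I
imagine for the simpler case, 1) (per-user expiration), written against the
0.98 / 1.0-era RegionObserver API. The class name and the ttlForUser() helper are
made up, and the exact method signatures may differ between HBase versions. For
case 2) I assume the same wrapper would additionally have to buffer all the cells
of the N rows being merged and emit the merged row's cells instead, which is
exactly the part I'm not sure is practical for several thousand rows.

// Untested sketch, HBase 0.98 / 1.0-era API. PerUserTtlCompactionObserver and
// ttlForUser() are hypothetical names; adjust signatures for your HBase version.
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.ScanType;
import org.apache.hadoop.hbase.regionserver.Store;

public class PerUserTtlCompactionObserver extends BaseRegionObserver {

  @Override
  public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> ctx,
      Store store, final InternalScanner scanner, ScanType scanType) throws IOException {
    // Wrap the compaction scanner: every cell the compaction rewrites passes through
    // next(), so cells we drop here never make it into the new HFile.
    return new InternalScanner() {
      @Override
      public boolean next(List<Cell> results) throws IOException {
        boolean more = scanner.next(results);
        dropExpired(results);
        return more;
      }

      @Override
      public boolean next(List<Cell> results, int limit) throws IOException {
        boolean more = scanner.next(results, limit);
        dropExpired(results);
        return more;
      }

      private void dropExpired(List<Cell> results) {
        long now = System.currentTimeMillis();
        Iterator<Cell> it = results.iterator();
        while (it.hasNext()) {
          Cell cell = it.next();
          // The userId is a prefix of our row key, so the per-user retention can be
          // derived from the row itself.
          long maxAgeMs = ttlForUser(CellUtil.cloneRow(cell));
          if (now - cell.getTimestamp() > maxAgeMs) {
            it.remove();
          }
        }
      }

      @Override
      public void close() throws IOException {
        scanner.close();
      }
    };
  }

  // Hypothetical helper: parse the userId prefix out of the row key and look up that
  // user's retention, e.g. from a cached config. Hard-coded to 30 days here.
  private long ttlForUser(byte[] rowKey) {
    return 30L * 24 * 60 * 60 * 1000;
  }
}

(I believe the same hook fires for both minor and major compactions, with the
ScanType argument indicating which, but I'd love confirmation on that.)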
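Separately, on the single shared Connection idiom mentioned above: just to
double-check that we understand the 1.0 client API correctly, is the pattern
below roughly what's intended? (Untested sketch; the "metrics" table name and
the row key are just placeholders for our setup.)

// Untested sketch of the shared-Connection idiom as we understand the 1.0 client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SharedConnectionExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // One heavyweight Connection per process, shared by all threads.
    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      // Lightweight Table instances are obtained per use (or per thread) and closed
      // when done; they all share the Connection's resources underneath.
      try (Table metrics = connection.getTable(TableName.valueOf("metrics"))) {
        Result result = metrics.get(new Get(Bytes.toBytes("userId42-20150107-1min")));
        System.out.println(result.isEmpty() ? "no such row" : result);
      }
    }
  }
}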
Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

> -n
>
> On Wed, Jan 7, 2015 at 9:27 AM, Otis Gospodnetic <
> [email protected]> wrote:
>
> > Hi,
> >
> > It's been asked before, but I didn't find any *definite* answers, and a
> > lot of the answers I found are from a whiiiile back.
> >
> > e.g. Tsuna provided pretty convincing info here:
> > http://search-hadoop.com/m/xAiiO8ttU2/%2522%2522I+generally+recommend+to+stick+to+a+single+table%2522&subj=Re+One+table+or+multiple+tables+
> >
> > ... but that is from 3 years ago. Maybe things changed?
> >
> > Here's our use case:
> >
> > Data/table layout:
> > * HBase is used for storing metrics at different granularities (1 min,
> > 5 min, ... - a total of 6 different granularities)
> > * It's a multi-tenant system
> > * Keys are carefully crafted and include userId + a number, where this
> > number encodes the time and the granularity
> > * Everything's in 1 table and 1 CF
> >
> > Access:
> > * We only access 1 system at a time, for a specific time range and a
> > specific granularity
> > * We periodically scan ALL data and delete data older than N days,
> > where N varies from user to user
> > * We periodically scan ALL data and merge multiple rows (of the same
> > granularity) into 1
> >
> > Question:
> > Would there be any advantage in having 6 tables - one for each
> > granularity - instead of having everything in 1 table?
> > Assume each table would still have just 1 CF and the keys would remain
> > the same.
> >
> > Thanks,
> > Otis
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
