I haven't written against this API yet, so I don't know all these answers off the top of my head. The interfaces you're interested in are the preCompact* methods in http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/coprocessor/RegionObserver.html
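To make the hook concrete: the preCompact* methods let a RegionObserver wrap the scanner that feeds a compaction, so cells it filters out simply never reach the rewritten store file. The sketch below models only the per-cell, per-user TTL decision such a wrapper could apply, in plain Java with no HBase dependencies -- the Cell record, the perUserTtlDays map, and the userId-prefixed key layout are illustrative assumptions, not HBase API:

```java
import java.util.*;
import java.util.stream.*;

public class TtlFilterSketch {
    // Illustrative stand-in for an HBase Cell: the userId portion of the
    // row key, plus the cell's write timestamp in milliseconds.
    record Cell(String userId, long timestampMillis, String value) {}

    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Per-user retention, as in the thread: N days, different per user.
    static boolean isExpired(Cell c, Map<String, Integer> perUserTtlDays, long nowMillis) {
        Integer days = perUserTtlDays.get(c.userId());
        if (days == null) return false;                 // no policy -> keep the cell
        return nowMillis - c.timestampMillis() > days * DAY_MS;
    }

    // A compaction-scanner wrapper would apply exactly this predicate as
    // cells stream through; expired data is dropped instead of being
    // found later by a full scan + Delete pass.
    static List<Cell> compact(List<Cell> in, Map<String, Integer> perUserTtlDays, long now) {
        return in.stream()
                 .filter(c -> !isExpired(c, perUserTtlDays, now))
                 .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long now = 100 * DAY_MS;
        Map<String, Integer> ttl = Map.of("userA", 7, "userB", 30);
        List<Cell> cells = List.of(
            new Cell("userA", now - 10 * DAY_MS, "old"),    // past userA's 7 days -> dropped
            new Cell("userA", now - 1 * DAY_MS, "fresh"),   // kept
            new Cell("userB", now - 10 * DAY_MS, "kept"));  // within userB's 30 days -> kept
        System.out.println(compact(cells, ttl, now).size()); // prints 2
    }
}
```

Note this replaces the scan + write-delete pattern only for expiration; the merge case discussed below is a separate question.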
On Fri, Jan 9, 2015 at 6:35 AM, Otis Gospodnetic <[email protected]> wrote:

> Hi,
>
> What Nick suggests below about using a Compaction Coprocessor sounds
> potentially very useful for us. Q below.
>
> On Wed, Jan 7, 2015 at 8:21 PM, Nick Dimiduk <[email protected]> wrote:
>
> > Not to dig too deep into ancient history, but Tsuna's comments are
> > mostly still relevant today, except for...
> >
> > > You also generally end up with fewer, bigger regions, which is almost
> > > always better. This entails that your RS are writing more data to
> > > fewer WALs, which leads to more sequential writes across the board.
> > > You'll end up with fewer HLogs, which is also a good thing.
> >
> > HBase is one WAL per region server and has been for as long as I've
> > paid attention. Unless I've missed something, the number of tables
> > doesn't change this fixed number.
> >
> > > If you use HBase's client (which is most likely the case as the only
> > > other alternative is asynchbase), beware that you need to create one
> > > HTable instance per table per thread in your application code.
> >
> > You can still write your client application this way, but the preferred
> > idiom is to use a single Connection instance from which all these
> > resources are shared across HTable instances. This pattern is
> > reinforced in the new client API introduced in 1.0.
> >
> > FYI, I think you can write a Compaction coprocessor that implements
> > your data expiration policy through normal compaction operations,
> > thereby removing the necessity of the (expensive?) scan + write delete
> > pattern entirely.
>
> We actually do 2 types of full scans:
> 1) scan everything and delete rows > N days old, where N can be
> different for different users
> 2) scan everything and merge multiple rows into 1 row via HBaseHUT -
> https://github.com/sematext/HBaseHUT
>
> 2) is more expensive than 1).
> I'm wondering if we could use Compaction Coprocessor for 2)? HBaseHUT
> needs to be able to grab N rows and merge them into 1, delete those N
> rows, and just write that 1 new row. This N could be several thousand
> rows. Could Compaction Coprocessor really be used for that?
>
> Also, would that come into play during minor or major compactions, or
> both?
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
> > -n
> >
> > On Wed, Jan 7, 2015 at 9:27 AM, Otis Gospodnetic <
> > [email protected]> wrote:
> >
> > > Hi,
> > >
> > > It's been asked before, but I didn't find any *definite* answers, and
> > > a lot of the answers I found are from a whiiiile back.
> > >
> > > e.g. Tsuna provided pretty convincing info here:
> > > http://search-hadoop.com/m/xAiiO8ttU2/%2522%2522I+generally+recommend+to+stick+to+a+single+table%2522&subj=Re+One+table+or+multiple+tables+
> > >
> > > ... but that is from 3 years ago. Maybe things changed?
> > >
> > > Here's our use case:
> > >
> > > Data/table layout:
> > > * HBase is used for storing metrics at different granularities (1 min,
> > > 5 min, ... - a total of 6 different granularities)
> > > * It's a multi-tenant system
> > > * Keys are carefully crafted and include userId + number, where this
> > > number contains the time and the granularity
> > > * Everything's in 1 table and 1 CF
> > >
> > > Access:
> > > * We only access 1 system at a time, for a specific time range, and a
> > > specific granularity
> > > * We periodically scan ALL data and delete data older than N days,
> > > where N varies from user to user
> > > * We periodically scan ALL data and merge multiple rows (of the same
> > > granularity) into 1
> > >
> > > Question:
> > > Would there be any advantage in having 6 tables - one for each
> > > granularity - instead of having everything in 1 table?
> > > Assume each table would still have just 1 CF and the keys would
> > > remain the same.
> > >
> > > Thanks,
> > > Otis
> > > --
> > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > Solr & Elasticsearch Support * http://sematext.com/
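On the merge question (2) above: a compaction scanner sees data in sorted row-key order, so the rows an HBaseHUT-style merge would combine arrive adjacent to each other, and a fold over consecutive rows sharing a logical group key could emit one merged row per run -- the original rows simply aren't written to the compacted file, so no explicit deletes are needed. The sketch below shows that fold in plain Java; the Row type, the "userId|metric|bucket" group key, and the sum-merge are illustrative assumptions, not HBaseHUT's actual API:

```java
import java.util.*;

public class CompactionMergeSketch {
    // Illustrative stand-in for a scanned row: a logical group key
    // (e.g. userId + metric + time bucket) and a numeric value to merge.
    record Row(String groupKey, long value) {}

    // Input arrives sorted by key, so mergeable rows are adjacent:
    // fold each run sharing a groupKey into a single output row
    // (summing here; HBaseHUT would apply its own merge logic).
    static List<Row> mergeAdjacent(List<Row> sorted) {
        List<Row> out = new ArrayList<>();
        for (Row r : sorted) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).groupKey().equals(r.groupKey())) {
                out.set(last, new Row(r.groupKey(), out.get(last).value() + r.value()));
            } else {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> in = List.of(
            new Row("userA|cpu|t1", 3), new Row("userA|cpu|t1", 4),
            new Row("userA|cpu|t1", 5), new Row("userB|cpu|t1", 1));
        // Three userA rows collapse into one (3+4+5=12); userB passes through.
        System.out.println(mergeAdjacent(in).size()); // prints 2
    }
}
```

Whether this covers the "several thousand rows" case, and whether it should run on minor compactions, major compactions, or both, are exactly the open questions in the thread; a streaming fold like this at least bounds memory to one in-progress merged row per run.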
