Otis: You can find examples of how these methods are used in Phoenix. Namely:

phoenix-core//src/main/java/org/apache/hadoop/hbase/regionserver/IndexHalfStoreFileReaderGenerator.java
phoenix-core//src/main/java/org/apache/phoenix/coprocessor/UngroupedAggregateRegionObserver.java
phoenix-core//src/main/java/org/apache/phoenix/hbase/index/Indexer.java
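For a quick orientation before digging into those classes, here is a minimal, illustrative sketch of a preCompact hook written against the 0.98-era coprocessor API. The class name, the fixed 30-day TTL, and the filtering logic are my own assumptions for the example, not how Phoenix implements it:

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.ScanType;
import org.apache.hadoop.hbase.regionserver.Store;

// Illustrative only: drops cells older than MAX_AGE_MS while a compaction
// rewrites the store, so old data ages out without an explicit scan + Delete.
public class ExpireOldCellsObserver extends BaseRegionObserver {

  private static final long MAX_AGE_MS = 30L * 24 * 60 * 60 * 1000; // hypothetical 30-day TTL

  @Override
  public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> ctx,
      Store store, final InternalScanner scanner, ScanType scanType) throws IOException {
    final long cutoff = System.currentTimeMillis() - MAX_AGE_MS;
    // Wrap the compaction scanner: any cell removed from the result list is
    // simply not written into the new HFile produced by the compaction.
    return new InternalScanner() {
      private void filterOld(List<Cell> cells) {
        Iterator<Cell> it = cells.iterator();
        while (it.hasNext()) {
          if (it.next().getTimestamp() < cutoff) {
            it.remove();
          }
        }
      }

      @Override
      public boolean next(List<Cell> results) throws IOException {
        boolean more = scanner.next(results);
        filterOld(results);
        return more;
      }

      @Override
      public boolean next(List<Cell> results, int limit) throws IOException {
        boolean more = scanner.next(results, limit);
        filterOld(results);
        return more;
      }

      @Override
      public void close() throws IOException {
        scanner.close();
      }
    };
  }
}

Registered via hbase.coprocessor.region.classes (or as a per-table coprocessor attribute), a hook like this runs for both minor and major compactions; if memory serves, there is also a preCompact overload that takes a CompactionRequest, which lets you distinguish the two.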
FYI

On Fri, Jan 9, 2015 at 12:03 PM, Nick Dimiduk <[email protected]> wrote:

> I haven't written against this API yet, so I don't know all these answers
> off the top of my head. The interface methods you're interested in are the
> preCompact* methods in
>
> http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/coprocessor/RegionObserver.html
>
> On Fri, Jan 9, 2015 at 6:35 AM, Otis Gospodnetic <[email protected]> wrote:
>
> > Hi,
> >
> > What Nick suggests below about using a Compaction Coprocessor sounds
> > potentially very useful for us. Q below.
> >
> > On Wed, Jan 7, 2015 at 8:21 PM, Nick Dimiduk <[email protected]> wrote:
> >
> > > Not to dig too deep into ancient history, but Tsuna's comments are
> > > mostly still relevant today, except for...
> > >
> > > > You also generally end up with fewer, bigger regions, which is almost
> > > > always better. This entails that your RS are writing more data to
> > > > fewer WALs, which leads to more sequential writes across the board.
> > > > You'll end up with fewer HLogs, which is also a good thing.
> > >
> > > HBase is one WAL per region server and has been for as long as I've
> > > paid attention. Unless I've missed something, the number of tables
> > > doesn't change this fixed number.
> > >
> > > > If you use HBase's client (which is most likely the case as the only
> > > > other alternative is asynchbase), beware that you need to create one
> > > > HTable instance per table per thread in your application code.
> > >
> > > You can still write your client application this way, but the preferred
> > > idiom is to use a single Connection instance from which all these
> > > resources are shared across HTable instances. This pattern is
> > > reinforced in the new client API introduced in 1.0.
> > >
> > > FYI, I think you can write a Compaction coprocessor that implements
> > > your data expiration policy through normal compaction operations,
> > > thereby removing the necessity of the (expensive?) scan + write delete
> > > pattern entirely.
> >
> > We actually do 2 types of full scans:
> > 1) scan everything and delete rows > N days old, where N can be
> >    different for different users
> > 2) scan everything and merge multiple rows into 1 row via HBaseHUT -
> >    https://github.com/sematext/HBaseHUT
> >
> > 2) is more expensive than 1).
> > I'm wondering if we could use a Compaction Coprocessor for 2)? HBaseHUT
> > needs to be able to grab N rows and merge them into 1, delete those N
> > rows, and just write that 1 new row. This N could be several thousand
> > rows. Could a Compaction Coprocessor really be used for that?
> >
> > Also, would that come into play during minor or major compactions, or both?
> >
> > Thanks,
> > Otis
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> > > -n
> > >
> > > On Wed, Jan 7, 2015 at 9:27 AM, Otis Gospodnetic <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > It's been asked before, but I didn't find any *definite* answers, and
> > > > a lot of the answers I found are from a whiiiile back.
> > > >
> > > > e.g. Tsuna provided pretty convincing info here:
> > > > http://search-hadoop.com/m/xAiiO8ttU2/%2522%2522I+generally+recommend+to+stick+to+a+single+table%2522&subj=Re+One+table+or+multiple+tables+
> > > >
> > > > ... but that is from 3 years ago. Maybe things changed?
> > > >
> > > > Here's our use case:
> > > >
> > > > Data/table layout:
> > > > * HBase is used for storing metrics at different granularities
> > > >   (1min, 5min, ... - a total of 6 different granularities)
> > > > * It's a multi-tenant system
> > > > * Keys are carefully crafted and include userId + number, where this
> > > >   number contains the time and the granularity
> > > > * Everything's in 1 table and 1 CF
> > > >
> > > > Access:
> > > > * We only access 1 system at a time, for a specific time range, and
> > > >   specific granularity
> > > > * We periodically scan ALL data and delete data older than N days,
> > > >   where N varies from user to user
> > > > * We periodically scan ALL data and merge multiple rows (of the same
> > > >   granularity) into 1
> > > >
> > > > Question:
> > > > Would there be any advantage in having 6 tables - one for each
> > > > granularity - instead of having everything in 1 table?
> > > > Assume each table would still have just 1 CF and the keys would
> > > > remain the same.
> > > >
> > > > Thanks,
> > > > Otis
> > > > --
> > > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > > Solr & Elasticsearch Support * http://sematext.com/
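On the "single Connection instance" / 1.0 client API point quoted above, a rough sketch of that idiom for anyone following along; the table name and row key below are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SharedConnectionExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    // One Connection per process: it owns the ZooKeeper session, RPC
    // connections, and thread pools, and is safe to share across threads.
    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      // Table instances are lightweight; get one per thread/unit of work and close it.
      try (Table metrics = connection.getTable(TableName.valueOf("metrics"))) { // "metrics" is hypothetical
        Result r = metrics.get(new Get(Bytes.toBytes("user42|1420804800|1min"))); // made-up row key
        System.out.println("cells returned: " + r.size());
      }
    }
  }
}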

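And on the row-key layout Otis describes (userId + a number that encodes time and granularity), one illustrative way such a composite key could be packed; the field order, widths, and granularity encoding here are my own assumptions, not Otis's actual scheme:

import org.apache.hadoop.hbase.util.Bytes;

// Illustrative only: packs userId + granularity + time bucket into one row key
// so a single table can serve all granularities with contiguous per-user,
// per-granularity, per-time-range scans.
public final class MetricKey {

  private MetricKey() {}

  /** granularityId: e.g. 0 = 1min, 1 = 5min, ... (assumed encoding). */
  public static byte[] encode(int userId, byte granularityId, long bucketStartMillis) {
    byte[] key = new byte[4 + 1 + 8];
    Bytes.putInt(key, 0, userId);              // tenant first: scans stay within one user
    key[4] = granularityId;                    // then granularity
    Bytes.putLong(key, 5, bucketStartMillis);  // then time, so a time range is one contiguous scan
    return key;
  }
}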