bq. revive some notion of tiered compaction

Did you have a chance to try out Stripe compaction?
Thanks

On Mon, Aug 3, 2015 at 11:14 AM, Dave Latham <[email protected]> wrote:

> Jean-Marc,
>
> "Recent" is often last 24 hours or so, though if this is worked out I may
> use it for other ranges as well. Yes, currently there are weekly major
> compactions, so recently compacted regions would not be able to exclude the
> old store files. That's why I'm also hoping to revive some notion of tiered
> compaction to keep older data in separate store files from recent data.
>
> Dave
>
> On Sun, Aug 2, 2015 at 6:22 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>
> > Just thinking out loud:
> > "Cutting out the old store files could well also reduce disk IO for
> > that family by 100x."
> >
> > What is "recent" for your data? More than 7 days? Or less? Don't you have
> > weekly major compactions? If so, and if you are scanning for more than 7
> > days, then you will read the older files anyway, no?
> >
> > JM
> >
> > On 2015-08-02 05:57, "Ted Yu" <[email protected]> wrote:
> >
> > > Dave:
> > > I wonder if Filter response can be enhanced in the following manner:
> > >
> > > http://pastebin.com/sb6apTPm
> > >
> > > My approach is based on using essential column family (column family A
> > > in your case) to guide whether the remaining column families should be
> > > loaded. To be specific, if outside the TimeRange you specify (last day),
> > > your filter returns ReturnCode.INCLUDE_AND_SEEK_NEXT_ROW.
> > >
> > > What do you think?
> > >
> > > Cheers
> > >
> > > On Sat, Aug 1, 2015 at 8:06 PM, Dave Latham <[email protected]> wrote:
> > >
> > > > Thanks for brainstorming, Ted. That sounds like option 2 I listed,
> > > > using a separate scanner for A vs B, which "adds complexity to the job
> > > > and gives up the atomicity/consistency guarantees as new writes hit
> > > > both column families".
> > > >
> > > > On Sat, Aug 1, 2015 at 9:07 AM, Ted Yu <[email protected]> wrote:
> > > >
> > > > > Can you achieve your goal with two scans?
> > > > > The first scan specifies TimeRange corresponding to last day. This
> > > > > scan returns both column families.
> > > > > The other scan specifies TimeRange excluding last day. This scan
> > > > > returns column family A.
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Sat, Aug 1, 2015 at 8:35 AM, Dave Latham <[email protected]> wrote:
> > > > >
> > > > > > Hi Ted,
> > > > > >
> > > > > > Thanks for the suggestion, but I'm not sure that it helps my case
> > > > > > much. I wasn't very familiar with the feature, and it doesn't seem
> > > > > > very well documented - I had to go to the source and the
> > > > > > originating JIRA to understand how it works. It sounds like it
> > > > > > allows you to mark which column families the filter operates on
> > > > > > ("essential" seems an odd name). If any data from those column
> > > > > > families passes the filter, then the scan loads and includes data
> > > > > > from the remaining families without filtering it. In my case, it's
> > > > > > not clear from a row's family A whether or not family B for that
> > > > > > row is required (though that could probably be added). Moreover,
> > > > > > even if a row has recent data, we don't want to load all the old
> > > > > > data from that row. We'd prefer to be able to entirely skip reading
> > > > > > the data off disk for the old store files.
> > > > > >
> > > > > > Dave
> > > > > >
> > > > > > On Sat, Aug 1, 2015 at 7:53 AM, Ted Yu <[email protected]> wrote:
> > > > > >
> > > > > > > Have you considered using essential column family feature
> > > > > > > (through Filter)?
> > > > > > > In your case A would be the essential column family.
> > > > > > > Within TimeRange for recent data, the filter would return both
> > > > > > > column families.
> > > > > > > Outside the TimeRange, only family A is returned.
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > > On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <[email protected]> wrote:
> > > > > > >
> > > > > > > > I have a table with 2 column families, call them A and B, with
> > > > > > > > new data regularly being added. They are very different sizes:
> > > > > > > > B is 100x the size of A. Among other uses for this data, I have
> > > > > > > > a MapReduce job that needs to read all of A, but only recent
> > > > > > > > data from B (e.g. last day). Here are some methods I've
> > > > > > > > considered:
> > > > > > > >
> > > > > > > > 1. Use a Filter to throw out older data from B (this is what I
> > > > > > > >    currently do). However, all the data from B still needs to
> > > > > > > >    be read from disk, causing a disk IO bottleneck.
> > > > > > > > 2. Configure the table input format to read from B only, using
> > > > > > > >    a TimeRange for recent data, and have each map task open a
> > > > > > > >    separate scanner for A (without a TimeRange), then merge the
> > > > > > > >    data in the map task. However, this adds complexity to the
> > > > > > > >    job and gives up the atomicity/consistency guarantees as new
> > > > > > > >    writes hit both column families.
> > > > > > > > 3. Add a new column family C to the table with an additional
> > > > > > > >    copy of the data in B, but set a TTL on it. All writes
> > > > > > > >    duplicate the data written to B and C. Change the scan to
> > > > > > > >    include C instead of B. However, this adds all the overhead
> > > > > > > >    of another column family, more writes, and having to set the
> > > > > > > >    TTL to the maximum of any time window I want to scan
> > > > > > > >    efficiently.
> > > > > > > > 4. Implement an enhancement to HBase's Scan to allow giving
> > > > > > > >    each column family its own TimeRange. The job would then be
> > > > > > > >    able to skip most old large store files (hopefully all of
> > > > > > > >    them with tiered compaction at some point).
> > > > > > > >
> > > > > > > > Does anyone have other suggestions? Would HBase be willing to
> > > > > > > > accept updating Scan to have different TimeRanges for each
> > > > > > > > column family?
> > > > > > > >
> > > > > > > > Dave
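
For anyone following option 4 in the quoted thread, here is a minimal Java sketch of the idea, not HBase code. It assumes each store file records the minimum and maximum cell timestamp it contains, and that a scan may skip any file whose timestamps do not overlap the range requested for that file's column family. The names `FamilyTimeRangeSketch`, `StoreFile`, `Range`, and `filesToRead` are hypothetical and exist only for illustration. The half-open `[min, max)` range mirrors the convention of HBase's `TimeRange`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FamilyTimeRangeSketch {

    /** Hypothetical stand-in for a store file: family, name, min/max cell timestamp. */
    static final class StoreFile {
        final String family;
        final String name;
        final long minTs; // smallest cell timestamp in the file
        final long maxTs; // largest cell timestamp in the file (inclusive)

        StoreFile(String family, String name, long minTs, long maxTs) {
            this.family = family;
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /** Half-open timestamp range [min, max). */
    static final class Range {
        final long min;
        final long max;

        Range(long min, long max) {
            this.min = min;
            this.max = max;
        }

        /** True if the file may hold at least one cell inside this range. */
        boolean overlaps(StoreFile f) {
            return f.maxTs >= min && f.minTs < max;
        }
    }

    /**
     * Returns the files a scan would have to open. A family with no entry in
     * perFamily is read in full; a family with an entry skips every file
     * whose timestamps fall entirely outside that family's range.
     */
    static List<StoreFile> filesToRead(List<StoreFile> files, Map<String, Range> perFamily) {
        List<StoreFile> selected = new ArrayList<>();
        for (StoreFile f : files) {
            Range r = perFamily.get(f.family);
            if (r == null || r.overlaps(f)) {
                selected.add(f);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long now = 30 * day; // arbitrary "current time" for the example

        List<StoreFile> files = Arrays.asList(
            new StoreFile("A", "a1", 0, 29 * day),
            new StoreFile("A", "a2", 29 * day + 1, now),
            new StoreFile("B", "b-old", 0, 22 * day),
            new StoreFile("B", "b-week", 22 * day + 1, 29 * day - 1),
            new StoreFile("B", "b-recent", 29 * day, now));

        // All of A, but only the last day of B.
        Map<String, Range> perFamily = new HashMap<>();
        perFamily.put("B", new Range(now - day, Long.MAX_VALUE));

        for (StoreFile f : filesToRead(files, perFamily)) {
            System.out.println(f.family + "/" + f.name);
        }
    }
}
```

With these made-up files, only `b-recent` is opened for family B, while both A files are read. The sketch also shows Jean-Marc's point: if a major compaction has merged B into one file spanning every timestamp, that file overlaps any range and is read anyway, which is why keeping older data in separate store files matters for this approach.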
