I think TimeRange is handled at a higher level, when the region scanner is created. With the data size in B 100x smaller than in A, I do not understand where the source of the IO bottleneck is.

On Aug 1, 2015 9:16 AM, "Andrew Purtell" <[email protected]> wrote:
> Hi Dave,
>
> > Would HBase be willing to accept updating Scan to have different
> > TimeRanges for each column family?
>
> We could try it. I'm not sure how familiar you are with the relevant code.
> I'm guessing some? Look at ScanQueryMatcher. This and related concerns
> govern how we search through store files. Timerange handling is done at the
> top level (the SQM). Then for each column we have a leaf tracker
> (implementing ColumnTracker) that tracks column-specific info like number
> of versions for a cell found in each. We'd need to push timerange handling
> down into the column trackers. This would be a tricky refactor on delicate
> code. I suspect we could be comfortable making this change in master and on
> branch-1 for upcoming unscheduled minor release line 1.3. Would that work?
> Or would this change need to go further back?
>
> Maybe someone else has another suggestion.
>
>
> On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <[email protected]> wrote:
>
> > I have a table with 2 column families, call them A and B, with new data
> > regularly being added. They are very different sizes: B is 100x the size of
> > A. Among other uses for this data, I have a MapReduce job that needs to
> > read all of A, but only recent data from B (e.g. last day). Here are some
> > methods I've considered:
> >
> >    1. Use a Filter to throw out older data from B (this is what I
> >    currently do). However, all the data from B still needs to be read from
> >    disk, causing a disk IO bottleneck.
> >    2. Configure the table input format to read from B only, using a
> >    TimeRange for recent data, and have each map task open a separate scanner
> >    for A (without a TimeRange) then merge the data in the map task. However,
> >    this adds complexity to the job and gives up the atomicity/consistency
> >    guarantees as new writes hit both column families.
> >    3. Add a new column family C to the table with an additional copy of the
> >    data in B, but set a TTL on it.
> >    All writes duplicate the data written to B and C. Change the scan to
> >    include C instead of B. However, this adds all the overhead of another
> >    column family, more writes, and having to set the TTL to the maximum of
> >    any time window I want to scan efficiently.
> >    4. Implement an enhancement to HBase's Scan to allow giving each column
> >    family its own TimeRange. The job would then be able to skip most old
> >    large store files (hopefully all of them with tiered compaction at some
> >    point).
> >
> > Does anyone have other suggestions? Would HBase be willing to accept
> > updating Scan to have different TimeRanges for each column family?
> >
> > Dave
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
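
To make the difference between option 1 and option 4 concrete, here is a rough, self-contained sketch of the store file skipping that option 4 relies on. It is a toy model, not HBase code: the class names, the sample files, and their sizes are all made up for illustration, and it assumes each store file records the minimum and maximum cell timestamp it contains so that a file can be excluded before any of its blocks are read. A Filter, by contrast, only sees cells after they have been read from disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types are HBase classes.
class PerFamilyTimeRangeSketch {

    /** Half-open timestamp interval [min, max). */
    static final class TimeRange {
        final long min;
        final long max;

        TimeRange(long min, long max) {
            this.min = min;
            this.max = max;
        }

        static TimeRange allTime() {
            return new TimeRange(0L, Long.MAX_VALUE);
        }

        /** True if a file whose cells span [fileMin, fileMax] (inclusive) could hold matching cells. */
        boolean overlaps(long fileMin, long fileMax) {
            return fileMax >= min && fileMin < max;
        }
    }

    /** Hypothetical store file metadata: owning family, timestamp bounds, size on disk. */
    static final class StoreFile {
        final String family;
        final String name;
        final long minTs;
        final long maxTs;
        final long sizeBytes;

        StoreFile(String family, String name, long minTs, long maxTs, long sizeBytes) {
            this.family = family;
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Keep only files whose timestamp bounds overlap the range requested for their family. */
    static List<StoreFile> selectFiles(List<StoreFile> files, Map<String, TimeRange> rangeByFamily) {
        List<StoreFile> selected = new ArrayList<>();
        for (StoreFile f : files) {
            // Families without an explicit range are scanned in full.
            TimeRange range = rangeByFamily.getOrDefault(f.family, TimeRange.allTime());
            if (range.overlaps(f.minTs, f.maxTs)) {
                selected.add(f);
            }
        }
        return selected;
    }

    static long totalBytes(List<StoreFile> files) {
        long total = 0;
        for (StoreFile f : files) {
            total += f.sizeBytes;
        }
        return total;
    }

    /** Made-up layout: small family A, large family B with one recent file. */
    static List<StoreFile> sampleFiles() {
        List<StoreFile> files = new ArrayList<>();
        files.add(new StoreFile("A", "a1", 0L, 999L, 10L));
        files.add(new StoreFile("A", "a2", 1000L, 2000L, 10L));
        files.add(new StoreFile("B", "b1", 0L, 999L, 1000L));
        files.add(new StoreFile("B", "b2", 1000L, 1899L, 1000L));
        files.add(new StoreFile("B", "b3", 1900L, 2000L, 50L));
        return files;
    }

    public static void main(String[] args) {
        List<StoreFile> files = sampleFiles();

        // Option 1 (Filter): every file is opened and read; old cells are dropped afterwards.
        Map<String, TimeRange> noRanges = new HashMap<>();
        System.out.println("bytes read, filter only: " + totalBytes(selectFiles(files, noRanges)));

        // Option 4: all of A, but only B cells with timestamp >= 1900.
        Map<String, TimeRange> ranges = new HashMap<>();
        ranges.put("B", new TimeRange(1900L, Long.MAX_VALUE));
        System.out.println("bytes read, per-family range: " + totalBytes(selectFiles(files, ranges)));
    }
}
```

With this made-up layout the first line reports 2070 bytes and the second 70, since b1 and b2 are excluded on their metadata alone. How much a real table would save depends on how compaction has mixed old and new cells within the store files of B, which is why tiered compaction matters for option 4.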
