Re: scan column families with different time ranges

Vladimir Rodionov Sat, 01 Aug 2015 19:19:19 -0700

I think TimeRange is handled higher up, when the region scanner is created.
With data size in B 100x smaller than in A, I do not understand where the
source of the IO bottleneck is.
On Aug 1, 2015 9:16 AM, "Andrew Purtell" <[email protected]> wrote:


> Hi Dave,
>
> > Would HBase be willing to accept updating Scan to have different
> > TimeRanges for each column family?
>
> We could try it. I'm not sure how familiar you are with the relevant code.
> I'm guessing some? Look at ScanQueryMatcher. This and related concerns
> govern how we search through store files. TimeRange handling is done at the
> top level (the SQM). Then for each column we have a leaf tracker
> (implementing ColumnTracker) that tracks column-specific info, like the
> number of versions of a cell found in each. We'd need to push TimeRange
> handling down into the column trackers. This would be a tricky refactor of
> delicate code. I suspect we could be comfortable making this change in
> master and on branch-1 for the upcoming, not yet scheduled, minor release
> line 1.3. Would that work? Or would this change need to go further back?
>
> Maybe someone else has another suggestion.
>
>
> On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <[email protected]> wrote:
>
> > I have a table with 2 column families, call them A and B, with new data
> > regularly being added. They are very different sizes: B is 100x the size
> > of A.  Among other uses for this data, I have a MapReduce job that needs
> > to read all of A, but only recent data from B (e.g. last day).  Here are
> > some methods I've considered:
> >
> >    1. Use a Filter to throw out older data from B (this is what I
> >    currently do).  However, all the data from B still needs to be read
> >    from disk, causing a disk IO bottleneck.
> >    2. Configure the table input format to read from B only, using a
> >    TimeRange for recent data, and have each map task open a separate
> >    scanner for A (without a TimeRange), then merge the data in the map
> >    task.  However, this adds complexity to the job and gives up the
> >    atomicity/consistency guarantees as new writes hit both column
> >    families.
> >    3. Add a new column family C to the table with an additional copy of
> >    the data in B, but set a TTL on it.  Every write to B is also written
> >    to C.  Change the scan to include C instead of B.  However, this adds
> >    all the overhead of another column family, more writes, and having to
> >    set the TTL to the maximum of any time window I want to scan
> >    efficiently.
> >    4. Implement an enhancement to HBase's Scan to allow giving each
> >    column family its own TimeRange.  The job would then be able to skip
> >    most old large store files (hopefully all of them with tiered
> >    compaction at some point).
> >
> > Does anyone have other suggestions?  Would HBase be willing to accept
> > updating Scan to have different TimeRanges for each column family?
> >
> >
> > Dave
> >
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
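
To make option 4 from the original message concrete, here is a rough
sketch of the store-file skipping it hopes for. This is a toy model in
plain Java, not HBase code: PerFamilyTimeRangeSketch, Range, StoreFileInfo
and filesToRead are all hypothetical names, and it assumes each store file
knows the minimum and maximum timestamp of the cells it holds. A family
with no range configured is read in full; a family with a range only reads
the files whose timestamps overlap it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of per-family time ranges. Hypothetical names, not HBase classes. */
class PerFamilyTimeRangeSketch {

    /** Half-open timestamp interval [min, max). */
    static final class Range {
        final long min;
        final long max;

        Range(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** Stand-in for a store file: its family and the (inclusive) timestamps it spans. */
    static final class StoreFileInfo {
        final String name;
        final String family;
        final long minTs;
        final long maxTs;

        StoreFileInfo(String name, String family, long minTs, long maxTs) {
            this.name = name;
            this.family = family;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /**
     * Returns the files a scan would have to open. A family without an entry
     * in rangeByFamily is unrestricted, so all of its files are kept.
     */
    static List<StoreFileInfo> filesToRead(List<StoreFileInfo> files,
                                           Map<String, Range> rangeByFamily) {
        List<StoreFileInfo> selected = new ArrayList<>();
        for (StoreFileInfo f : files) {
            Range r = rangeByFamily.get(f.family);
            // Keep the file if its [minTs, maxTs] overlaps the family's [min, max).
            boolean overlaps = r == null || (f.maxTs >= r.min && f.minTs < r.max);
            if (overlaps) {
                selected.add(f);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long now = 1_438_400_000_000L; // arbitrary "current" time in ms

        List<StoreFileInfo> files = new ArrayList<>();
        files.add(new StoreFileInfo("a-old", "A", now - 30 * day, now - 2 * day));
        files.add(new StoreFileInfo("b-old", "B", now - 30 * day, now - 2 * day));
        files.add(new StoreFileInfo("b-new", "B", now - day / 2, now));

        // Read all of A, but only the last day of B.
        Map<String, Range> ranges = new HashMap<>();
        ranges.put("B", new Range(now - day, Long.MAX_VALUE));

        for (StoreFileInfo f : filesToRead(files, ranges)) {
            System.out.println(f.name);
        }
    }
}
```

With the sample data in main, the old file in B is skipped while the old
file in A is still read, which is the behaviour the MapReduce job wants.
Where the per-family check would actually live inside HBase (the region
scanner, the SQM, or the column trackers) is the open question discussed
above; the sketch takes no position on that.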
