Thanks for looking at the code. Recent improvement in this area was: HBASE-8063 Filter HFiles based on first/last key
Cheers

On Fri, Mar 15, 2013 at 7:05 AM, Pankaj Gupta <[email protected]> wrote:

> Hi Ted,
>
> Thanks for the response, it does look very relevant. Here's my
> understanding (from looking at the relevant code in the patch and around it):
> Each StoreFile knows the range of value timestamps that it contains, and
> this range is kept in its metadata. When the store file is loaded, the range
> is available in the TimeRangeTracker object. When queries with a time range
> are made to a StoreFile, it filters them based on its knowledge of the range
> of timestamps it contains. Thus, if the time range in the query doesn't
> overlap with the time range of the store file, the store file quickly
> returns nothing without having to go through its entire contents. This would
> mean that on a rowKey + timeRange query, all StoreFiles corresponding to the
> rowKey range will be hit, but the ones that don't have an overlapping time
> range will only result in a metadata lookup.
>
> Please correct me if I am wrong.
>
> Thanks Again,
> Pankaj
>
> On Thu, Mar 14, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
>
> > What you are asking looks similar to this:
> > HBASE-5010 Filter HFiles based on TTL
> >
> > It went into 0.94.0
> >
> > Cheers
> >
> > On Thu, Mar 14, 2013 at 3:53 PM, Pankaj Gupta <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I have a question regarding query performance for rows greater than a
> > > timestamp. The use case is this:
> > > I want to find all the rows in a key range that have changed after a
> > > certain timestamp and up to a certain timestamp, i.e. exactly using this
> > > Scan API:
> > >
> > > Scan setTimeRange(long minStamp, long maxStamp)
> > >     Get versions of columns only within the specified timestamp
> > >     range, [minStamp, maxStamp)
> > >
> > > Would this query go through all the rows in the key range, or is there an
> > > optimization that makes it faster?
> > >
> > > I ask because I read about such an optimization in the following paper:
> > >
> > > http://oss.csie.fju.edu.tw/~tzu98/Apache%20Hadoop%20Goes%20Realtime%20at%20Facebook.pdf
> > >
> > > Here is the excerpt:
> > > "For data stored in HBase that is time-series or contains a specific,
> > > known timestamp, a special timestamp file selection algorithm
> > > was added. Since time moves forward and data is rarely inserted
> > > at a significantly later time than its timestamp, each HFile will
> > > generally contain values for a fixed range of time. This
> > > information is stored as metadata in each HFile and queries that
> > > ask for a specific timestamp or range of timestamps will check if
> > > the request intersects with the ranges of each file, skipping those
> > > which do not overlap."
> > >
> > > This would work perfectly for my use case, but I don't know if this
> > > optimization, or any other for this use case, exists in Apache HBase.
> > > The version of Apache HBase we are currently using is 0.92.1, but we
> > > are considering moving to 0.94.
> > >
> > > Thanks,
> > > Pankaj
>
> --
> Pankaj Gupta | Software Engineer
> BrightRoll, Inc.
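
For readers following the thread, the file-selection idea discussed above (skip any file whose recorded timestamp range does not intersect the query range) can be sketched in a few lines of plain Java. This is only an illustration, not HBase's code: the class `TimeRangeSketch`, the type `FileMeta`, and the methods `overlaps` and `filesToRead` are hypothetical names made up for this note. The query range is treated as half-open, [minStamp, maxStamp), as in the setTimeRange description quoted above; treating the per-file range as inclusive on both ends is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: these types are hypothetical and are NOT HBase classes.
final class TimeRangeSketch {

    /** Per-file metadata: smallest and largest cell timestamp in the file (assumed inclusive). */
    static final class FileMeta {
        final String name;
        final long minTs;
        final long maxTs;

        FileMeta(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /** True if the file's [minTs, maxTs] intersects the query's half-open [minStamp, maxStamp). */
    static boolean overlaps(FileMeta file, long minStamp, long maxStamp) {
        return file.minTs < maxStamp && file.maxTs >= minStamp;
    }

    /** Names of the files that still have to be read; all others are skipped using metadata alone. */
    static List<String> filesToRead(List<FileMeta> files, long minStamp, long maxStamp) {
        List<String> selected = new ArrayList<>();
        for (FileMeta file : files) {
            if (overlaps(file, minStamp, maxStamp)) {
                selected.add(file.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<FileMeta> files = Arrays.asList(
                new FileMeta("a", 100L, 200L),
                new FileMeta("b", 201L, 300L),
                new FileMeta("c", 400L, 500L));
        // Query [250, 400): only file "b" overlaps, so "a" and "c" are skipped.
        System.out.println(filesToRead(files, 250L, 400L));
    }
}
```

Note that file "c" is skipped even though its smallest timestamp equals maxStamp, because the upper bound of the query range is exclusive.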
