No, I was using reverse scan to mimic the behavior of reverse get.
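Concretely, each such lookup does roughly the following (a trimmed sketch against the 0.98 client API; the class name, table handling, and row key are placeholders based on my description below, not my exact code):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReverseGet {

        // "Reverse get": return the first row at or before startRow, or null if none.
        static Result reverseGet(HTable table, byte[] startRow) throws IOException {
            Scan scan = new Scan(startRow);
            scan.setReversed(true);   // walk backwards through the keyspace
            scan.setCaching(1);       // only one row is ever needed
            ResultScanner scanner = table.getScanner(scan);
            try {
                return scanner.next();
            } finally {
                scanner.close();
            }
        }

        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "delta");   // table name taken from the thread below
            try {
                Result r = reverseGet(table, Bytes.toBytes(args[0]));
                System.out.println(r == null ? "no row found" : r);
            } finally {
                table.close();
            }
        }
    }

So it is a scan, but bounded to a single next() call, which is why I was treating it as equivalent to a get that walks backwards.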
----------------------------------------
> Date: Thu, 2 Oct 2014 10:50:15 -0700
> Subject: Re: HBase read performance
> From: yuzhih...@gmail.com
> To: user@hbase.apache.org
>
> Khaled:
> Were you using this method from HTable?
>
> public Result[] get(List<Get> gets) throws IOException {
>
> Cheers
>
> On Thu, Oct 2, 2014 at 10:46 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
>
>> I've set the heap size to 6GB and I do have GC logging. No long pauses
>> there -- occasional 0.1s or 0.2s.
>>
>> Other than the discrepancy between what's reported on the client and
>> what's reported at the RS, there is also the issue of not getting proper
>> concurrency. Even if a reverse get takes 100ms or so (this has to be mostly
>> blocking on various things, as no physical resource is contended), the
>> other gets/scans should be able to proceed in parallel, so a thousand
>> concurrent gets/scans should finish in a few hundred ms, not many seconds.
>> That's why I thought I'd increase the handler count to try to get more
>> concurrency, but it didn't help. So there must be something else.
>>
>> Khaled
>>
>> ----------------------------------------
>>> From: ndimi...@gmail.com
>>> Date: Thu, 2 Oct 2014 10:36:39 -0700
>>> Subject: Re: HBase read performance
>>> To: user@hbase.apache.org
>>>
>>> Do check again on the heap size of the region servers. The default
>>> unconfigured size is 1G; too small for much of anything. Check your RS logs
>>> -- look for lines produced by the JvmPauseMonitor thread. They usually
>>> correlate with long GC pauses or other process-freeze events.
>>>
>>> Get is implemented as a Scan of a single row, so a reverse scan of a
>>> single row should be functionally equivalent.
>>>
>>> In practice, I have seen a discrepancy between the latencies reported by
>>> the RS and the latencies experienced by the client. I've not investigated
>>> this area thoroughly.
>>>
>>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
>>>
>>>> Thanks Lars for your quick reply.
>>>>
>>>> Yes, performance is similar with fewer handlers (I tried with 100 first).
>>>>
>>>> The payload is not big, ~1KB or so. The working set doesn't seem to fit
>>>> in memory, as there are many cache misses. However, disk is far from being
>>>> a bottleneck; I checked using iostat. I also verified that neither the
>>>> network nor the CPU of the region server or the client is a bottleneck.
>>>> This leads me to believe that this is likely a software bottleneck,
>>>> possibly due to a misconfiguration on my side. I just don't know how to
>>>> debug it. A clear disconnect I see is the individual request latency as
>>>> reported by metrics on the region server (IPC processCallTime vs scanNext)
>>>> vs what's measured on the client. Does this sound right? Any ideas on how
>>>> to better debug it?
>>>>
>>>> About the trick with the timestamps to allow forward scanning, thanks for
>>>> pointing it out; I am aware of it. The problem I have is that sometimes I
>>>> want the key after a particular timestamp and sometimes I want the key
>>>> before it, so relying on the key order alone doesn't work. Ideally, I want
>>>> a reverse get(). I thought a reverse scan could do the trick, though.
>>>>
>>>> Khaled
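(For reference, the timestamp trick being discussed is roughly the following -- a sketch only; the entity/timestamp key layout here is an assumption for illustration, not my actual schema:)

    import org.apache.hadoop.hbase.util.Bytes;

    public class InvertedTimeKey {

        // Row key = <entityId><Long.MAX_VALUE - timestamp>, so newer entries sort
        // first and a plain forward scan returns "the latest entry at or before t"
        // as its first result.
        static byte[] makeKey(String entityId, long timestampMillis) {
            return Bytes.add(Bytes.toBytes(entityId),
                             Bytes.toBytes(Long.MAX_VALUE - timestampMillis));
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            byte[] newer = makeKey("sensor-42", now);
            byte[] older = makeKey("sensor-42", now - 60000L);
            // With the inversion, the newer key sorts before the older one.
            System.out.println(Bytes.compareTo(newer, older) < 0);   // prints true
        }
    }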
>>>>
>>>> ----------------------------------------
>>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
>>>>> From: la...@apache.org
>>>>> Subject: Re: HBase read performance
>>>>> To: user@hbase.apache.org
>>>>>
>>>>> Hi Khaled,
>>>>> is it the same with fewer threads? 1500 handler threads seems to be a
>>>>> lot. Typically a good number of threads depends on the hardware (number
>>>>> of cores, number of spindles, etc.). I cannot think of any scenario where
>>>>> more than 100 would give any improvement.
>>>>>
>>>>> How large is the payload per KV retrieved that way? If it is large (as in
>>>>> a few 100k) you definitely want to lower the number of handler threads.
>>>>> How much heap do you give the region server? Does the working set fit
>>>>> into the cache? (i.e. in the metrics, do you see the eviction count going
>>>>> up? If so, it does not fit into the cache.)
>>>>>
>>>>> If the working set does not fit into the cache (eviction count goes up),
>>>>> then HBase will need to bring a new block in from disk on each Get
>>>>> (assuming the Gets are more or less random as far as the server is
>>>>> concerned).
>>>>> In that case you'll benefit from reducing the HFile block size (from 64k
>>>>> to 8k or even 4k).
>>>>>
>>>>> Lastly, I don't think we tested the performance of using reverse scan
>>>>> this way; there is probably room to optimize it.
>>>>> Can you restructure your keys to allow forward scanning? For example you
>>>>> could store the time as MAX_LONG-time. Or you could invert all the bits
>>>>> of the time portion of the key, so that it sorts the other way. Then you
>>>>> could do a forward scan.
>>>>>
>>>>> Let us know how it goes.
>>>>>
>>>>> -- Lars
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>>>>> Cc:
>>>>> Sent: Thursday, October 2, 2014 12:12 AM
>>>>> Subject: HBase read performance
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to do a scatter/gather on HBase (0.98.6.1), where I have a
>>>>> client reading ~1000 keys from an HBase table. These keys happen to fall
>>>>> on the same region server. For my reads I use a reverse scan for each key,
>>>>> as I want the key prior to a specific timestamp (timestamps are stored in
>>>>> reverse order). I don't believe gets can accomplish that, right? So I use
>>>>> scan, with caching set to 1.
>>>>>
>>>>> I use 2000 reader threads in the client, and on HBase I've set
>>>>> hbase.regionserver.handler.count to 1500. With this setup, my
>>>>> scatter/gather is very slow and can take up to 10s in total. Timing an
>>>>> individual getScanner(..) call on the client side, it can easily take a
>>>>> few hundred ms.
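(For concreteness, the client side of that scatter/gather is essentially the following -- a heavily trimmed sketch; the pool size, key loading, and error handling are stand-ins rather than my exact code:)

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class ScatterGather {

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // One shared connection; each worker borrows a lightweight table handle from it.
            final HConnection connection = HConnectionManager.createConnection(conf);
            ExecutorService pool = Executors.newFixedThreadPool(200);   // pool size is a placeholder
            try {
                List<Future<Result>> futures = new ArrayList<Future<Result>>();
                for (final byte[] startRow : loadStartKeys()) {         // ~1000 start keys
                    futures.add(pool.submit(new Callable<Result>() {
                        public Result call() throws IOException {
                            HTableInterface table = connection.getTable("delta");
                            try {
                                Scan scan = new Scan(startRow);
                                scan.setReversed(true);   // first row at or before startRow
                                scan.setCaching(1);
                                ResultScanner scanner = table.getScanner(scan);
                                try {
                                    return scanner.next();
                                } finally {
                                    scanner.close();
                                }
                            } finally {
                                table.close();
                            }
                        }
                    }));
                }
                for (Future<Result> f : futures) {
                    f.get();   // gather; real code would collect and merge the Results
                }
            } finally {
                pool.shutdown();
                connection.close();
            }
        }

        // Placeholder: supply the ~1000 row keys to look up.
        static List<byte[]> loadStartKeys() {
            return new ArrayList<byte[]>();
        }
    }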
>>>>> I also got the following metrics from the region server in question:
>>>>>
>>>>> "queueCallTime_mean" : 2.190855525775637,
>>>>> "queueCallTime_median" : 0.0,
>>>>> "queueCallTime_75th_percentile" : 0.0,
>>>>> "queueCallTime_95th_percentile" : 1.0,
>>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
>>>>>
>>>>> "processCallTime_min" : 0,
>>>>> "processCallTime_max" : 12755,
>>>>> "processCallTime_mean" : 105.64873440912682,
>>>>> "processCallTime_median" : 0.0,
>>>>> "processCallTime_75th_percentile" : 2.0,
>>>>> "processCallTime_95th_percentile" : 7917.95,
>>>>> "processCallTime_99th_percentile" : 8876.89,
>>>>>
>>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min" : 89,
>>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max" : 11300,
>>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean" : 654.4949739797315,
>>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median" : 101.0,
>>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile" : 101.0,
>>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile" : 101.0,
>>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile" : 113.0,
>>>>>
>>>>> Where "delta" is the name of the table I am querying.
>>>>>
>>>>> In addition to all this, I monitored the hardware resources (CPU, disk,
>>>>> and network) of both the client and the region server, and nothing seems
>>>>> anywhere near saturation. So I am puzzled by what's going on and where
>>>>> this time is going.
>>>>>
>>>>> A few things to note based on the above measurements: both medians of
>>>>> IPC processCallTime and queueCallTime are basically zero (ms, I presume,
>>>>> right?). However, scanNext_median is 101 (ms too, right?). I am not sure
>>>>> how this adds up. Also, even though the 101 figure seems outrageously
>>>>> high and I don't know why, all these scans should still be happening in
>>>>> parallel, so the overall call should finish fast, given that no hardware
>>>>> resource is contended, right? But this is not what's happening, so I have
>>>>> to be missing something(s).
>>>>>
>>>>> So, any help is appreciated.
>>>>>
>>>>> Thanks,
>>>>> Khaled