Khaled:
Do you have a profiler such as JProfiler? Profiling would give us more hints.
Otherwise, capturing stack traces during the period of the reverse scans would help.
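For what it's worth, here is a minimal sketch of sampling stack traces from inside a JVM while the slow reverse scans are in flight (running jstack against the region server or client PID a few times accomplishes the same thing); it uses only standard java.lang.management and assumes nothing about the HBase setup:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class StackSampler {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Take a handful of samples; threads that show the same frames in every
        // sample are where the time is going.
        for (int sample = 0; sample < 5; sample++) {
            for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
                System.out.println(info.getThreadName() + " [" + info.getThreadState() + "]");
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
            Thread.sleep(1000);   // one dump per second
        }
    }
}
```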
Cheers

On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <la...@apache.org> wrote:

> You might have the data in the OS buffer cache. Without short circuit reading, the region server has to request the block from the data node process, which then reads it from the block cache. That is a few context switches per RPC that do not show up in CPU metrics. In that case you also would not see disk I/O.
>
> If - as you say - you see a lot of evicted blocks, the data *has* to come from the OS. If you do not see disk I/O, it *has* to come from the OS cache. I.e. there's more RAM on your boxes, and you should increase the heap and block cache.
>
> You can measure the context switches with vmstat. Other than that I have no suggestion until I reproduce the problem. Also check the data locality index of the region server; it should be close to 100%.
>
> -- Lars
>
> ________________________________
> From: Khaled Elmeleegy <kd...@hotmail.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Sent: Thursday, October 2, 2014 3:24 PM
> Subject: RE: HBase read performance
>
> Lars, thanks a lot for all the tips. I'll make sure I cover all of them and get back to you. I am not sure they are the bottleneck though, as they are all about optimizing physical resource usage. As I said, I don't see any contended physical resources now. I'll also try to reproduce this problem in a simpler environment and pass you the test program to play with.
>
> A couple of high-level points to make. You are right that my use case is kind of a worst case for HBase reads. But if things went the way you described them, there should be tons of disk I/O, and that should clearly be the bottleneck. This is not the case, though. That's for the simple reason that this is done in a test environment (I am still prototyping), and not a lot of data has been written to HBase yet. However, for the real use case there will be writers constantly writing data to HBase and readers occasionally doing this scatter/gather. At steady state, things should only get worse and all the issues you mentioned should become far more pronounced. At that point, one can try to mitigate them with more memory or so. I am not there yet, as I think I am hitting some software bottleneck which I don't know how to work around.
>
> Khaled
>
> ----------------------------------------
> > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > From: la...@apache.org
> > Subject: Re: HBase read performance
> > To: user@hbase.apache.org
> >
> > OK... We might need to investigate this. Any chance you can provide a minimal test program and instructions on how to set it up? We can do some profiling then.
> >
> > One thing to note is that when scanning, HBase cannot use bloom filters to rule out HFiles ahead of time; it needs to look into all of them. So you kind of hit the absolute worst case:
> > - random reads that do not fit into the block cache
> > - cannot use bloom filters
> >
> > A few more questions/comments:
> > - Do you have short circuit reading enabled? If not, you should.
> > - Is your table major compacted? That will reduce the number of files to look at.
> > - Did you disable Nagle's algorithm everywhere (i.e. enabled tcpnodelay)? It is disabled by default in HBase, but not necessarily in your install of HDFS.
> > - Which version of HDFS are you using as the backing filesystem?
> > - If your disk is idle, it means the data fits into the OS buffer cache. In turn that means you can increase the heap for the region servers. You can also use block encoding (FAST_DIFF) to try to make sure the entire working set fits into the cache.
> > - Also try to reduce the block size - although if your overall working set does not fit in the heap it won't help much.
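To make the column-family side of this checklist concrete, here is a rough sketch against the 0.98-era Java admin API. The family name "d" is a placeholder, and note that short circuit reads are enabled on the HDFS side (dfs.client.read.shortcircuit plus dfs.domain.socket.path in hdfs-site.xml), not through this API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.util.Bytes;

public class TuneDeltaFamily {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        TableName table = TableName.valueOf("delta");            // table name from this thread

        // Fetch the existing family descriptor ("d" is a placeholder) and adjust it.
        HTableDescriptor desc = admin.getTableDescriptor(table);
        HColumnDescriptor family = desc.getFamily(Bytes.toBytes("d"));
        family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF); // denser blocks, more of the working set cached
        family.setBlocksize(8 * 1024);                            // 8k blocks instead of the 64k default

        admin.disableTable(table);
        admin.modifyColumn(table, family);
        admin.enableTable(table);

        // Rewrite store files so the new block size takes effect and reads consult fewer HFiles.
        admin.majorCompact("delta");
        admin.close();
    }
}
```

Smaller blocks and FAST_DIFF trade a little decode CPU for a much better cache hit rate when the reads are random.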
> > This is a good section of the book to read through generally (even though you might know most of this already): http://hbase.apache.org/book.html#perf.configurations
> >
> > -- Lars
> >
> > ----- Original Message -----
> > From: Khaled Elmeleegy <kd...@hotmail.com>
> > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > Cc:
> > Sent: Thursday, October 2, 2014 11:27 AM
> > Subject: RE: HBase read performance
> >
> > I do see a very brief spike in CPU (user/system), but it's nowhere near 0% idle. It goes from 99% idle down to something like 40% idle for a second or so. The thing to note: this is all on a test cluster, so no real load. Things are generally idle until I issue 2-3 of these multi-scan requests to render a web page. Then you see the spike in the CPU and some activity on the network and disk, but nowhere near saturation.
> >
> > If there are specific tests you'd like me to do to debug this, I'd be more than happy to do them.
> >
> > Khaled
> >
> > ----------------------------------------
> >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> >> From: la...@apache.org
> >> Subject: Re: HBase read performance
> >> To: user@hbase.apache.org
> >>
> >> I still think you're waiting on disk. No IOWAIT? So the CPU is not waiting a lot for I/O. No high user/system CPU either?
> >>
> >> If you see a lot of evicted blocks, then each RPC has a high chance of having to bring an entire 64k block in. You'll see bad performance with this.
> >>
> >> We might need to trace this specific scenario.
> >>
> >> -- Lars
> >>
> >> ________________________________
> >> From: Khaled Elmeleegy <kd...@hotmail.com>
> >> To: "user@hbase.apache.org" <user@hbase.apache.org>
> >> Sent: Thursday, October 2, 2014 10:46 AM
> >> Subject: RE: HBase read performance
> >>
> >> I've set the heap size to 6GB and I do have GC logging. No long pauses there -- occasional 0.1s or 0.2s.
> >>
> >> Other than the discrepancy between what's reported on the client and what's reported at the RS, there is also the issue of not getting proper concurrency. So even if a reverse get takes 100ms or so (this has to be mostly blocking on various things, as no physical resource is contended), the other gets/scans should be able to proceed in parallel, so a thousand concurrent gets/scans should finish in a few hundred ms, not many seconds. That's why I thought I'd increase the handler count to try to get more concurrency, but it didn't help. So there must be something else.
> >>
> >> Khaled
> >>
> >> ----------------------------------------
> >>> From: ndimi...@gmail.com
> >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> >>> Subject: Re: HBase read performance
> >>> To: user@hbase.apache.org
> >>>
> >>> Do check again on the heap size of the region servers. The default unconfigured size is 1G; too small for much of anything. Check your RS logs -- look for lines produced by the JVMPauseMonitor thread. They usually correlate with long GC pauses or other process-freeze events.
> >>> Get is implemented as a Scan of a single row, so a reverse scan of a single row should be functionally equivalent.
> >>>
> >>> In practice, I have seen discrepancies between the latencies reported by the RS and the latencies experienced by the client. I've not investigated this area thoroughly.
> >>>
> >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
> >>>
> >>>> Thanks Lars for your quick reply.
> >>>>
> >>>> Yes, performance is similar with fewer handlers (I tried with 100 first).
> >>>>
> >>>> The payload is not big, ~1KB or so. The working set doesn't seem to fit in memory, as there are many cache misses. However, disk is far from being a bottleneck; I checked using iostat. I also verified that neither the network nor the CPU of the region server or the client is a bottleneck. This leads me to believe that this is likely a software bottleneck, possibly due to a misconfiguration on my side. I just don't know how to debug it. A clear disconnect I see is the individual request latency as reported by metrics on the region server (IPC processCallTime vs scanNext) vs what's measured on the client. Does this sound right? Any ideas on how to better debug it?
> >>>>
> >>>> About this trick with the timestamps to be able to do a forward scan, thanks for pointing it out. Actually, I am aware of it. The problem I have is that sometimes I want to get the key after a particular timestamp and sometimes I want to get the key before it, so just relying on the key order doesn't work. Ideally, I want a reverse get(). I thought a reverse scan could do the trick, though.
> >>>>
> >>>> Khaled
> >>>>
> >>>> ----------------------------------------
> >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> >>>>> From: la...@apache.org
> >>>>> Subject: Re: HBase read performance
> >>>>> To: user@hbase.apache.org
> >>>>>
> >>>>> Hi Khaled,
> >>>>>
> >>>>> Is it the same with fewer threads? 1500 handler threads seems like a lot. Typically a good number of threads depends on the hardware (number of cores, number of spindles, etc.). I cannot think of any type of scenario where more than 100 would give any improvement.
> >>>>>
> >>>>> How large is the payload per KV retrieved that way? If large (as in a few 100k), you definitely want to lower the number of handler threads. How much heap do you give the region server? Does the working set fit into the cache? (I.e. in the metrics, do you see the eviction count going up? If so, it does not fit into the cache.)
> >>>>>
> >>>>> If the working set does not fit into the cache (eviction count goes up), then HBase will need to bring a new block in from disk on each Get (assuming the Gets are more or less random as far as the server is concerned). In that case you'll benefit from reducing the HFile block size (from 64k to 8k or even 4k).
> >>>>>
> >>>>> Lastly, I don't think we have tested the performance of using reverse scans this way; there is probably room to optimize this. Can you restructure your keys to allow forward scanning? For example, you could store the time as MAX_LONG-time. Or you could invert all the bits of the time portion of the key, so that it sorts the other way. Then you could do a forward scan.
> >>>>>
> >>>>> Let us know how it goes.
> >>>>>
> >>>>> -- Lars
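To make the key trick concrete, here is a small sketch of an inverted-timestamp key layout; the <entityId><8-byte inverted timestamp> layout is an assumption for illustration, not the schema actually used in this thread:

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class InvertedTimeKeys {
    // Hypothetical row key layout: <entityId bytes><8-byte big-endian (Long.MAX_VALUE - timestamp)>.
    // Newer timestamps produce smaller suffixes, so the newest entry for an entity sorts first.
    static byte[] rowKey(String entityId, long tsMillis) {
        return Bytes.add(Bytes.toBytes(entityId), Bytes.toBytes(Long.MAX_VALUE - tsMillis));
    }

    // Forward scan whose first result is the newest entry at or before 'cutoff' for this entity.
    static Scan latestAtOrBefore(String entityId, long cutoff) {
        Scan scan = new Scan();
        scan.setStartRow(rowKey(entityId, cutoff));
        scan.setFilter(new PrefixFilter(Bytes.toBytes(entityId))); // stop once we leave this entity's keys
        scan.setCaching(1);                                        // only the first row is needed
        return scan;
    }
}
```

Note that this layout turns "newest entry at or before T" into a forward scan, but "earliest entry after T" would still need a reverse scan (or a second, non-inverted copy of the key), which is exactly the tension Khaled raises above.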
> >>>>> ----- Original Message -----
> >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> >>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
> >>>>> Cc:
> >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> >>>>> Subject: HBase read performance
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am trying to do a scatter/gather on HBase (0.98.6.1), where I have a client reading ~1000 keys from an HBase table. These keys happen to fall on the same region server. For my reads I use a reverse scan to read each key, as I want the key prior to a specific timestamp (timestamps are stored in reverse order). I don't believe gets can accomplish that, right? So I use a scan, with caching set to 1.
> >>>>>
> >>>>> I use 2000 reader threads in the client, and on HBase I've set hbase.regionserver.handler.count to 1500. With this setup, my scatter/gather is very slow and can take up to 10s in total. Timing an individual getScanner(..) call on the client side, it can easily take a few hundred ms. I also got the following metrics from the region server in question:
> >>>>>
> >>>>> "queueCallTime_mean" : 2.190855525775637,
> >>>>> "queueCallTime_median" : 0.0,
> >>>>> "queueCallTime_75th_percentile" : 0.0,
> >>>>> "queueCallTime_95th_percentile" : 1.0,
> >>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
> >>>>>
> >>>>> "processCallTime_min" : 0,
> >>>>> "processCallTime_max" : 12755,
> >>>>> "processCallTime_mean" : 105.64873440912682,
> >>>>> "processCallTime_median" : 0.0,
> >>>>> "processCallTime_75th_percentile" : 2.0,
> >>>>> "processCallTime_95th_percentile" : 7917.95,
> >>>>> "processCallTime_99th_percentile" : 8876.89,
> >>>>>
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min" : 89,
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max" : 11300,
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean" : 654.4949739797315,
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median" : 101.0,
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile" : 101.0,
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile" : 101.0,
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile" : 113.0,
> >>>>>
> >>>>> Where "delta" is the name of the table I am querying.
> >>>>>
> >>>>> In addition to all this, I monitored the hardware resources (CPU, disk, and network) of both the client and the region server, and nothing seems anywhere near saturation. So I am puzzled by what's going on and where this time is going.
> >>>>>
> >>>>> A few things to note based on the above measurements: both medians of IPC processCallTime and queueCallTime are basically zero (ms, I presume, right?). However, scanNext_median is 101 (ms too, right?). I am not sure how this adds up.
> >>>>> Also, even though the 101 figure seems outrageously high (and I don't know why), all these scans should still be happening in parallel, so the overall call should finish fast, given that no hardware resource is contended, right? But this is not what's happening, so I have to be missing something(s).
> >>>>>
> >>>>> So, any help is appreciated there.
> >>>>>
> >>>>> Thanks,
> >>>>> Khaled
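For reference, a rough sketch of the scatter/gather read pattern described in this thread (one reverse scan per key, caching set to 1, issued from a client-side thread pool). The <entityId><8-byte timestamp> key layout, the column family, and the pool size are assumptions for illustration; only the table name "delta" comes from the thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScatterGather {
    // For each entity id, fetch the newest row at or before 'cutoff' with one reverse
    // scan per key, roughly as described in this thread. Assumes a <entityId><8-byte ts>
    // row key layout; the layout and pool size are illustrative.
    public static List<Result> fetch(Configuration conf, List<String> ids, final long cutoff)
            throws Exception {
        final HConnection conn = HConnectionManager.createConnection(conf);
        ExecutorService pool = Executors.newFixedThreadPool(64);   // far fewer than 2000 client threads
        try {
            List<Future<Result>> futures = new ArrayList<Future<Result>>();
            for (final String id : ids) {
                futures.add(pool.submit(new Callable<Result>() {
                    @Override
                    public Result call() throws Exception {
                        HTableInterface table = conn.getTable(TableName.valueOf("delta"));
                        try {
                            Scan scan = new Scan();
                            scan.setReversed(true);   // walk backwards from the cutoff
                            scan.setStartRow(Bytes.add(Bytes.toBytes(id), Bytes.toBytes(cutoff)));
                            scan.setStopRow(Bytes.toBytes(id));   // exclusive; stay within this entity
                            scan.setCaching(1);                   // only one row is wanted
                            ResultScanner scanner = table.getScanner(scan);
                            try {
                                return scanner.next();            // newest row at or before the cutoff, or null
                            } finally {
                                scanner.close();
                            }
                        } finally {
                            table.close();
                        }
                    }
                }));
            }
            List<Result> results = new ArrayList<Result>();
            for (Future<Result> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
            conn.close();
        }
    }
}
```

Sharing one HConnection and bounding the pool keeps the client-side thread count far below the 2000 threads used in the test above.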