Khaled:
Do you have a profiler such as JProfiler?
Profiling would give us more hints.

Otherwise, capturing stack traces during the reverse scans would help.

Cheers

On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <la...@apache.org> wrote:

> You might have the data in the OS buffer cache. Without short circuit
> reads, the region server has to request the block from the data node
> process, which then reads it from the OS buffer cache.
> That is a few context switches per RPC that do not show up in CPU metrics.
> In that case you also would not see disk IO.
>
>
> If - as you say - you see a lot of evicted blocks, the data *has* to come
> from the OS. If you do not see disk IO, it *has* to come from the OS cache.
> I.e. there's more RAM on your boxes, and you should increase the heap block
> cache.
>
>
> You can measure the context switches with vmstat. Other than that I have
> no further suggestions until I can reproduce the problem.
> Also check the data locality index of the region server; it should be close
> to 100%.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Khaled Elmeleegy <kd...@hotmail.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Sent: Thursday, October 2, 2014 3:24 PM
> Subject: RE: HBase read performance
>
>
> Lars, thanks a lot for all the tips. I'll make sure I cover all of them
> and get back to you. I am not sure they are the bottleneck though, as they
> are all about optimizing physical resource usage. As I said, I don't see
> any contended physical resources right now. I'll also try to reproduce this
> problem in a simpler environment and pass you the test program to play
> with.
>
>
> A couple of high-level points. You are right that my use case is kind of a
> worst case for HBase reads. But if things went the way you described, there
> would be tons of disk IO and that would clearly be the bottleneck. This is
> not the case though, for the simple reason that this is a test environment
> (I am still prototyping) and not a lot of data has been written to HBase
> yet. For the real use case, there would be writers constantly writing data
> to HBase and readers occasionally doing this scatter/gather. At steady
> state, things should only get worse and all the issues you mentioned should
> get far more pronounced. At that point, one can try to mitigate them with
> more memory and so on. I am not there yet, as I think I am hitting some
> software bottleneck which I don't know how to work around.
>
> Khaled
>
>
>
>
>
>
> ----------------------------------------
> > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > From: la...@apache.org
> > Subject: Re: HBase read performance
> > To: user@hbase.apache.org
> >
> > OK... We might need to investigate this.
> > Any chance that you can provide a minimal test program and instructions
> > on how to set it up?
> > We can do some profiling then.
> >
> > One thing to note is that with scanning HBase cannot use bloom filters
> to rule out HFiles ahead of time, it needs to look into all of them.
> > So you kind of hit on the absolute worst case:
> > - random reads that do not fit into the block cache
> > - cannot use bloom filters
> >
> >
> > Few more question/comments:
> > - Do you have short circuit reading enabled? If not, you should.
> > - Is your table major compacted? That will reduce the number of files to
> look at.
> > - Did you disable Nagle's algorithm everywhere (i.e. enabled tcpnodelay)?
> > It is disabled by default in HBase, but not necessarily in your install of
> > HDFS.
> > - Which version of HDFS are you using as backing filesystem?
> > - If your disk is idle, it means the data fits into the OS buffer cache.
> > In turn that means that you could increase the heap for the region servers.
> > You can also use block encoding (FAST_DIFF) to try to make sure the entire
> > working set fits into the cache (see the sketch after this list).
> >
> > - Also try to reduce the block size - although if your overall working
> set does not fit in the heap it won't help much.
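> >
> > For the block encoding and block size, something along these lines should
> > do it with the 0.98 Java client (untested sketch; "delta" is your table
> > name, and the family name "d" is just a placeholder for yours):
> >
> >   // imports: org.apache.hadoop.conf.Configuration, org.apache.hadoop.hbase.*,
> >   // org.apache.hadoop.hbase.client.HBaseAdmin,
> >   // org.apache.hadoop.hbase.io.encoding.DataBlockEncoding,
> >   // org.apache.hadoop.hbase.util.Bytes
> >   Configuration conf = HBaseConfiguration.create();
> >   HBaseAdmin admin = new HBaseAdmin(conf);
> >   HTableDescriptor desc = admin.getTableDescriptor(TableName.valueOf("delta"));
> >   HColumnDescriptor family = desc.getFamily(Bytes.toBytes("d")); // keep existing family settings
> >   family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF); // denser blocks in the cache
> >   family.setBlocksize(8 * 1024);                            // down from the 64k default
> >   admin.disableTable("delta");     // safest route if online schema change is off
> >   admin.modifyColumn("delta", family);
> >   admin.enableTable("delta");
> >   admin.majorCompact("delta");     // rewrite existing HFiles with the new settings
> >   admin.close();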
> >
> >
> > This is a good section of the book to read through generally (even
> though you might know most of this already):
> http://hbase.apache.org/book.html#perf.configurations
> >
> >
> > -- Lars
> >
> >
> >
> > ----- Original Message -----
> > From: Khaled Elmeleegy <kd...@hotmail.com>
> > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > Cc:
> > Sent: Thursday, October 2, 2014 11:27 AM
> > Subject: RE: HBase read performance
> >
> > I do see a very brief spike in CPU (user/system), but it's nowhere near
> > 0% idle. It goes from 99% idle down to something like 40% idle for a
> > second or so. The thing to note is that this is all on a test cluster, so
> > there is no real load. Things are generally idle until I issue 2-3 of
> > these multi-scan requests to render a web page. Then you see the spike in
> > the CPU and some activity in the network and disk, but nowhere near
> > saturation.
> >
> >
> > If there are specific tests you'd like me to do to debug this, I'd be
> more than happy to do it.
> >
> >
> > Khaled
> >
> >
> >
> >
> > ----------------------------------------
> >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> >> From: la...@apache.org
> >> Subject: Re: HBase read performance
> >> To: user@hbase.apache.org
> >>
> >> I still think you're waiting on disk. No IOWAIT? So CPU is not waiting
> a lot for IO. No high User/System CPU either?
> >>
> >> If you see a lot of evicted blocks then each RPC has a high chance of
> >> having to bring an entire 64k block in. You'll see bad performance with
> >> this.
> >>
> >> We might need to trace this specific scenario.
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ________________________________
> >> From: Khaled Elmeleegy <kd...@hotmail.com>
> >> To: "user@hbase.apache.org" <user@hbase.apache.org>
> >> Sent: Thursday, October 2, 2014 10:46 AM
> >> Subject: RE: HBase read performance
> >>
> >>
> >> I've set the heap size to 6GB and I do have gc logging. No long pauses
> there -- occasional 0.1s or 0.2s.
> >>
> >> Other than the discrepancy between what's reported on the client and
> >> what's reported at the RS, there is also the issue of not getting proper
> >> concurrency. So, even if a reverse get takes 100ms or so (this has to be
> >> mostly blocking on various things, as no physical resource is contended),
> >> the other gets/scans should be able to proceed in parallel, so a thousand
> >> concurrent gets/scans should finish in a few hundred ms, not many
> >> seconds. That's why I thought I'd increase the handler count to try to
> >> get more concurrency, but it didn't help. So, there must be something
> >> else.
> >>
> >> Khaled
> >>
> >> ----------------------------------------
> >>> From: ndimi...@gmail.com
> >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> >>> Subject: Re: HBase read performance
> >>> To: user@hbase.apache.org
> >>>
> >>> Do check again on the heap size of the region servers. The default
> >>> unconfigured size is 1G; too small for much of anything. Check your RS
> logs
> >>> -- look for lines produced by the JVMPauseMonitor thread. They usually
> >>> correlate with long GC pauses or other process-freeze events.
> >>>
> >>> Get is implemented as a Scan of a single row, so a reverse scan of a
> single
> >>> row should be functionally equivalent.
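> >>>
> >>> In other words, roughly (sketch; rowKey and table as in your own code),
> >>> these end up doing the same work on the server side:
> >>>
> >>>   Get get = new Get(rowKey);
> >>>   Result viaGet = table.get(get);
> >>>
> >>>   Scan scan = new Scan(get);  // a Get becomes a single-row Scan internally
> >>>   Result viaScan = table.getScanner(scan).next();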
> >>>
> >>> In practice, I have seen discrepancy between the latencies reported by
> the
> >>> RS and the latencies experienced by the client. I've not investigated
> this
> >>> area thoroughly.
> >>>
> >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kd...@hotmail.com>
> wrote:
> >>>
> >>>> Thanks Lars for your quick reply.
> >>>>
> >>>> Yes, performance is similar with fewer handlers (I tried with 100
> >>>> first).
> >>>>
> >>>> The payload is not big, ~1KB or so. The working set doesn't seem to
> >>>> fit in memory, as there are many cache misses. However, disk is far
> >>>> from being a bottleneck; I checked using iostat. I also verified that
> >>>> neither the network nor the CPU of the region server or the client is a
> >>>> bottleneck. This leads me to believe that this is likely a software
> >>>> bottleneck, possibly due to a misconfiguration on my side. I just don't
> >>>> know how to debug it. A clear disconnect I see is the individual
> >>>> request latency as reported by metrics on the region server (IPC
> >>>> processCallTime vs scanNext) vs what's measured on the client. Does
> >>>> this sound right? Any ideas on how to better debug it?
> >>>>
> >>>> About the trick with the timestamps to enable a forward scan, thanks
> >>>> for pointing it out. Actually, I am aware of it. The problem I have is
> >>>> that sometimes I want to get the key after a particular timestamp and
> >>>> sometimes the key before it, so just relying on the key order doesn't
> >>>> work. Ideally, I want a reverse get(). I thought a reverse scan could
> >>>> do the trick though.
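> >>>>
> >>>> For the "key before" case, what I do per key is roughly the following
> >>>> (simplified sketch; startRow encodes the key plus the timestamp I am
> >>>> querying for, and table is an HTable for "delta"):
> >>>>
> >>>>   Scan scan = new Scan(startRow);    // start at the queried position
> >>>>   scan.setReversed(true);            // walk backwards through the keys
> >>>>   scan.setCaching(1);                // I only need the first row
> >>>>   ResultScanner scanner = table.getScanner(scan);
> >>>>   Result previous = scanner.next();  // first row at or before startRow
> >>>>   scanner.close();
> >>>>
> >>>> The "key after" case is the same thing without setReversed(true).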
> >>>>
> >>>> Khaled
> >>>>
> >>>> ----------------------------------------
> >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> >>>>> From: la...@apache.org
> >>>>> Subject: Re: HBase read performance
> >>>>> To: user@hbase.apache.org
> >>>>>
> >>>>> Hi Khaled,
> >>>>> is it the same with fewer threads? 1500 handler threads seems to be a
> >>>> lot. Typically a good number of threads depends on the hardware
> (number of
> >>>> cores, number of spindles, etc). I cannot think of any type of
> scenario
> >>>> where more than 100 would give any improvement.
> >>>>>
> >>>>> How large is the payload per KV retrieved that way? If large (as in
> >>>>> a few 100k) you definitely want to lower the number of handler
> >>>>> threads.
> >>>>> How much heap do you give the region server? Does the working set fit
> >>>>> into the cache? (i.e. in the metrics, do you see the eviction count
> >>>>> going up? If so, it does not fit into the cache.)
> >>>>>
> >>>>> If the working set does not fit into the cache (eviction count goes
> >>>>> up) then HBase will need to bring a new block in from disk on each Get
> >>>>> (assuming the Gets are more or less random as far as the server is
> >>>>> concerned).
> >>>>> In that case you'll benefit from reducing the HFile block size (from
> >>>>> 64k to 8k or even 4k).
> >>>>>
> >>>>> Lastly, I don't think we tested the performance of using reverse
> >>>>> scan this way; there is probably room to optimize it.
> >>>>> Can you restructure your keys to allow forward scanning? For example
> >>>>> you could store the time as MAX_LONG - time. Or you could invert all
> >>>>> the bits of the time portion of the key, so that it sorts the other
> >>>>> way. Then you could do a forward scan.
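> >>>>>
> >>>>> Something like this, as a rough untested sketch (it assumes a row key
> >>>>> of some id followed by an 8-byte timestamp, which may not match your
> >>>>> actual schema; id, ts, queryTs and conf stand for your own key id,
> >>>>> timestamps, and configuration):
> >>>>>
> >>>>>   // write side: append the inverted timestamp so newer entries sort first
> >>>>>   byte[] row = Bytes.add(Bytes.toBytes(id), Bytes.toBytes(Long.MAX_VALUE - ts));
> >>>>>
> >>>>>   // read side: a plain forward scan starting at the inverted query
> >>>>>   // timestamp returns the newest entry at or before queryTs for that id
> >>>>>   HTable table = new HTable(conf, "delta");
> >>>>>   Scan scan = new Scan(Bytes.add(Bytes.toBytes(id),
> >>>>>       Bytes.toBytes(Long.MAX_VALUE - queryTs)));
> >>>>>   scan.setFilter(new PrefixFilter(Bytes.toBytes(id))); // don't run into the next id
> >>>>>   scan.setCaching(1);
> >>>>>   ResultScanner scanner = table.getScanner(scan);
> >>>>>   Result newestBefore = scanner.next(); // null if nothing at or before queryTs
> >>>>>   scanner.close();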
> >>>>>
> >>>>> Let us know how it goes.
> >>>>>
> >>>>> -- Lars
> >>>>>
> >>>>>
> >>>>> ----- Original Message -----
> >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> >>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
> >>>>> Cc:
> >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> >>>>> Subject: HBase read performance
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am trying to do a scatter/gather on hbase (0.98.6.1), where I have
> a
> >>>> client reading ~1000 keys from an HBase table. These keys happen to
> fall on
> >>>> the same region server. For my reads I use reverse scan to read each
> key as
> >>>> I want the key prior to a specific time stamp (time stamps are stored
> in
> >>>> reverse order). I don't believe gets can accomplish that, right? so I
> use
> >>>> scan, with caching set to 1.
> >>>>>
> >>>>> I use 2000 reader threads in the client, and on HBase I've set
> >>>>> hbase.regionserver.handler.count to 1500. With this setup, my
> >>>>> scatter/gather is very slow and can take up to 10s in total. Timing an
> >>>>> individual getScanner(..) call on the client side, it can easily take
> >>>>> a few hundred ms. I also got the following metrics from the region
> >>>>> server in question:
> >>>>>
> >>>>> "queueCallTime_mean" : 2.190855525775637,
> >>>>> "queueCallTime_median" : 0.0,
> >>>>> "queueCallTime_75th_percentile" : 0.0,
> >>>>> "queueCallTime_95th_percentile" : 1.0,
> >>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
> >>>>>
> >>>>> "processCallTime_min" : 0,
> >>>>> "processCallTime_max" : 12755,
> >>>>> "processCallTime_mean" : 105.64873440912682,
> >>>>> "processCallTime_median" : 0.0,
> >>>>> "processCallTime_75th_percentile" : 2.0,
> >>>>> "processCallTime_95th_percentile" : 7917.95,
> >>>>> "processCallTime_99th_percentile" : 8876.89,
> >>>>>
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min" : 89,
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max" : 11300,
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean" : 654.4949739797315,
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median" : 101.0,
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile" : 101.0,
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile" : 101.0,
> >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile" : 113.0,
> >>>>>
> >>>>> Where "delta" is the name of the table I am querying.
> >>>>>
> >>>>> In addition to all this, I monitored the hardware resources (CPU,
> >>>>> disk, and network) of both the client and the region server, and
> >>>>> nothing seems anywhere near saturation. So I am puzzled by what's
> >>>>> going on and where this time is going.
> >>>>>
> >>>>> A few things to note based on the above measurements: both medians
> >>>>> of IPC processCallTime and queueCallTime are basically zero (ms, I
> >>>>> presume, right?). However, scanNext_median is 101 (ms too, right?). I
> >>>>> am not sure how this adds up. Also, even though the 101 figure seems
> >>>>> outrageously high and I don't know why, all these scans should still
> >>>>> be happening in parallel, so the overall call should finish fast,
> >>>>> given that no hardware resource is contended, right? But this is not
> >>>>> what's happening, so I have to be missing something(s).
> >>>>>
> >>>>> So, any help is appreciated there.
> >>>>>
> >>>>> Thanks,
> >>>>> Khaled
> >>>>
> >>>>
>
