Thanks Nicolas, Qiang.
I was able to write a simple program that reproduces the problem on a tiny 
HBase cluster on EC2. The cluster has 2 m1.large nodes. One node runs the 
master, the name node, and ZooKeeper. The other node runs a data node and a 
region server, with the heap size configured to be 6GB. There, the 1000 
parallel reverse gets (reverse scans) take 7-8 seconds. The data set is tiny 
(10M records, each a small number of bytes). As I said before, all hardware 
resources stay nearly idle throughout.
Interestingly, against a pseudo-distributed installation on my MacBook, the 
same 1000 parallel gets take ~200ms.
Any help to resolve this mystery is highly appreciated.
P.S. Please find my test program attached.
Best,
Khaled

> From: nkey...@gmail.com
> Date: Mon, 6 Oct 2014 09:40:48 +0200
> Subject: Re: HBase read performance
> To: user@hbase.apache.org
> 
> Hi,
> 
> I haven't seen it mentioned, but if I understand correctly, each scan
> returns a single row? If so, you should use Scan#setSmall to save some RPC
> calls.
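> 
> A minimal sketch of that combination (variable names are illustrative;
> these Scan methods exist in 0.98):
> 
>     Scan scan = new Scan(rowKey);      // the key to look "back" from
>     scan.setReversed(true);            // reverse scan
>     scan.setCaching(1);                // we only need one row
>     scan.setSmall(true);               // single RPC; skips the open/next/close scanner round trips
>     ResultScanner scanner = table.getScanner(scan);
>     Result prior = scanner.next();     // first row at or before rowKey
>     scanner.close();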
> 
> Cheers,
> 
> Nicolas
> 
> 
> On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <tian...@gmail.com> wrote:
> 
> > When using a separate HConnection instance, both its RpcClient instance
> > (which maintains the connections to the region servers) and its Registry
> > instance (which maintains the connection to ZooKeeper) are separate.
> >
> > See
> >
> > http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
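> >
> > (That post is about raising ZooKeeper's per-client connection cap. If the
> > client is hitting it -- the ConnectionLoss errors below look like that --
> > the change is one line in zoo.cfg on the ZooKeeper host; the value here is
> > just an example, size it to your client count:)
> >
> >     # zoo.cfg: max concurrent connections from a single client IP (the default is low)
> >     maxClientCnxns=200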
> >
> >
> >
> >
> >
> > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <kd...@hotmail.com>
> > wrote:
> >
> > > I tried creating my own pool of HConnections to use for my HBase calls,
> > > so that not all the (2K) threads share the same HConnection. However, I
> > > could only create 10 HConnections; beyond that I get ZooKeeper
> > > exceptions, please find one below. Also, with 10 HConnections, I don't
> > > see a noticeable improvement in performance so far.
> > > 2014-10-05 06:11:26,490 WARN  [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > 2014-10-05 06:11:26,490 INFO  [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 1000ms before retry #0...
> > > 2014-10-05 06:11:27,845 WARN  [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > 2014-10-05 06:11:27,849 INFO  [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 2000ms before retry #1...
> > > 2014-10-05 06:11:30,405 WARN  [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > 2014-10-05 06:11:30,405 INFO  [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 4000ms before retry #2...
> > > 2014-10-05 06:11:35,278 WARN  [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > 2014-10-05 06:11:35,279 INFO  [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 8000ms before retry #3...
> > > 2014-10-05 06:11:44,393 WARN  [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > 2014-10-05 06:11:44,393 ERROR [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists failed after 4 attempts
> > > 2014-10-05 06:11:44,394 WARN  [main] zookeeper.ZKUtil (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
> > > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> > >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > >         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
> > >         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > >         at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> > >         at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> > >         at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > >         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > >         at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> > >         at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
> > > > From: kd...@hotmail.com
> > > > To: user@hbase.apache.org
> > > > Subject: RE: HBase read performance
> > > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > > >
> > > > Lars, Ted, and Qiang,
> > > > Thanks for all the input.
> > > > Qiang: yes, all the threads are in the same client process, sharing the
> > > > same connection. And since I don't see hardware contention, maybe there
> > > > is contention on this code path. I'll try using multiple connections to
> > > > see if that alleviates the problem, and I'll report back.
> > > > Thanks again,
> > > > Khaled
> > > >
> > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > > Subject: Re: HBase read performance
> > > > > From: tian...@gmail.com
> > > > > To: user@hbase.apache.org
> > > > >
> > > > > Regarding profiling, Andrew introduced
> > > > > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > > > > months ago.
> > > > >
> > > > > processCallTime comes from RpcServer#call, so it looks good?
> > > > >
> > > > > I have a suspect: https://issues.apache.org/jira/browse/HBASE-11306
> > > > >
> > > > > How many processes do you have for your 2000 threads?
> > > > > If only 1 process, those threads will share just 1 connection to that
> > > > > region server, and there might be big contention on the RPC code path.
> > > > > For such a case, could you try using different connections?
> > > > >
> > > > > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
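> > > > >
> > > > > A minimal sketch of giving a group of threads its own connection
> > > > > (0.98-era API; error handling omitted):
> > > > >
> > > > >     // each HConnection gets its own RpcClient and ZooKeeper registry
> > > > >     HConnection conn = HConnectionManager.createConnection(conf);
> > > > >     HTableInterface t = conn.getTable("delta");
> > > > >     try {
> > > > >         // issue the gets/scans on t
> > > > >     } finally {
> > > > >         t.close();   // lightweight; the connection stays open
> > > > >     }
> > > > >     // conn.close() once the thread group is done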
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > >
> > > > > > Khaled:
> > > > > > Do you have a profiler such as JProfiler?
> > > > > > Profiling would give us more hints.
> > > > > >
> > > > > > Otherwise, capturing stack traces during the period of the reverse
> > > > > > scans would help.
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <la...@apache.org> wrote:
> > > > > >
> > > > > > > You might have the data in the OS buffer cache. Without short circuit
> > > > > > > reading, the region server has to request each block from the data
> > > > > > > node process, which then reads it from the OS cache.
> > > > > > > That is a few context switches per RPC that do not show up in CPU
> > > > > > > metrics. In that case you also would not see disk IO.
> > > > > > >
> > > > > > >
> > > > > > > If - as you say - you see a lot of evicted blocks, the data *has* to
> > > > > > > come from the OS. If you do not see disk IO, it *has* to come from
> > > > > > > the OS cache. I.e. there's more RAM on your boxes, and you should
> > > > > > > increase the heap and hence the block cache.
> > > > > > >
> > > > > > >
> > > > > > > You can measure the context switches with vmstat. Other than that I
> > > > > > > have no suggestion until I reproduce the problem.
> > > > > > > Also check the data locality index of the region server; it should be
> > > > > > > close to 100%.
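> > > > > > >
> > > > > > > For reference, short circuit reads are typically enabled with
> > > > > > > something like the following in hdfs-site.xml (the socket path is an
> > > > > > > example; match it to your install), plus a restart:
> > > > > > >
> > > > > > >     <property>
> > > > > > >       <name>dfs.client.read.shortcircuit</name>
> > > > > > >       <value>true</value>
> > > > > > >     </property>
> > > > > > >     <property>
> > > > > > >       <name>dfs.domain.socket.path</name>
> > > > > > >       <value>/var/lib/hadoop-hdfs/dn_socket</value>
> > > > > > >     </property>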
> > > > > > >
> > > > > > >
> > > > > > > -- Lars
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ________________________________
> > > > > > >  From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > > Subject: RE: HBase read performance
> > > > > > >
> > > > > > >
> > > > > > > Lars, thanks a lot for all the tips. I'll make sure I cover all of
> > > > > > > them and get back to you. I am not sure they are the bottleneck
> > > > > > > though, as they are all about optimizing physical resource usage. As
> > > > > > > I said, I don't see any contended physical resources now. I'll also
> > > > > > > try to reproduce this problem in a simpler environment and pass you
> > > > > > > the test program to play with.
> > > > > > >
> > > > > > >
> > > > > > > A couple of high level points to make. You are right that my use
> > > > > > > case is kind of a worst case for HBase reads. But if things went the
> > > > > > > way you described them, there should be tons of disk IO, and that
> > > > > > > should clearly be the bottleneck. This is not the case though, for
> > > > > > > the simple reason that this is a test environment (I am still
> > > > > > > prototyping), and not a lot of data has been written to HBase yet.
> > > > > > > In the real use case, however, there would be writers constantly
> > > > > > > writing data to HBase and readers occasionally doing this
> > > > > > > scatter/gather. At steady state, things should only get worse and
> > > > > > > all the issues you mentioned should get far more pronounced. At that
> > > > > > > point, one can try to mitigate them with more memory or so. I am not
> > > > > > > there yet, as I think I am hitting some software bottleneck which I
> > > > > > > don't know how to work around.
> > > > > > >
> > > > > > > Khaled
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ----------------------------------------
> > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > > From: la...@apache.org
> > > > > > > > Subject: Re: HBase read performance
> > > > > > > > To: user@hbase.apache.org
> > > > > > > >
> > > > > > > > OK... We might need to investigate this.
> > > > > > > > Any chance that you can provide a minimal test program and
> > > > > > > > instructions on how to set it up?
> > > > > > > > We can do some profiling then.
> > > > > > > >
> > > > > > > > One thing to note is that with scanning HBase cannot use bloom
> > > > > > > > filters to rule out HFiles ahead of time; it needs to look into
> > > > > > > > all of them.
> > > > > > > > So you kind of hit the absolute worst case:
> > > > > > > > - random reads that do not fit into the block cache
> > > > > > > > - cannot use bloom filters
> > > > > > > >
> > > > > > > >
> > > > > > > > A few more questions/comments:
> > > > > > > > - Do you have short circuit reading enabled? If not, you should.
> > > > > > > > - Is your table major compacted? That will reduce the number of
> > > > > > > > files to look at.
> > > > > > > > - Did you disable Nagle's everywhere (enable tcpnodelay)? It is
> > > > > > > > disabled by default in HBase, but not necessarily in your install
> > > > > > > > of HDFS.
> > > > > > > > - Which version of HDFS are you using as the backing filesystem?
> > > > > > > > - If your disk is idle, it means the data fits into the OS buffer
> > > > > > > > cache. In turn that means you can increase the heap for the region
> > > > > > > > servers. You can also use block encoding (FAST_DIFF) to try to
> > > > > > > > make sure the entire working set fits into the cache (see the
> > > > > > > > sketch below).
> > > > > > > > - Also try to reduce the block size - although if your overall
> > > > > > > > working set does not fit in the heap it won't help much.
> > > > > > > >
> > > > > > > >
> > > > > > > > This is a good section of the book to read through generally (even
> > > > > > > > though you might know most of this already):
> > > > > > > > http://hbase.apache.org/book.html#perf.configurations
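> > > > > > > >
> > > > > > > > A sketch of the FAST_DIFF + smaller block size suggestion (0.98
> > > > > > > > API; the 8k value is just an example):
> > > > > > > >
> > > > > > > >     HColumnDescriptor fam = new HColumnDescriptor("f");
> > > > > > > >     fam.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF); // denser blocks, more fits in cache
> > > > > > > >     fam.setBlocksize(8 * 1024);  // down from the 64k default; cheaper random reads
> > > > > > > >     admin.modifyColumn(TableName.valueOf("delta"), fam);   // then major compact to rewrite the files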
> > > > > > > >
> > > > > > > >
> > > > > > > > -- Lars
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > > > > > > Cc:
> > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > Subject: RE: HBase read performance
> > > > > > > >
> > > > > > > > I do see a very brief spike in CPU (user/system), but it's
> > > > > > > > nowhere near 0% idle. It goes from 99% idle down to something like
> > > > > > > > 40% idle for a second or so. The thing to note: this is all on a
> > > > > > > > test cluster, so there is no real load. Things are generally idle
> > > > > > > > until I issue 2-3 of these multi-scan requests to render a web
> > > > > > > > page. Then you see the spike in the CPU and some activity on the
> > > > > > > > network and disk, but nowhere near saturation.
> > > > > > > >
> > > > > > > >
> > > > > > > > If there are specific tests you'd like me to do to debug this,
> > > > > > > > I'd be more than happy to do them.
> > > > > > > >
> > > > > > > >
> > > > > > > > Khaled
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ----------------------------------------
> > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > >> From: la...@apache.org
> > > > > > > >> Subject: Re: HBase read performance
> > > > > > > >> To: user@hbase.apache.org
> > > > > > > >>
> > > > > > > >> I still think you're waiting on disk. No IOWAIT? So the CPU is
> > > > > > > >> not waiting a lot for IO. No high user/system CPU either?
> > > > > > > >>
> > > > > > > >> If you see a lot of evicted blocks, then each RPC has a high
> > > > > > > >> chance of having to bring an entire 64k block in. You'll see bad
> > > > > > > >> performance with this.
> > > > > > > >>
> > > > > > > >> We might need to trace this specific scenario.
> > > > > > > >>
> > > > > > > >> -- Lars
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> ________________________________
> > > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > >> To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > >> Subject: RE: HBase read performance
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> I've set the heap size to 6GB and I do have GC logging. No long
> > > > > > > >> pauses there -- occasional 0.1s or 0.2s.
> > > > > > > >>
> > > > > > > >> Other than the discrepancy between what's reported on the client
> > > > > > > >> and what's reported at the RS, there is also the issue of not
> > > > > > > >> getting proper concurrency. Even if a reverse get takes 100ms or
> > > > > > > >> so (it has to be mostly blocking on various things, as no
> > > > > > > >> physical resource is contended), the other gets/scans should be
> > > > > > > >> able to proceed in parallel, so a thousand concurrent gets/scans
> > > > > > > >> should finish in a few hundred ms, not many seconds. That's why I
> > > > > > > >> thought I'd increase the handler count to try to get more
> > > > > > > >> concurrency, but it didn't help. So there must be something else.
> > > > > > > >>
> > > > > > > >> Khaled
> > > > > > > >>
> > > > > > > >> ----------------------------------------
> > > > > > > >>> From: ndimi...@gmail.com
> > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > >>> To: user@hbase.apache.org
> > > > > > > >>>
> > > > > > > >>> Do check again on the heap size of the region servers. The
> > > > > > > >>> default unconfigured size is 1G; too small for much of anything.
> > > > > > > >>> Check your RS logs -- look for lines produced by the
> > > > > > > >>> JVMPauseMonitor thread. They usually correlate with long GC
> > > > > > > >>> pauses or other process-freeze events.
> > > > > > > >>>
> > > > > > > >>> Get is implemented as a Scan of a single row, so a reverse scan
> > > > > > > >>> of a single row should be functionally equivalent.
> > > > > > > >>>
> > > > > > > >>> In practice, I have seen a discrepancy between the latencies
> > > > > > > >>> reported by the RS and the latencies experienced by the client.
> > > > > > > >>> I've not investigated this area thoroughly.
> > > > > > > >>>
> > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
> > > > > > > >>>
> > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > >>>>
> > > > > > > >>>> Yes, performance is similar with fewer handlers (I tried with
> > > > > > > >>>> 100 first).
> > > > > > > >>>>
> > > > > > > >>>> The payload is not big, ~1KB or so. The working set doesn't
> > > > > > > >>>> seem to fit in memory, as there are many cache misses. However,
> > > > > > > >>>> disk is far from being a bottleneck; I checked using iostat. I
> > > > > > > >>>> also verified that neither the network nor the CPU of the
> > > > > > > >>>> region server or the client is a bottleneck. This leads me to
> > > > > > > >>>> believe that this is likely a software bottleneck, possibly due
> > > > > > > >>>> to a misconfiguration on my side. I just don't know how to
> > > > > > > >>>> debug it. A clear disconnect I see is the individual request
> > > > > > > >>>> latency as reported by metrics on the region server (IPC
> > > > > > > >>>> processCallTime vs scanNext) vs what's measured on the client.
> > > > > > > >>>> Does this sound right? Any ideas on how to better debug it?
> > > > > > > >>>>
> > > > > > > >>>> About the trick with the timestamps to enable a forward scan:
> > > > > > > >>>> thanks for pointing it out; I am actually aware of it. The
> > > > > > > >>>> problem I have is that sometimes I want the key after a
> > > > > > > >>>> particular timestamp and sometimes I want the key before it, so
> > > > > > > >>>> just relying on the key order doesn't work. Ideally, I want a
> > > > > > > >>>> reverse get(). I thought a reverse scan could do the trick
> > > > > > > >>>> though.
> > > > > > > >>>>
> > > > > > > >>>> Khaled
> > > > > > > >>>>
> > > > > > > >>>> ----------------------------------------
> > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > >>>>> From: la...@apache.org
> > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > >>>>>
> > > > > > > >>>>> Hi Khaled,
> > > > > > > >>>>> is it the same with fewer threads? 1500 handler threads seems
> > > > > > > >>>>> to be a lot. Typically a good number of threads depends on the
> > > > > > > >>>>> hardware (number of cores, number of spindles, etc). I cannot
> > > > > > > >>>>> think of any type of scenario where more than 100 would give
> > > > > > > >>>>> any improvement.
> > > > > > > >>>>>
> > > > > > > >>>>> How large is the payload per KV retrieved that way? If large
> > > > > > > >>>>> (as in a few 100k) you definitely want to lower the number of
> > > > > > > >>>>> handler threads.
> > > > > > > >>>>> How much heap do you give the region server? Does the working
> > > > > > > >>>>> set fit into the cache? (I.e. in the metrics, do you see the
> > > > > > > >>>>> eviction count going up? If so, it does not fit into the
> > > > > > > >>>>> cache.)
> > > > > > > >>>>>
> > > > > > > >>>>> If the working set does not fit into the cache (eviction count
> > > > > > > >>>>> goes up) then HBase will need to bring a new block in from
> > > > > > > >>>>> disk on each Get (assuming the Gets are more or less random as
> > > > > > > >>>>> far as the server is concerned).
> > > > > > > >>>>> In that case you'll benefit from reducing the HFile block size
> > > > > > > >>>>> (from 64k to 8k or even 4k).
> > > > > > > >>>>>
> > > > > > > >>>>> Lastly, I don't think we tested the performance of using
> > > > > > > >>>>> reverse scan this way; there is probably room to optimize it.
> > > > > > > >>>>> Can you restructure your keys to allow forward scanning? For
> > > > > > > >>>>> example, you could store the time as MAX_LONG-time. Or you
> > > > > > > >>>>> could invert all the bits of the time portion of the key, so
> > > > > > > >>>>> that it sorts the other way. Then you could do a forward scan.
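> > > > > > > >>>>>
> > > > > > > >>>>> A sketch of the MAX_LONG-time idea (keyPrefix is illustrative):
> > > > > > > >>>>> the newest entry then has the smallest key, so a plain forward
> > > > > > > >>>>> scan finds it first:
> > > > > > > >>>>>
> > > > > > > >>>>>     long ts = System.currentTimeMillis();
> > > > > > > >>>>>     byte[] invertedTs = Bytes.toBytes(Long.MAX_VALUE - ts); // bigger ts => smaller key
> > > > > > > >>>>>     byte[] rowKey = Bytes.add(keyPrefix, invertedTs);
> > > > > > > >>>>>     // a forward Scan starting at keyPrefix now hits the most recent row first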
> > > > > > > >>>>>
> > > > > > > >>>>> Let us know how it goes.
> > > > > > > >>>>>
> > > > > > > >>>>> -- Lars
> > > > > > > >>>>>
> > > > > > > >>>>>
> > > > > > > >>>>> ----- Original Message -----
> > > > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > >>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > > > > > >>>>> Cc:
> > > > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > > > > >>>>> Subject: HBase read performance
> > > > > > > >>>>>
> > > > > > > >>>>> Hi,
> > > > > > > >>>>>
> > > > > > > >>>>> I am trying to do a scatter/gather on HBase (0.98.6.1), where
> > > > > > > >>>>> I have a client reading ~1000 keys from an HBase table. These
> > > > > > > >>>>> keys happen to fall on the same region server. For my reads I
> > > > > > > >>>>> use a reverse scan to read each key, as I want the key prior
> > > > > > > >>>>> to a specific timestamp (timestamps are stored in reverse
> > > > > > > >>>>> order). I don't believe gets can accomplish that, right? So I
> > > > > > > >>>>> use a scan, with caching set to 1.
> > > > > > > >>>>>
> > > > > > > >>>>> I use 2000 reader threads in the client, and on HBase I've set
> > > > > > > >>>>> hbase.regionserver.handler.count to 1500. With this setup, my
> > > > > > > >>>>> scatter/gather is very slow and can take up to 10s in total.
> > > > > > > >>>>> Timing an individual getScanner(..) call on the client side,
> > > > > > > >>>>> it can easily take a few hundred ms. I also got the following
> > > > > > > >>>>> metrics from the region server in question:
> > > > > > > >>>>>
> > > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > > > >>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
> > > > > > > >>>>>
> > > > > > > >>>>> "processCallTime_min" : 0,
> > > > > > > >>>>> "processCallTime_max" : 12755,
> > > > > > > >>>>> "processCallTime_mean" : 105.64873440912682,
> > > > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > > > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > > > > > > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > > > > > > >>>>>
> > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min" : 89,
> > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max" : 11300,
> > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean" : 654.4949739797315,
> > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median" : 101.0,
> > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile" : 101.0,
> > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile" : 101.0,
> > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile" : 113.0,
> > > > > > > >>>>> Where "delta" is the name of the table I am querying.
> > > > > > > >>>>>
> > > > > > > >>>>> In addition to all this, I monitored the hardware resources
> > > > > > > >>>>> (CPU, disk, and network) of both the client and the region
> > > > > > > >>>>> server, and nothing seems anywhere near saturation. So I am
> > > > > > > >>>>> puzzled by what's going on and where this time is going.
> > > > > > > >>>>>
> > > > > > > >>>>> A few things to note based on the above measurements: both
> > > > > > > >>>>> medians of IPC processCallTime and queueCallTime are basically
> > > > > > > >>>>> zero (ms, I presume, right?). However, scanNext_median is 101
> > > > > > > >>>>> (ms too, right?). I am not sure how this adds up. Also, even
> > > > > > > >>>>> though the 101 figure seems outrageously high and I don't know
> > > > > > > >>>>> why, all these scans should still be happening in parallel, so
> > > > > > > >>>>> the overall call should finish fast, given that no hardware
> > > > > > > >>>>> resource is contended, right? But this is not what's
> > > > > > > >>>>> happening, so I have to be missing something(s).
> > > > > > > >>>>>
> > > > > > > >>>>> So, any help is appreciated there.
> > > > > > > >>>>>
> > > > > > > >>>>> Thanks,
> > > > > > > >>>>> Khaled
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > >
> > > > > >
> > > >
> > >
> > >
> >
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InterruptedIOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.GnuParser;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.log4j.Logger;

import com.google.common.collect.Lists;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListeningExecutorService;
import com.google.common.util.concurrent.MoreExecutors;

public class TestHBase {

        private static final int KEY_TYPES = 1000;
        private static final int KEY_COUNT = 1000;
        private static final int HBASE_TIMEOUT_MS = 60000;
        private static final int HCONNECTION_COUNT = 100/*KEY_COUNT*/;
        private static final String TABLE_NAME = "table";
        private static final String COL_FAMILY_NAME = "f";
        private static final String COL_NAME = "c";

        private static Logger logger = Logger.getLogger(TestHBase.class);

        private Random rand = new Random();
        private ListeningExecutorService service =
                        MoreExecutors.listeningDecorator(Executors.newFixedThreadPool(HCONNECTION_COUNT));
        private List<HConnection> hConnections = Lists.newArrayList();
        private Configuration hbaseConfig = HBaseConfiguration.create();

        private TestHBase() {
                this("localhost");
        }

        private TestHBase(String zkHost) {
                // Point the client at the right ZooKeeper quorum (the parameter was unused before).
                hbaseConfig.set("hbase.zookeeper.quorum", zkHost);
                for (int i = 0; i < HCONNECTION_COUNT; i++) {
                        try {
                                hConnections.add(HConnectionManager.createConnection(hbaseConfig, service));
                        } catch (IOException e) {
                                throw new RuntimeException("Failed to create hConnection", e);
                        }
                }
                fillTable(createTable());
        }
        /*
         * Create/open the table, dropping any previous copy.
         */
        private HTable createTable() {
                HBaseAdmin hAdmin = null;
                HTable table = null;

                try {
                        hAdmin = new HBaseAdmin(hbaseConfig);
                        if (hAdmin.tableExists(TABLE_NAME)) {
                                if (hAdmin.isTableEnabled(TABLE_NAME)) {
                                        hAdmin.disableTable(TABLE_NAME);
                                }
                                hAdmin.deleteTable(TABLE_NAME);
                        }
                        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf(TABLE_NAME));
                        desc.addFamily(new HColumnDescriptor(COL_FAMILY_NAME));
                        hAdmin.createTable(desc);
                        table = new HTable(hbaseConfig, TABLE_NAME);
                } catch (IOException e) {
                        throw new RuntimeException("Failed to create table " + TABLE_NAME, e);
                } finally {
                        if (hAdmin != null) {
                                try {
                                        hAdmin.close();
                                } catch (IOException e) {
                                        throw new RuntimeException("Failed to close hAdmin", e);
                                }
                        }
                }
                return table;
        }
         
        // Java-serialize the key; the serialized bytes are used directly as the row key.
        private byte[] serialize(TestKey key) {
                try {
                        ByteArrayOutputStream bas = new ByteArrayOutputStream();
                        ObjectOutputStream oos = new ObjectOutputStream(bas);
                        oos.writeObject(key);
                        oos.flush(); // make sure buffered data reaches the byte array
                        return bas.toByteArray();
                } catch (IOException e) {
                        throw new RuntimeException("serialization failed", e);
                }
        }

        private Object deserialize(byte[] buffer) {
                try {
                        return new ObjectInputStream(new ByteArrayInputStream(buffer)).readObject();
                } catch (IOException | ClassNotFoundException e) {
                        throw new RuntimeException("deserialization failed", e);
                }
        }

        private void fillTable(HTable table) {
                TestKey key = new TestKey();
                try {
                        for (int i = 0; i < 10; i++) {
                                for (int j = 0; j < KEY_TYPES; j++) {
                                        for (int k = 0; k < KEY_COUNT; k++) {
                                                /*
                                                 * Intentionally store things in increasing timestamp order,
                                                 * to make getting the most recent key require a reverse get,
                                                 * which is what we want to test.
                                                 */
                                                key.type = j;
                                                key.val = k;                // arbitrary value, chosen to be equal to the key #
                                                key.ts = System.nanoTime(); // 10 ts per key, monotonically increasing
                                                // bogus value of byte[1]
                                                table.put(new Put(serialize(key))
                                                                .add(Bytes.toBytes(COL_FAMILY_NAME), Bytes.toBytes(COL_NAME), new byte[1]));
                                        }
                                }
                        }
                } catch (RetriesExhaustedWithDetailsException | InterruptedIOException e) {
                        throw new RuntimeException("Failed to write to table: " + table.getName() + ", ", e);
                } finally {
                        try {
                                table.close();
                        } catch (IOException e) {
                                throw new RuntimeException("Failed to close table", e);
                        }
                }
        }
        
        private HTableInterface getTable() {
                try {
                        // Spread requests over the pool of connections.
                        return hConnections.get(rand.nextInt(hConnections.size())).getTable(TABLE_NAME);
                } catch (IOException e) {
                        throw new RuntimeException("Cannot get the hbase table", e);
                }
        }

        private void lookupKey(TestKey key) {
                HTableInterface table = getTable();
                ResultScanner scanner = null;

                try {
                        // Reverse small scan: fetch the single newest row at or before this key.
                        Scan scan = new Scan(serialize(key));
                        scan.setCacheBlocks(false);
                        scan.setCaching(1);
                        scan.setReversed(true);
                        scan.setSmall(true);
                        logger.debug("getting table scanner");
                        scanner = table.getScanner(scan);
                        logger.debug("got table scanner");

                        Result hResult = scanner.next();
                        logger.debug("got table record");

                        if (hResult == null)
                                throw new RuntimeException("Empty result from the hbase table");

                        TestKey hkey = (TestKey) deserialize(hResult.getRow());
                        assert (key.type == hkey.type && key.val == hkey.val && key.ts > hkey.ts);
                        // don't care about the value
                } catch (IOException e) {
                        throw new RuntimeException("Cannot communicate with the hbase table", e);
                } finally {
                        if (scanner != null) // getScanner() may have thrown before scanner was assigned
                                scanner.close();
                        try {
                                table.close();
                        } catch (IOException e) {
                                throw new RuntimeException("Cannot close the hbase table", e);
                        }
                        logger.debug("closed table");
                }
        }

        private List<TestKey> generateKeys() {
                List<TestKey> keys = Lists.newArrayList();
                int type = rand.nextInt(KEY_TYPES);
                for (int i = 0; i < KEY_COUNT; i++) {
                        TestKey key = new TestKey();
                        key.type = type;
                        key.val = i;
                        key.ts = Long.MAX_VALUE; // later than any stored ts: ask for the most recent entry
                        keys.add(key);
                }
                return keys;
        }

        private void multiQueryTable() {
                List<TestKey> keys = generateKeys();
                List<ListenableFuture<?>> keysCollectionFuture = new ArrayList<>();
                Iterator<TestKey> keysIterator = keys.iterator();
                while (keysIterator.hasNext()) {
                        final TestKey key = keysIterator.next();
                        keysCollectionFuture.add(service.submit(new Runnable() {
                                @Override
                                public void run() {
                                        lookupKey(key);
                                }
                        }));
                }
                ListenableFuture<?> allKeysFuture = Futures.successfulAsList(keysCollectionFuture);
                try {
                        allKeysFuture.get(HBASE_TIMEOUT_MS, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | InterruptedException | ExecutionException e) {
                        throw new RuntimeException("Parallel execution failed:", e);
                }
        }
        
        private void shutdown() {
                for (HConnection conn : hConnections) {
                        try {
                                conn.close();
                        } catch (IOException e) {
                                throw new RuntimeException("Failed to close hConnection", e);
                        }
                }
                service.shutdown();
        }
        
        public static void main(String[] args) {
                Options options = new Options();
                options.addOption("zkHost", true, "Zookeeper's host");
                // create the parser
                CommandLineParser parser = new GnuParser();
                CommandLine line = null;
                try {
                        // parse the command line arguments
                        line = parser.parse(options, args);
                } catch (ParseException exp) {
                        // oops, something went wrong
                        System.err.println("Parsing failed.  Reason: " + exp.getMessage());
                        System.exit(1);
                }

                TestHBase test;

                if (line.hasOption("zkHost"))
                        test = new TestHBase(line.getOptionValue("zkHost"));
                else
                        test = new TestHBase();

                for (int i = 0; i < 100; i++) {
                        long t = System.currentTimeMillis();
                        test.multiQueryTable();
                        logger.info("multi query took: " + (System.currentTimeMillis() - t) + "ms");
                }
                test.shutdown();
        }

}
// TestKey.java -- serialized instances of this class serve as the row keys.
import java.io.Serializable;

public class TestKey implements Serializable {
        private static final long serialVersionUID = -2396091830408879251L;
        public int type;
        public int val;
        public long ts;
}
