Andrew thanks. I think this indeed was the problem. To get over it, I increased the amount of memory given the region server to avoid IO on reads. I used the configs below. In my experiments, I have writers streaming their output to HBase. The reader powers a web page and does this scatter/gather, where it reads 1000 keys written last and passes them the the front end. With this workload, I get the exception below at the region server. Again, I am using HBAse (0.98.6.1). Any help is appreciated. hbase-env:HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xlog gc:/tmp/hbase-regionserver-gc.log hbase-site:hbase.bucketcache.ioengine=offheap hfile.block.cache.size=0.4 hbase.bucketcache.size=4196 hbase.rs.cacheblocksonwrite=true hfile.block.index.cacheonwrite=true hfile.block.bloom.cacheonwrite=true
2014-10-10 15:06:44,173 ERROR [B.DefaultRpcServer.handler=62,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object java.lang.IllegalArgumentException at java.nio.Buffer.position(Buffer.java:236) at org.apache.hadoop.hbase.util.ByteBufferUtils.skip(ByteBufferUtils.java:434) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.readKeyValueLen(HFileReaderV2.java:849) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:760) at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:248) at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152) at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317) at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176) at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1780) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3758) at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950) at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936) at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913) at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3157) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29587) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) at java.lang.Thread.run(Thread.java:744) > From: apurt...@apache.org > Date: Tue, 7 Oct 2014 17:09:35 -0700 > Subject: Re: HBase read performance > To: user@hbase.apache.org > > > The cluster has 2 m1.large nodes. > > That's the problem right there. > > You need to look at c3.4xlarge or i2 instances as a minimum requirement. M1 > and even M3 instance types have ridiculously poor IO. > > > On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <kd...@hotmail.com> wrote: > > > Thanks Nicolas, Qiang. > > > > I was able to write a simple program that reproduces the problem on a tiny > > HBase cluster on ec2. The cluster has 2 m1.large nodes. One node runs the > > master, name node and zookeeper. The other node runs a data node and a > > region server, with heap size configured to be 6GB. There, the 1000 > > parallel reverse gets (reverse scans) take 7-8 seconds. The data set is > > tiny (10M records, each having a small number of bytes). As I said before, > > all hardware resources are very idle there. > > > > Interestingly, running the same workload on my macbook, the 1000 parallel > > gets take ~200ms on a pseudo-distributed installation. > > > > Any help to resolve this mystery is highly appreciated. > > > > P.S. please find my test program attached. > > > > Best, > > Khaled > > > > > From: nkey...@gmail.com > > > Date: Mon, 6 Oct 2014 09:40:48 +0200 > > > Subject: Re: HBase read performance > > > To: user@hbase.apache.org > > > > > > Hi, > > > > > > I haven't seen it mentioned, but if I understand correctly each scan > > > returns a single row? If so you should use Scan#setSmall to save some rpc > > > calls. > > > > > > Cheers, > > > > > > Nicolas > > > > > > > > > On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <tian...@gmail.com> wrote: > > > > > > > when using separate HConnection instance, both its > > > > RpcClient instance(maintain connection to a regionserver) and Registry > > > > instance(maintain connection to zookeeper) will be separate.. > > > > > > > > see > > > > > > > > > > http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html > > > > > > > > > > > > > > > > > > > > > > > > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <kd...@hotmail.com> > > > > wrote: > > > > > > > > > I tried creating my own HConnections pool to use for my HBase calls, > > so > > > > > that not all the (2K) threads share the same HConnection. However, I > > > > could > > > > > only have 10 HConnections. Beyond that I get ZK exceptions, please > > find > > > > it > > > > > below. Also, with 10 HConnections, I don't see noticeable > > improvement in > > > > > performance so far. > > > > > 2014-10-05 06:11:26,490 WARN [main] zookeeper.RecoverableZooKeeper > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient > > > > > ZooKeeper, quorum=54.68.206.252:2181, > > > > > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException: > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05 > > > > 06:11:26,490 > > > > > INFO [main] util.RetryCounter > > > > (RetryCounter.java:sleepUntilNextRetry(155)) > > > > > - Sleeping 1000ms before retry #0...2014-10-05 06:11:27,845 WARN > > [main] > > > > > zookeeper.RecoverableZooKeeper > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient > > > > > ZooKeeper, quorum=54.68.206.252:2181, > > > > > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException: > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05 > > > > 06:11:27,849 > > > > > INFO [main] util.RetryCounter > > > > (RetryCounter.java:sleepUntilNextRetry(155)) > > > > > - Sleeping 2000ms before retry #1...2014-10-05 06:11:30,405 WARN > > [main] > > > > > zookeeper.RecoverableZooKeeper > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient > > > > > ZooKeeper, quorum=54.68.206.252:2181, > > > > > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException: > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05 > > > > 06:11:30,405 > > > > > INFO [main] util.RetryCounter > > > > (RetryCounter.java:sleepUntilNextRetry(155)) > > > > > - Sleeping 4000ms before retry #2...2014-10-05 06:11:35,278 WARN > > [main] > > > > > zookeeper.RecoverableZooKeeper > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient > > > > > ZooKeeper, quorum=54.68.206.252:2181, > > > > > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException: > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05 > > > > 06:11:35,279 > > > > > INFO [main] util.RetryCounter > > > > (RetryCounter.java:sleepUntilNextRetry(155)) > > > > > - Sleeping 8000ms before retry #3...2014-10-05 06:11:44,393 WARN > > [main] > > > > > zookeeper.RecoverableZooKeeper > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient > > > > > ZooKeeper, quorum=54.68.206.252:2181, > > > > > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException: > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05 > > > > 06:11:44,393 > > > > > ERROR [main] zookeeper.RecoverableZooKeeper > > > > > (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists > > failed > > > > > after 4 attempts2014-10-05 06:11:44,394 WARN [main] zookeeper.ZKUtil > > > > > (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum= > > > > > 54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on znode > > > > > > > > > > > (/hbase/hbaseid)org.apache.zookeeper.KeeperException$ConnectionLossException: > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid at > > > > > org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > > at > > > > > org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > > at > > > > > org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at > > > > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199) > > > > > at > > org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479) > > > > > at > > > > > > > > > > > org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65) > > > > > at > > > > > > > > > > > org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83) > > > > > at > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857) > > > > > at > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662) > > > > > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > > > > > Method) at > > > > > > > > > > > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > > > > > at > > > > > > > > > > > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > > > > > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > > > > at > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414) > > > > > at > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)... > > > > > > From: kd...@hotmail.com > > > > > > To: user@hbase.apache.org > > > > > > Subject: RE: HBase read performance > > > > > > Date: Fri, 3 Oct 2014 12:37:39 -0700 > > > > > > > > > > > > Lars, Ted, and Qiang, > > > > > > Thanks for all the input. > > > > > > Qiang: yes all the threads are in the same client process sharing > > the > > > > > same connection. And since I don't see hardware contention, may be > > there > > > > is > > > > > contention over this code path. I'll try using many connections and > > see > > > > if > > > > > it alleviates the problems and I'll report back. > > > > > > Thanks again,Khaled > > > > > > > > > > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800 > > > > > > > Subject: Re: HBase read performance > > > > > > > From: tian...@gmail.com > > > > > > > To: user@hbase.apache.org > > > > > > > > > > > > > > Regarding to profiling, Andrew introduced > > > > > > > > > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html > > > > > months > > > > > > > ago. > > > > > > > > > > > > > > processCallTime comes from RpcServer#call, so it looks good? > > > > > > > > > > > > > > I have a suspect: > > https://issues.apache.org/jira/browse/HBASE-11306 > > > > > > > > > > > > > > how many processes do you have for your 2000 threads? > > > > > > > if olny 1 process, those threads will share just 1 connection to > > that > > > > > > > regionserver, there might be big contention on the RPC code path. > > > > > ---for > > > > > > > such case, could you try using different connections? > > > > > > > > > > > > > > > > > > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <yuzhih...@gmail.com> > > wrote: > > > > > > > > > > > > > > > Khaled: > > > > > > > > Do you have profiler such as jprofiler ? > > > > > > > > Profiling would give us more hint. > > > > > > > > > > > > > > > > Otherwise capturing stack trace during the period of reverse > > scan > > > > > would > > > > > > > > help. > > > > > > > > > > > > > > > > Cheers > > > > > > > > > > > > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl < > > la...@apache.org> > > > > > wrote: > > > > > > > > > > > > > > > > > You might have the data in the OS buffer cache, without short > > > > > circuit > > > > > > > > > reading the region server has to request the block from the > > data > > > > > node > > > > > > > > > process, which then reads it from the block cache. > > > > > > > > > That is a few context switches per RPC that do not show up > > in CPU > > > > > > > > metrics. > > > > > > > > > In that you also would not see disk IO. > > > > > > > > > > > > > > > > > > > > > > > > > > > If - as you say - you see a lot of evicted blocks the data > > *has* > > > > > to come > > > > > > > > > from the OS. If you do not see disk IO is *has* to come from > > the > > > > OS > > > > > > > > cache. > > > > > > > > > I.e. there's more RAM on your boxes, and you should increase > > the > > > > > heap > > > > > > > > block > > > > > > > > > cache. > > > > > > > > > > > > > > > > > > > > > > > > > > > You can measure the context switches with vmstat. Other than > > that > > > > > I have > > > > > > > > > no suggestion until I reproduce the problem. > > > > > > > > > Also check the data locality index of the region server it > > should > > > > > be > > > > > > > > close > > > > > > > > > to 100%. > > > > > > > > > > > > > > > > > > > > > > > > > > > -- Lars > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ________________________________ > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com> > > > > > > > > > To: "user@hbase.apache.org" <user@hbase.apache.org> > > > > > > > > > Sent: Thursday, October 2, 2014 3:24 PM > > > > > > > > > Subject: RE: HBase read performance > > > > > > > > > > > > > > > > > > > > > > > > > > > Lars thanks a lot about all the tips. I'll make sure I cover > > all > > > > > of them > > > > > > > > > and get back to you. I am not sure they are the bottleneck > > though > > > > > as they > > > > > > > > > all are about optimizing physical resource usage. As I said, > > I > > > > > don't see > > > > > > > > > any contended physical resources now. I'll also try to > > reproduce > > > > > this > > > > > > > > > problem in a simpler environment and pass to you the test > > program > > > > > to play > > > > > > > > > with. > > > > > > > > > > > > > > > > > > > > > > > > > > > Couple of high level points to make. You are right that my > > use > > > > > case is > > > > > > > > > kind of a worst case for HBase reads. But, if things go the > > way > > > > you > > > > > > > > > described them, there should be tons of disk IO and that > > should > > > > be > > > > > > > > clearly > > > > > > > > > the bottleneck. This is not the case though. That's for the > > > > simple > > > > > reason > > > > > > > > > that this is done in a test environment (I am still > > prototyping), > > > > > and > > > > > > > > not a > > > > > > > > > lot of data is yet written to HBase. However for the real use > > > > > case, there > > > > > > > > > should writers constantly writing data to HBase and readers > > > > > occasionally > > > > > > > > > doing this scatter/gather. At steady state, things should > > only > > > > get > > > > > worse > > > > > > > > > and all the issues you mentioned should get far more > > pronounced. > > > > > At this > > > > > > > > > point, one can try to mitigate it using more memory or so. I > > am > > > > > not there > > > > > > > > > yet as I think I am hitting some software bottleneck, which I > > > > > don't know > > > > > > > > > how to work around. > > > > > > > > > > > > > > > > > > Khaled > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ---------------------------------------- > > > > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700 > > > > > > > > > > From: la...@apache.org > > > > > > > > > > Subject: Re: HBase read performance > > > > > > > > > > To: user@hbase.apache.org > > > > > > > > > > > > > > > > > > > > OK... We might need to investigate this. > > > > > > > > > > Any chance that you can provide a minimal test program and > > > > > instruction > > > > > > > > > about how to set it up. > > > > > > > > > > We can do some profiling then. > > > > > > > > > > > > > > > > > > > > One thing to note is that with scanning HBase cannot use > > bloom > > > > > filters > > > > > > > > > to rule out HFiles ahead of time, it needs to look into all > > of > > > > > them. > > > > > > > > > > So you kind of hit on the absolute worst case: > > > > > > > > > > - random reads that do not fit into the block cache > > > > > > > > > > - cannot use bloom filters > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Few more question/comments: > > > > > > > > > > - Do you have short circuit reading enabled? If not, you > > > > should. > > > > > > > > > > - Is your table major compacted? That will reduce the > > number of > > > > > files > > > > > > > > to > > > > > > > > > look at. > > > > > > > > > > - Did you disable Nagle's everywhere (enabled tcpnodelay)? > > It > > > > > disabled > > > > > > > > > by default in HBase, but necessarily in your install of HDFS. > > > > > > > > > > - Which version of HDFS are you using as backing > > filesystem? > > > > > > > > > > - If your disk is idle, it means the data fits into the OS > > > > buffer > > > > > > > > cache. > > > > > > > > > In turn that means that you increase the heap for the region > > > > > servers. You > > > > > > > > > can also use block encoding (FAST_DIFF) to try to make sure > > the > > > > > entire > > > > > > > > > working set fits into the cache. > > > > > > > > > > > > > > > > > > > > - Also try to reduce the block size - although if your > > overall > > > > > working > > > > > > > > > set does not fit in the heap it won't help much. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This is a good section of the book to read through > > generally > > > > > (even > > > > > > > > > though you might know most of this already): > > > > > > > > > http://hbase.apache.org/book.html#perf.configurations > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- Lars > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com> > > > > > > > > > > To: "user@hbase.apache.org" <user@hbase.apache.org> > > > > > > > > > > Cc: > > > > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM > > > > > > > > > > Subject: RE: HBase read performance > > > > > > > > > > > > > > > > > > > > I do see a very brief spike in CPU (user/system), but it's > > no > > > > > where > > > > > > > > near > > > > > > > > > 0% idle. It goes from 99% idle down to something like 40% > > idle > > > > for > > > > > a > > > > > > > > second > > > > > > > > > or so. The thing to note, this is all on a test cluster, so > > no > > > > > real load. > > > > > > > > > Things are generally idle until i issue 2-3 of these > > > > > multi-scan-requests > > > > > > > > to > > > > > > > > > render a web page. Then, you see the spike in the cpu and > > some > > > > > activity > > > > > > > > in > > > > > > > > > the network and disk, but nowhere near saturation. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If there are specific tests you'd like me to do to debug > > this, > > > > > I'd be > > > > > > > > > more than happy to do it. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Khaled > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ---------------------------------------- > > > > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700 > > > > > > > > > >> From: la...@apache.org > > > > > > > > > >> Subject: Re: HBase read performance > > > > > > > > > >> To: user@hbase.apache.org > > > > > > > > > >> > > > > > > > > > >> I still think you're waiting on disk. No IOWAIT? So CPU > > is not > > > > > waiting > > > > > > > > > a lot for IO. No high User/System CPU either? > > > > > > > > > >> > > > > > > > > > >> If you see a lot of evicted block then each RPC has a high > > > > > chance of > > > > > > > > > requiring to bring an entire 64k block in. You'll see bad > > > > > performance > > > > > > > > with > > > > > > > > > this. > > > > > > > > > >> > > > > > > > > > >> We might need to trace this specific scenario. > > > > > > > > > >> > > > > > > > > > >> -- Lars > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> ________________________________ > > > > > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com> > > > > > > > > > >> To: "user@hbase.apache.org" <user@hbase.apache.org> > > > > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM > > > > > > > > > >> Subject: RE: HBase read performance > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> I've set the heap size to 6GB and I do have gc logging. No > > > > long > > > > > pauses > > > > > > > > > there -- occasional 0.1s or 0.2s. > > > > > > > > > >> > > > > > > > > > >> Other than the discrepancy between what's reported on the > > > > > client and > > > > > > > > > what's reported at the RS, there is also the issue of not > > getting > > > > > proper > > > > > > > > > concurrency. So, even if a reverse get takes 100ms or so > > (this > > > > has > > > > > to be > > > > > > > > > mostly blocking on various things as no physical resource is > > > > > contended), > > > > > > > > > then the other gets/scans should be able to proceed in > > parallel, > > > > > so a > > > > > > > > > thousand concurrent gets/scans should finish in few hundreds > > of > > > > ms > > > > > not > > > > > > > > many > > > > > > > > > seconds. That's why I thought I'd increase the handlers > > count to > > > > > try to > > > > > > > > get > > > > > > > > > more concurrency, but it didn't help. So, there must be > > something > > > > > else. > > > > > > > > > >> > > > > > > > > > >> Khaled > > > > > > > > > >> > > > > > > > > > >> ---------------------------------------- > > > > > > > > > >>> From: ndimi...@gmail.com > > > > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700 > > > > > > > > > >>> Subject: Re: HBase read performance > > > > > > > > > >>> To: user@hbase.apache.org > > > > > > > > > >>> > > > > > > > > > >>> Do check again on the heap size of the region servers. > > The > > > > > default > > > > > > > > > >>> unconfigured size is 1G; too small for much of anything. > > > > Check > > > > > your > > > > > > > > RS > > > > > > > > > logs > > > > > > > > > >>> -- look for lines produced by the JVMPauseMonitor thread. > > > > They > > > > > > > > usually > > > > > > > > > >>> correlate with long GC pauses or other process-freeze > > events. > > > > > > > > > >>> > > > > > > > > > >>> Get is implemented as a Scan of a single row, so a > > reverse > > > > > scan of a > > > > > > > > > single > > > > > > > > > >>> row should be functionally equivalent. > > > > > > > > > >>> > > > > > > > > > >>> In practice, I have seen discrepancy between the > > latencies > > > > > reported > > > > > > > > by > > > > > > > > > the > > > > > > > > > >>> RS and the latencies experienced by the client. I've not > > > > > investigated > > > > > > > > > this > > > > > > > > > >>> area thoroughly. > > > > > > > > > >>> > > > > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy < > > > > > kd...@hotmail.com > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > >>> > > > > > > > > > >>>> Thanks Lars for your quick reply. > > > > > > > > > >>>> > > > > > > > > > >>>> Yes performance is similar with less handlers (I tried > > with > > > > > 100 > > > > > > > > > first). > > > > > > > > > >>>> > > > > > > > > > >>>> The payload is not big ~1KB or so. The working set > > doesn't > > > > > seem to > > > > > > > > > fit in > > > > > > > > > >>>> memory as there are many cache misses. However, disk is > > far > > > > > from > > > > > > > > > being a > > > > > > > > > >>>> bottleneck. I checked using iostat. I also verified that > > > > > neither the > > > > > > > > > >>>> network nor the CPU of the region server or the client > > are a > > > > > > > > > bottleneck. > > > > > > > > > >>>> This leads me to believe that likely this is a software > > > > > bottleneck, > > > > > > > > > >>>> possibly due to a misconfiguration on my side. I just > > don't > > > > > know how > > > > > > > > > to > > > > > > > > > >>>> debug it. A clear disconnect I see is the individual > > request > > > > > latency > > > > > > > > > as > > > > > > > > > >>>> reported by metrics on the region server (IPC > > > > processCallTime > > > > > vs > > > > > > > > > scanNext) > > > > > > > > > >>>> vs what's measured on the client. Does this sound > > right? Any > > > > > ideas > > > > > > > > on > > > > > > > > > how > > > > > > > > > >>>> to better debug it? > > > > > > > > > >>>> > > > > > > > > > >>>> About this trick with the timestamps to be able to do a > > > > > forward > > > > > > > > scan, > > > > > > > > > >>>> thanks for pointing it out. Actually, I am aware of it. > > The > > > > > problem > > > > > > > > I > > > > > > > > > have > > > > > > > > > >>>> is, sometimes I want to get the key after a particular > > > > > timestamp and > > > > > > > > > >>>> sometimes I want to get the key before, so just relying > > on > > > > > the key > > > > > > > > > order > > > > > > > > > >>>> doesn't work. Ideally, I want a reverse get(). I thought > > > > > reverse > > > > > > > > scan > > > > > > > > > can > > > > > > > > > >>>> do the trick though. > > > > > > > > > >>>> > > > > > > > > > >>>> Khaled > > > > > > > > > >>>> > > > > > > > > > >>>> ---------------------------------------- > > > > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700 > > > > > > > > > >>>>> From: la...@apache.org > > > > > > > > > >>>>> Subject: Re: HBase read performance > > > > > > > > > >>>>> To: user@hbase.apache.org > > > > > > > > > >>>>> > > > > > > > > > >>>>> Hi Khaled, > > > > > > > > > >>>>> is it the same with fewer threads? 1500 handler threads > > > > > seems to > > > > > > > > be a > > > > > > > > > >>>> lot. Typically a good number of threads depends on the > > > > > hardware > > > > > > > > > (number of > > > > > > > > > >>>> cores, number of spindles, etc). I cannot think of any > > type > > > > of > > > > > > > > > scenario > > > > > > > > > >>>> where more than 100 would give any improvement. > > > > > > > > > >>>>> > > > > > > > > > >>>>> How large is the payload per KV retrieved that way? If > > > > large > > > > > (as > > > > > > > > in a > > > > > > > > > >>>> few 100k) you definitely want to lower the number of the > > > > > handler > > > > > > > > > threads. > > > > > > > > > >>>>> How much heap do you give the region server? Does the > > > > > working set > > > > > > > > fit > > > > > > > > > >>>> into the cache? (i.e. in the metrics, do you see the > > > > eviction > > > > > count > > > > > > > > > going > > > > > > > > > >>>> up, if so it does not fit into the cache). > > > > > > > > > >>>>> > > > > > > > > > >>>>> If the working set does not fit into the cache > > (eviction > > > > > count goes > > > > > > > > > up) > > > > > > > > > >>>> then HBase will need to bring a new block in from disk > > on > > > > > each Get > > > > > > > > > >>>> (assuming the Gets are more or less random as far as the > > > > > server is > > > > > > > > > >>>> concerned). > > > > > > > > > >>>>> In case you'll benefit from reducing the HFile block > > size > > > > > (from 64k > > > > > > > > > to > > > > > > > > > >>>> 8k or even 4k). > > > > > > > > > >>>>> > > > > > > > > > >>>>> Lastly I don't think we tested the performance of using > > > > > reverse > > > > > > > > scan > > > > > > > > > >>>> this way, there is probably room to optimize this. > > > > > > > > > >>>>> Can you restructure your keys to allow forwards > > scanning? > > > > For > > > > > > > > example > > > > > > > > > >>>> you could store the time as MAX_LONG-time. Or you could > > > > > invert all > > > > > > > > > the bits > > > > > > > > > >>>> of the time portion of the key, so that it sort the > > other > > > > > way. Then > > > > > > > > > you > > > > > > > > > >>>> could do a forward scan. > > > > > > > > > >>>>> > > > > > > > > > >>>>> Let us know how it goes. > > > > > > > > > >>>>> > > > > > > > > > >>>>> -- Lars > > > > > > > > > >>>>> > > > > > > > > > >>>>> > > > > > > > > > >>>>> ----- Original Message ----- > > > > > > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com> > > > > > > > > > >>>>> To: "user@hbase.apache.org" <user@hbase.apache.org> > > > > > > > > > >>>>> Cc: > > > > > > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM > > > > > > > > > >>>>> Subject: HBase read performance > > > > > > > > > >>>>> > > > > > > > > > >>>>> Hi, > > > > > > > > > >>>>> > > > > > > > > > >>>>> I am trying to do a scatter/gather on hbase (0.98.6.1), > > > > > where I > > > > > > > > have > > > > > > > > > a > > > > > > > > > >>>> client reading ~1000 keys from an HBase table. These > > keys > > > > > happen to > > > > > > > > > fall on > > > > > > > > > >>>> the same region server. For my reads I use reverse scan > > to > > > > > read each > > > > > > > > > key as > > > > > > > > > >>>> I want the key prior to a specific time stamp (time > > stamps > > > > are > > > > > > > > stored > > > > > > > > > in > > > > > > > > > >>>> reverse order). I don't believe gets can accomplish > > that, > > > > > right? so > > > > > > > > I > > > > > > > > > use > > > > > > > > > >>>> scan, with caching set to 1. > > > > > > > > > >>>>> > > > > > > > > > >>>>> I use 2000 reader threads in the client and on HBase, > > I've > > > > > set > > > > > > > > > >>>> hbase.regionserver.handler.count to 1500. With this > > setup, > > > > my > > > > > > > > scatter > > > > > > > > > >>>> gather is very slow and can take up to 10s in total. > > Timing > > > > an > > > > > > > > > individual > > > > > > > > > >>>> getScanner(..) call on the client side, it can easily > > take > > > > few > > > > > > > > > hundreds of > > > > > > > > > >>>> ms. I also got the following metrics from the region > > server > > > > in > > > > > > > > > question: > > > > > > > > > >>>>> > > > > > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637, > > > > > > > > > >>>>> "queueCallTime_median" : 0.0, > > > > > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0, > > > > > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0, > > > > > > > > > >>>>> "queueCallTime_99th_percentile" : 556.9799999999818, > > > > > > > > > >>>>> > > > > > > > > > >>>>> "processCallTime_min" : 0, > > > > > > > > > >>>>> "processCallTime_max" : 12755, > > > > > > > > > >>>>> "processCallTime_mean" : 105.64873440912682, > > > > > > > > > >>>>> "processCallTime_median" : 0.0, > > > > > > > > > >>>>> "processCallTime_75th_percentile" : 2.0, > > > > > > > > > >>>>> "processCallTime_95th_percentile" : 7917.95, > > > > > > > > > >>>>> "processCallTime_99th_percentile" : 8876.89, > > > > > > > > > >>>>> > > > > > > > > > >>>>> > > > > > > > > > >>>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min" > > > > > > > > > >>>> : 89, > > > > > > > > > >>>>> > > > > > > > > > >>>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max" > > > > > > > > > >>>> : 11300, > > > > > > > > > >>>>> > > > > > > > > > >>>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean" > > > > > > > > > >>>> : 654.4949739797315, > > > > > > > > > >>>>> > > > > > > > > > >>>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median" > > > > > > > > > >>>> : 101.0, > > > > > > > > > >>>>> > > > > > > > > > >>>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile" > > > > > > > > > >>>> : 101.0, > > > > > > > > > >>>>> > > > > > > > > > >>>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile" > > > > > > > > > >>>> : 101.0, > > > > > > > > > >>>>> > > > > > > > > > >>>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile" > > > > > > > > > >>>> : 113.0, > > > > > > > > > >>>>> > > > > > > > > > >>>>> Where "delta" is the name of the table I am querying. > > > > > > > > > >>>>> > > > > > > > > > >>>>> In addition to all this, i monitored the hardware > > resources > > > > > (CPU, > > > > > > > > > disk, > > > > > > > > > >>>> and network) of both the client and the region server > > and > > > > > nothing > > > > > > > > > seems > > > > > > > > > >>>> anywhere near saturation. So I am puzzled by what's > > going on > > > > > and > > > > > > > > > where this > > > > > > > > > >>>> time is going. > > > > > > > > > >>>>> > > > > > > > > > >>>>> Few things to note based on the above measurements: > > both > > > > > medians of > > > > > > > > > IPC > > > > > > > > > >>>> processCallTime and queueCallTime are basically zero > > (ms I > > > > > presume, > > > > > > > > > >>>> right?). However, scanNext_median is 101 (ms too, > > right?). I > > > > > am not > > > > > > > > > sure > > > > > > > > > >>>> how this adds up. Also, even though the 101 figure seems > > > > > > > > outrageously > > > > > > > > > high > > > > > > > > > >>>> and I don't know why, still all these scans should be > > > > > happening in > > > > > > > > > >>>> parallel, so the overall call should finish fast, given > > that > > > > > no > > > > > > > > > hardware > > > > > > > > > >>>> resource is contended, right? but this is not what's > > > > > happening, so I > > > > > > > > > have > > > > > > > > > >>>> to be missing something(s). > > > > > > > > > >>>>> > > > > > > > > > >>>>> So, any help is appreciated there. > > > > > > > > > >>>>> > > > > > > > > > >>>>> Thanks, > > > > > > > > > >>>>> Khaled > > > > > > > > > >>>> > > > > > > > > > >>>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > Best regards, > > - Andy > > Problems worthy of attack prove their worth by hitting back. - Piet Hein > (via Tom White)