Any chance that you can use three servers in your ZooKeeper quorum? Cheers
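P.S. If you do add two more ZooKeeper servers, the quorum entry in hbase-site.xml would list all three hosts, roughly like this (the second and third host names below are placeholders, not real nodes in your cluster):

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>ip-10-146-188-157.ec2.internal,zk-2.ec2.internal,zk-3.ec2.internal</value>
</property>

With three servers the ensemble can keep serving clients through the loss of any single server, which a one-node setup cannot.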
On Mon, Nov 17, 2014 at 11:21 AM, eluiggi <[email protected]> wrote:
> Hi,
>
> I have an HBase (0.96.1.1-cdh5.0.2) cluster on AWS managed by Cloudera with
> 4 region servers and 1 ZooKeeper server. The ZooKeeper server is running on
> the same node as the HBase master. The problem I'm facing is that 3 of the 4
> region servers are down because they can't connect to ZooKeeper. The only
> region server that stays up is the one running on the same node as the
> master and ZooKeeper. Below is the relevant section of one of the failing
> region server logs.
>
> 2014-11-14 15:46:59,871 INFO org.apache.zookeeper.ZooKeeper: Initiating
> client connection, connectString=ip-10-146-188-157.ec2.internal:2181
> sessionTimeout=60000 watcher=regionserver:60020,
> quorum=ip-10-146-188-157.ec2.internal:2181, baseZNode=/hbase
> 2014-11-14 15:46:59,915 INFO
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process
> identifier=regionserver:60020 connecting to ZooKeeper
> ensemble=ip-10-146-188-157.ec2.internal:2181
> 2014-11-14 15:46:59,920 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
> Will not attempt to authenticate using SASL (unknown error)
> 2014-11-14 15:47:00,649 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook
> thread: Shutdownhook:regionserver60020
> 2014-11-14 15:47:59,948 INFO org.apache.zookeeper.ClientCnxn: Client session
> timed out, have not heard from server in 60041ms for sessionid 0x0, closing
> socket connection and attempting reconnect
> 2014-11-14 15:48:00,067 WARN
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/master
> 2014-11-14 15:48:00,072 INFO org.apache.hadoop.hbase.util.RetryCounter:
> Sleeping 1000ms before retry #0...
> 2014-11-14 15:48:01,067 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
> Will not attempt to authenticate using SASL (unknown error)
> 2014-11-14 15:49:00,123 INFO org.apache.zookeeper.ClientCnxn: Client session
> timed out, have not heard from server in 60057ms for sessionid 0x0, closing
> socket connection and attempting reconnect
> 2014-11-14 15:49:00,224 WARN
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/master
> 2014-11-14 15:49:00,224 INFO org.apache.hadoop.hbase.util.RetryCounter:
> Sleeping 2000ms before retry #1...
> 2014-11-14 15:49:01,224 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
> Will not attempt to authenticate using SASL (unknown error)
> 2014-11-14 15:50:00,259 INFO org.apache.zookeeper.ClientCnxn: Client session
> timed out, have not heard from server in 60035ms for sessionid 0x0, closing
> socket connection and attempting reconnect
> 2014-11-14 15:50:00,360 WARN
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/master
> 2014-11-14 15:50:00,360 INFO org.apache.hadoop.hbase.util.RetryCounter:
> Sleeping 4000ms before retry #2...
> 2014-11-14 15:50:01,360 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
> Will not attempt to authenticate using SASL (unknown error)
> 2014-11-14 15:51:00,408 INFO org.apache.zookeeper.ClientCnxn: Client session
> timed out, have not heard from server in 60048ms for sessionid 0x0, closing
> socket connection and attempting reconnect
> 2014-11-14 15:51:00,509 WARN
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/master
> 2014-11-14 15:51:00,509 INFO org.apache.hadoop.hbase.util.RetryCounter:
> Sleeping 8000ms before retry #3...
> 2014-11-14 15:51:01,509 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181.
> Will not attempt to authenticate using SASL (unknown error)
> 2014-11-14 15:52:00,559 INFO org.apache.zookeeper.ClientCnxn: Client session
> timed out, have not heard from server in 60051ms for sessionid 0x0, closing
> socket connection and attempting reconnect
> 2014-11-14 15:52:00,659 WARN
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181,
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/master
> 2014-11-14 15:52:00,660 ERROR
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists
> failed after 4 attempts
> 2014-11-14 15:52:00,661 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil:
> regionserver:60020, quorum=ip-10-146-188-157.ec2.internal:2181,
> baseZNode=/hbase Unable to set watcher on znode /hbase/master
> org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/master
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
>     at java.lang.Thread.run(Thread.java:744)
> 2014-11-14 15:52:00,687 ERROR
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020,
> quorum=ip-10-146-188-157.ec2.internal:2181, baseZNode=/hbase Received
> unexpected KeeperException, re-throwing exception
> org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/master
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
>     at java.lang.Thread.run(Thread.java:744)
> 2014-11-14 15:52:00,692 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> 0.0.0.0,60020,1415998019646: Unexpected exception during initialization,
> aborting
> org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/master
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
>     at java.lang.Thread.run(Thread.java:744)
>
> The section of hbase-site.xml dealing with ZooKeeper is:
>
> <property>
>   <name>zookeeper.znode.parent</name>
>   <value>/hbase</value>
> </property>
> <property>
>   <name>zookeeper.znode.rootserver</name>
>   <value>root-region-server</value>
> </property>
> <property>
>   <name>hbase.zookeeper.quorum</name>
>   <value>ip-10-146-188-157.ec2.internal</value>
> </property>
> <property>
>   <name>hbase.zookeeper.property.clientPort</name>
>   <value>2181</value>
> </property>
>
> The /etc/hosts on each of the nodes is:
>
> 127.0.0.1 localhost.localdomain localhost
> ::1 localhost6.localdomain6 localhost6
>
> Following some other threads I have removed the limit on the number of
> connections, increased the timeout value, and explicitly added the hosts to
> /etc/hosts on the region server and master nodes. None of these have helped
> so far.
>
> Any help will be greatly appreciated.
>
> --
> View this message in context:
> http://apache-hbase.679495.n3.nabble.com/ConnectionLossException-KeeperErrorCode-ConnectionLoss-for-hbase-master-tp4066034.html
> Sent from the HBase User mailing list archive at Nabble.com.
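Also: the failing region servers never hear anything back from ZooKeeper at all (60s session timeouts with sessionid 0x0), which smells like basic network reachability rather than an HBase problem. Before digging further, it may be worth probing raw TCP connectivity to port 2181 from one of the failing nodes. A minimal sketch (generic reachability probe, not an HBase or ZooKeeper API; the host name is the one from your hbase.zookeeper.quorum setting):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake;
        # DNS failures and connection refusals both raise OSError subclasses.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Probe the lone ZooKeeper server from a failing region server.
    print("zookeeper reachable:", can_connect("ip-10-146-188-157.ec2.internal", 2181))
```

If this returns False from the failing nodes but True from the master node, the problem is a security group / firewall rule blocking 2181 between the region servers and the ZooKeeper host, not HBase configuration.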
