Sorry, I did not get a chance to look at the HDFS issue, but a quick search shows many hits for that HDFS error message, so it looks like a common issue. You could search for it and check those suggestions (e.g. increase memory...), or ask directly on the HDFS user list :-)
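For anyone finding this thread later, a minimal sketch of the usual first checks for this error (log paths here are assumptions and vary by install; these commands need to run on a node with the Hadoop client configured):

```shell
# "could only be replicated to 0 nodes ... There are 1 datanode(s) running"
# usually means the lone datanode briefly could not accept a block: it was
# dead, out of disk space, or out of transfer threads at that moment.

# Is the datanode registered, and how much DFS space/capacity is left?
hdfs dfsadmin -report

# Any corrupt or under-replicated blocks left behind?
hdfs fsck / | tail -n 20

# Scan the datanode log around the failure window for GC pauses, disk
# errors, or thread exhaustion (log path is an assumption; adjust to
# wherever your install writes logs).
grep '2014-11-29 21:12' /var/log/hadoop/*datanode*.log | grep -iE 'error|exception|pause'
```
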
On Tue, Dec 2, 2014 at 9:33 PM, Robert Kent <[email protected]> wrote:

> Sending without an attachment, as the attachment seems to be delaying/stopping the message hitting the mailing list.
>
> ________________________________________
> From: Robert Kent
> Sent: 02 December 2014 12:40
> To: [email protected]
> Subject: RE: HBase Regionserver randomly dies
>
> > which hbase release are you using?
>
> Hadoop: 2.5.0
> HBase: 0.98.6
> Zookeeper: 3.4.5
>
> > your regionserver log is helpful.
>
> I've attached all the logs for the 29th November - the logs are 148M uncompressed.
> I've also attached the Zookeeper, Hadoop & HBase configurations.
>
> Is there any other information I can give you to help?
>
> > ps you mentioned "The clusters are either single node..." what about your hdfs nodes?
>
> I have a handful of clusters:
>
> 1 node cluster: 1x VM running: HBase Regionserver & Master; Hadoop NameNode, DataNode, JobHistoryServer, NodeManager; Zookeeper
> 3 node cluster: 3x VM all/mostly-all running: HBase Regionserver & Master; Hadoop NameNode, DataNode, JobHistoryServer, NodeManager; Zookeeper
>
> HBase is running on top of HDFS in both cluster types.
>
> The single node cluster runs everything on itself.
> The three node cluster runs virtually everything on every node.
>
> _______________________________________
> From: Qiang Tian [[email protected]]
> Sent: 02 December 2014 01:40
> To: [email protected]
> Subject: Re: HBase Regionserver randomly dies
>
> which hbase release are you using?
>
> your regionserver log is helpful.
>
> a related case is https://issues.apache.org/jira/browse/HBASE-11902
>
> in 0.98 the RS abort is expected behavior when getting HDFS failure. you still need to find the root cause of the hdfs failure:
>
> could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
>
> ps you mentioned "The clusters are either single node..." what about your hdfs nodes?
>
> On Tue, Dec 2, 2014 at 4:15 AM, Ted Yu <[email protected]> wrote:
>
> > There could be multiple reasons why the single datanode became considered as dead,
> > e.g. datanode went under load which it couldn't handle.
> >
> > I would recommend adding more datanode(s) so that client (hbase) can ride over (slow) datanode.
> >
> > Cheers
> >
> > On Mon, Dec 1, 2014 at 8:21 AM, Robert Kent <[email protected]> wrote:
> >
> > > Sorry, those logs were from the Regionserver.
> > >
> > > The NameNode logs are:
> > >
> > > 2014-11-29 21:12:59,493 WARN [IPC Server handler 0 on 8020] blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(313)) - Failed to place enough replicas, still in need of 1 to reach 1. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> > > 2014-11-29 21:12:59,494 INFO [IPC Server handler 0 on 8020] ipc.Server (Server.java:run(2034)) - IPC Server handler 0 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:39965 Call#382010 Retry#0
> > > java.io.IOException: File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > 2014-11-29 21:12:59,742 WARN [IPC Server handler 5 on 8020] blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(313)) - Failed to place enough replicas, still in need of 1 to reach 1. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> > > 2014-11-29 21:12:59,742 INFO [IPC Server handler 5 on 8020] ipc.Server (Server.java:run(2034)) - IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:39965 Call#382017 Retry#0
> > > java.io.IOException: File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > >         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> > > [snip]
> > >
> > > Then after 2014-11-29 21:13:11 there are no further exceptions.
> > >
> > > After having the Regionserver abort, I did not touch Hadoop. All I did was restart the Regionserver and everything started working correctly again.
> > >
> > > ________________________________________
> > > From: Robert Kent
> > > Sent: 01 December 2014 16:17
> > > To: [email protected]
> > > Subject: RE: HBase Regionserver randomly dies
> > >
> > > > From: Ted Yu [[email protected]]
> > > > Sent: 01 December 2014 15:31
> > > > To: [email protected]
> > > > Subject: Re: HBase Regionserver randomly dies
> > > >
> > > > Can you check namenode log around the time 'Failed to close inode' error was thrown?
> > > >
> > > > Thanks
> > >
> > > Here are the errors from the logs:
> > >
> > > 2014-11-29 21:12:59,277 WARN [Thread-125058] hdfs.DFSClient (DFSOutputStream.java:run(639)) - DataStreamer Exception
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > >         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
> > > [snip]
> > > 2014-11-29 21:12:59,306 WARN [regionserver60020.logRoller] hdfs.DFSClient (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,359 WARN [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(566)) - pre-sync failed
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,459 INFO [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588)) - Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295456389 with entries=23544, filesize=121.9 M; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > > 2014-11-29 21:12:59,488 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,489 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,490 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,491 FATAL [regionserver60020-WAL.AsyncSyncer1] wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,506 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,507 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,524 WARN [Thread-125063] hdfs.DFSClient (DFSOutputStream.java:run(639)) - DataStreamer Exception
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,525 WARN [regionserver60020.logRoller] hdfs.DFSClient (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > [snip]
> > > 2014-11-29 21:12:59,538 WARN [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure! error count=1
> > > 2014-11-29 21:12:59,539 INFO [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588)) - Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 with entries=5, filesize=0; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > > [snip]
> > > 2014-11-29 21:12:59,736 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,743 WARN [Thread-125064] hdfs.DFSClient (DFSOutputStream.java:run(639)) - DataStreamer Exception
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.ProtobufLogWriter (ProtobufLogWriter.java:writeWALTrailer(157)) - Got IOException while writing trailer
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(776)) - Failed close of HLog writer
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,746 WARN [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure! error count=2
> > > 2014-11-29 21:12:59,748 INFO [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588)) - Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 with entries=5, filesize=0; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> > > 2014-11-29 21:12:59,751 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > >         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> > > [snip]
> > > 2014-11-29 21:12:59,767 FATAL [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:abort(1865)) - ABORTING region server extras1.ci.local,60020,1417171049368: Failed log close in log roller
> > > org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: #1417295579737
> > > [snip]
> > > 2014-11-29 21:12:59,768 FATAL [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:abort(1873)) - RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> > > 2014-11-29 21:13:00,522 INFO [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:stop(1798)) - STOPPED: Failed log close in log roller
> > > [snip]
> > > 2014-11-29 21:13:02,132 INFO [regionserver60020] regionserver.Leases (Leases.java:close(147)) - regionserver60020 closing leases
> > > 2014-11-29 21:13:02,132 INFO [regionserver60020] regionserver.Leases (Leases.java:close(150)) - regionserver60020 closed leases
> > > 2014-11-29 21:13:02,179 INFO [regionserver60020] regionserver.ReplicationSource (ReplicationSource.java:terminate(860)) - Closing source Indexer_mhsaudit because: Region server is closing
> > > 2014-11-29 21:13:02,179 INFO [regionserver60020] client.HConnectionManager$HConnectionImplementation (HConnectionManager.java:closeZooKeeperWatcher(1837)) - Closing zookeeper sessionid=0x149f0f13a300434
> > > 2014-11-29 21:13:02,180 INFO [regionserver60020] zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300434 closed
> > > 2014-11-29 21:13:02,272 INFO [regionserver60020-EventThread] zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
> > > 2014-11-29 21:13:02,356 INFO [regionserver60020] zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300431 closed
> > > 2014-11-29 21:13:02,356 INFO [regionserver60020-EventThread] zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
> > > 2014-11-29 21:13:02,356 INFO [regionserver60020] regionserver.HRegionServer (HRegionServer.java:run(1058)) - stopping server extras1.ci.local,60020,1417171049368; zookeeper connection closed.
> > > 2014-11-29 21:13:02,417 INFO [regionserver60020] regionserver.HRegionServer (HRegionServer.java:run(1061)) - regionserver60020 exiting
> > > 2014-11-29 21:13:02,494 ERROR [main] regionserver.HRegionServerCommandLine (HRegionServerCommandLine.java:start(70)) - Region server exiting
> > > java.lang.RuntimeException: HRegionServer Aborted
> > >         at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
> > >         at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
> > >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > >         at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
> > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2467)
> > > 2014-11-29 21:13:02,715 INFO [Thread-10] regionserver.ShutdownHook (ShutdownHook.java:run(111)) - Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@511a1546
> > > 2014-11-29 21:13:02,716 INFO [Thread-10] regionserver.ShutdownHook (ShutdownHook.java:run(120)) - Starting fs shutdown hook thread.
> > > 2014-11-29 21:13:02,717 ERROR [Thread-125066] hdfs.DFSClient (DFSClient.java:closeAllFilesBeingWritten(911)) - Failed to close inode 32621
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579753 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > >
> > > On Mon, Dec 1, 2014 at 4:10 AM, Robert Kent <[email protected]> wrote:
> > >
> > > > > Looks like an HDFS issue. Are you sure your HDFS is working fine?
> > > >
> > > > HDFS appears to be working correctly - HBase will process requests properly and everything appears to work correctly for hours/days, until the regionserver randomly falls over. If there were HDFS issues I would expect to see these during normal operation, but I don't.
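[Editor's note, following up on the "increase memory / add more datanodes" suggestions above: on a single-datanode cluster, another common trigger for "could only be replicated to 0 nodes" under WAL-roll load is the datanode running out of transfer threads. A commonly cited starting point (the value below is a general suggestion, not something verified against this cluster) is:]

```xml
<!-- hdfs-site.xml on the datanode (restart the datanode after changing).
     The default is low for HBase workloads; 4096 is the value the HBase
     reference guide commonly suggests. -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>
```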
