Sorry, I did not get a chance to look at the HDFS issue, but a quick search shows many hits for that HDFS error message, so it looks like a common issue. You could search for it and check those suggestions (e.g. increase memory...), or ask directly on the HDFS user list :-)
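For anyone finding this thread later, a minimal sketch of the usual first checks for this error (log paths here are assumptions and vary by install; these commands need to run on a node with the Hadoop client configured):

```shell
# "could only be replicated to 0 nodes ... There are 1 datanode(s) running"
# usually means the lone datanode briefly could not accept a block: it was
# dead, out of disk space, or out of transfer threads at that moment.

# Is the datanode registered, and how much DFS space/capacity is left?
hdfs dfsadmin -report

# Any corrupt or under-replicated blocks left behind?
hdfs fsck / | tail -n 20

# Scan the datanode log around the failure window for GC pauses, disk
# errors, or thread exhaustion (log path is an assumption; adjust to
# wherever your install writes logs).
grep '2014-11-29 21:12' /var/log/hadoop/*datanode*.log | grep -iE 'error|exception|pause'
```
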
On Tue, Dec 2, 2014 at 9:33 PM, Robert Kent <[email protected]> wrote:

> Sending without an attachment, as the attachment seems to be delaying/stopping the message hitting the mailing list.
>
> ________________________________________
> From: Robert Kent
> Sent: 02 December 2014 12:40
> To: [email protected]
> Subject: RE: HBase Regionserver randomly dies
>
> > which hbase release are you using?
>
> Hadoop: 2.5.0
> HBase: 0.98.6
> Zookeeper: 3.4.5
>
> > your regionserver log is helpful.
>
> I've attached all the logs for the 29th November - the logs are 148M uncompressed.
> I've also attached the Zookeeper, Hadoop & HBase configurations.
>
> Is there any other information I can give you to help?
>
> > ps you mentioned "The clusters are either single node..." what about your hdfs nodes?
>
> I have a handful of clusters:
>
> 1 node cluster: 1x VM running: HBase Regionserver & Master; Hadoop NameNode, DataNode, JobHistoryServer, NodeManager; Zookeeper
> 3 node cluster: 3x VM all/mostly-all running: HBase Regionserver & Master; Hadoop NameNode, DataNode, JobHistoryServer, NodeManager; Zookeeper
>
> HBase is running on top of HDFS in both cluster types.
>
> The single node cluster runs everything on itself.
> The three node cluster runs virtually everything on every node.
>
> _______________________________________
> From: Qiang Tian [[email protected]]
> Sent: 02 December 2014 01:40
> To: [email protected]
> Subject: Re: HBase Regionserver randomly dies
>
> which hbase release are you using?
>
> your regionserver log is helpful.
>
> a related case is https://issues.apache.org/jira/browse/HBASE-11902
>
> in 0.98 the RS abort is expected behavior when getting HDFS failure. you still need to find the root cause of the hdfs failure:
>
> could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
>
> ps you mentioned "The clusters are either single node..." what about your hdfs nodes?
>
> On Tue, Dec 2, 2014 at 4:15 AM, Ted Yu <[email protected]> wrote:
>
> > There could be multiple reasons why the single datanode became considered as dead,
> > e.g. datanode went under load which it couldn't handle.
> >
> > I would recommend adding more datanode(s) so that client (hbase) can ride over (slow) datanode.
> >
> > Cheers
> >
> > On Mon, Dec 1, 2014 at 8:21 AM, Robert Kent <[email protected]> wrote:
> >
> > > Sorry, those logs were from the Regionserver.
> > >
> > > The NameNode logs are:
> > >
> > > 2014-11-29 21:12:59,493 WARN [IPC Server handler 0 on 8020] blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(313)) - Failed to place enough replicas, still in need of 1 to reach 1. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> > > 2014-11-29 21:12:59,494 INFO [IPC Server handler 0 on 8020] ipc.Server (Server.java:run(2034)) - IPC Server handler 0 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:39965 Call#382010 Retry#0
> > > java.io.IOException: File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > 2014-11-29 21:12:59,742 WARN [IPC Server handler 5 on 8020] blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(313)) - Failed to place enough replicas, still in need of 1 to reach 1. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> > > 2014-11-29 21:12:59,742 INFO [IPC Server handler 5 on 8020] ipc.Server (Server.java:run(2034)) - IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:39965 Call#382017 Retry#0
> > > java.io.IOException: File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > >         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> > > [snip]
> > >
> > > Then after 2014-11-29 21:13:11 there are no further exceptions.
> > >
> > > After having the Regionserver abort, I did not touch Hadoop. All I did was restart the Regionserver and everything started working correctly again.
> > >
> > > ________________________________________
> > > From: Robert Kent
> > > Sent: 01 December 2014 16:17
> > > To: [email protected]
> > > Subject: RE: HBase Regionserver randomly dies
> > >
> > > > From: Ted Yu [[email protected]]
> > > > Sent: 01 December 2014 15:31
> > > > To: [email protected]
> > > > Subject: Re: HBase Regionserver randomly dies
> > > >
> > > > Can you check namenode log around the time 'Failed to close inode' error was thrown?
> > > >
> > > > Thanks
> > >
> > > Here are the errors from the logs:
> > >
> > > 2014-11-29 21:12:59,277 WARN [Thread-125058] hdfs.DFSClient (DFSOutputStream.java:run(639)) - DataStreamer Exception
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > >         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> > >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
> > > [snip]
> > > 2014-11-29 21:12:59,306 WARN [regionserver60020.logRoller] hdfs.DFSClient (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,359 WARN [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(566)) - pre-sync failed
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,459 INFO [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588)) - Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295456389 with entries=23544, filesize=121.9 M; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
> > > 2014-11-29 21:12:59,488 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,489 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,490 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,491 FATAL [regionserver60020-WAL.AsyncSyncer1] wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,506 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,507 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog (FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,524 WARN [Thread-125063] hdfs.DFSClient (DFSOutputStream.java:run(639)) - DataStreamer Exception
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,525 WARN [regionserver60020.logRoller] hdfs.DFSClient (DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > [snip]
> > > 2014-11-29 21:12:59,538 WARN [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure! error count=1
> > > 2014-11-29 21:12:59,539 INFO [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588)) - Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898 with entries=5, filesize=0; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
> > > [snip]
> > > 2014-11-29 21:12:59,736 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,743 WARN [Thread-125064] hdfs.DFSClient (DFSOutputStream.java:run(639)) - DataStreamer Exception
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.ProtobufLogWriter (ProtobufLogWriter.java:writeWALTrailer(157)) - Got IOException while writing trailer
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(776)) - Failed close of HLog writer
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > > [snip]
> > > 2014-11-29 21:12:59,746 WARN [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure! error count=2
> > > 2014-11-29 21:12:59,748 INFO [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588)) - Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489 with entries=5, filesize=0; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
> > > 2014-11-29 21:12:59,751 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > >         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
> > > [snip]
> > > 2014-11-29 21:12:59,767 FATAL [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:abort(1865)) - ABORTING region server extras1.ci.local,60020,1417171049368: Failed log close in log roller
> > > org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: #1417295579737
> > > [snip]
> > > 2014-11-29 21:12:59,768 FATAL [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:abort(1873)) - RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> > > 2014-11-29 21:13:00,522 INFO [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:stop(1798)) - STOPPED: Failed log close in log roller
> > > [snip]
> > > 2014-11-29 21:13:02,132 INFO [regionserver60020] regionserver.Leases (Leases.java:close(147)) - regionserver60020 closing leases
> > > 2014-11-29 21:13:02,132 INFO [regionserver60020] regionserver.Leases (Leases.java:close(150)) - regionserver60020 closed leases
> > > 2014-11-29 21:13:02,179 INFO [regionserver60020] regionserver.ReplicationSource (ReplicationSource.java:terminate(860)) - Closing source Indexer_mhsaudit because: Region server is closing
> > > 2014-11-29 21:13:02,179 INFO [regionserver60020] client.HConnectionManager$HConnectionImplementation (HConnectionManager.java:closeZooKeeperWatcher(1837)) - Closing zookeeper sessionid=0x149f0f13a300434
> > > 2014-11-29 21:13:02,180 INFO [regionserver60020] zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300434 closed
> > > 2014-11-29 21:13:02,272 INFO [regionserver60020-EventThread] zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
> > > 2014-11-29 21:13:02,356 INFO [regionserver60020] zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x149f0f13a300431 closed
> > > 2014-11-29 21:13:02,356 INFO [regionserver60020-EventThread] zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
> > > 2014-11-29 21:13:02,356 INFO [regionserver60020] regionserver.HRegionServer (HRegionServer.java:run(1058)) - stopping server extras1.ci.local,60020,1417171049368; zookeeper connection closed.
> > > 2014-11-29 21:13:02,417 INFO [regionserver60020] regionserver.HRegionServer (HRegionServer.java:run(1061)) - regionserver60020 exiting
> > > 2014-11-29 21:13:02,494 ERROR [main] regionserver.HRegionServerCommandLine (HRegionServerCommandLine.java:start(70)) - Region server exiting
> > > java.lang.RuntimeException: HRegionServer Aborted
> > >         at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
> > >         at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
> > >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > >         at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
> > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2467)
> > > 2014-11-29 21:13:02,715 INFO [Thread-10] regionserver.ShutdownHook (ShutdownHook.java:run(111)) - Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@511a1546
> > > 2014-11-29 21:13:02,716 INFO [Thread-10] regionserver.ShutdownHook (ShutdownHook.java:run(120)) - Starting fs shutdown hook thread.
> > > 2014-11-29 21:13:02,717 ERROR [Thread-125066] hdfs.DFSClient (DFSClient.java:closeAllFilesBeingWritten(911)) - Failed to close inode 32621
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579753 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
> > >
> > > On Mon, Dec 1, 2014 at 4:10 AM, Robert Kent <[email protected]> wrote:
> > >
> > > > > Looks like an HDFS issue. Are you sure your HDFS is working fine?
> > > >
> > > > HDFS appears to be working correctly - HBase will process requests properly and everything appears to work correctly for hours/days, until the regionserver randomly falls over. If there were HDFS issues I would expect to see these during normal operation, but I don't.
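[Editor's note, following up on the "increase memory / add more datanodes" suggestions above: on a single-datanode cluster, another common trigger for "could only be replicated to 0 nodes" under WAL-roll load is the datanode running out of transfer threads. A commonly cited starting point (the value below is a general suggestion, not something verified against this cluster) is:]

```xml
<!-- hdfs-site.xml on the datanode (restart the datanode after changing).
     The default is low for HBase workloads; 4096 is the value the HBase
     reference guide commonly suggests. -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>
```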
