> From: Ted Yu [[email protected]]
> Sent: 01 December 2014 15:31
> To: [email protected]
> Subject: Re: HBase Regionserver randomly dies
> Can you check namenode log around the time 'Failed to close inode' error
> was thrown ?
> Thanks
Here are the errors from the logs:
2014-11-29 21:12:59,277 WARN [Thread-125058] hdfs.DFSClient
(DFSOutputStream.java:run(639)) - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicat
ed to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running
and no node(s) are excluded in this operation.
at
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
[snip]
2014-11-29 21:12:59,306 WARN [regionserver60020.logRoller] hdfs.DFSClient
(DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,359 WARN [regionserver60020.logRoller] wal.FSHLog
(FSHLog.java:rollWriter(566)) - pre-sync failed
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,459 INFO [regionserver60020.logRoller] wal.FSHLog
(FSHLog.java:rollWriter(588)) - Rolled WAL
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295456389
with entries=23544, filesize=121.9 M; new WAL
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
2014-11-29 21:12:59,488 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog
(FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,489 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog
(FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,490 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog
(FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,491 FATAL [regionserver60020-WAL.AsyncSyncer1] wal.FSHLog
(FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,506 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog
(FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,507 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog
(FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,524 WARN [Thread-125063] hdfs.DFSClient
(DFSOutputStream.java:run(639)) - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,525 WARN [regionserver60020.logRoller] hdfs.DFSClient
(DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
[snip]
2014-11-29 21:12:59,538 WARN [regionserver60020.logRoller] wal.FSHLog
(FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure! error
count=1
2014-11-29 21:12:59,539 INFO [regionserver60020.logRoller] wal.FSHLog
(FSHLog.java:rollWriter(588)) - Rolled WAL
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
with entries=5, filesize=0; new WAL
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
[snip]
2014-11-29 21:12:59,736 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog
(FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,743 WARN [Thread-125064] hdfs.DFSClient
(DFSOutputStream.java:run(639)) - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller]
wal.ProtobufLogWriter (ProtobufLogWriter.java:writeWALTrailer(157)) - Got
IOException while writing trailer
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.FSHLog
(FSHLog.java:cleanupCurrentWriter(776)) - Failed close of HLog writer
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,746 WARN [regionserver60020.logRoller] wal.FSHLog
(FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure! error
count=2
2014-11-29 21:12:59,748 INFO [regionserver60020.logRoller] wal.FSHLog
(FSHLog.java:rollWriter(588)) - Rolled WAL
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
with entries=5, filesize=0; new WAL
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
2014-11-29 21:12:59,751 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog
(FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
at
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
[snip]
2014-11-29 21:12:59,767 FATAL [regionserver60020.logRoller]
regionserver.HRegionServer (HRegionServer.java:abort(1865)) - ABORTING region
server extras1.ci.local,60020,1417171049368: Failed log close in log roller
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: #1417295579737
[snip]
2014-11-29 21:12:59,768 FATAL [regionserver60020.logRoller]
regionserver.HRegionServer (HRegionServer.java:abort(1873)) - RegionServer
abort: loaded coprocessors are:
[org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
2014-11-29 21:13:00,522 INFO [regionserver60020.logRoller]
regionserver.HRegionServer (HRegionServer.java:stop(1798)) - STOPPED: Failed
log close in log roller
[snip]
2014-11-29 21:13:02,132 INFO [regionserver60020] regionserver.Leases
(Leases.java:close(147)) - regionserver60020 closing leases
2014-11-29 21:13:02,132 INFO [regionserver60020] regionserver.Leases
(Leases.java:close(150)) - regionserver60020 closed leases
2014-11-29 21:13:02,179 INFO [regionserver60020]
regionserver.ReplicationSource (ReplicationSource.java:terminate(860)) -
Closing source Indexer_mhsaudit because: Region server is closing
2014-11-29 21:13:02,179 INFO [regionserver60020]
client.HConnectionManager$HConnectionImplementation
(HConnectionManager.java:closeZooKeeperWatcher(1837)) - Closing zookeeper
sessionid=0x149f0f13a300434
2014-11-29 21:13:02,180 INFO [regionserver60020] zookeeper.ZooKeeper
(ZooKeeper.java:close(684)) - Session: 0x149f0f13a300434 closed
2014-11-29 21:13:02,272 INFO [regionserver60020-EventThread]
zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
2014-11-29 21:13:02,356 INFO [regionserver60020] zookeeper.ZooKeeper
(ZooKeeper.java:close(684)) - Session: 0x149f0f13a300431 closed
2014-11-29 21:13:02,356 INFO [regionserver60020-EventThread]
zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
2014-11-29 21:13:02,356 INFO [regionserver60020] regionserver.HRegionServer
(HRegionServer.java:run(1058)) - stopping server
extras1.ci.local,60020,1417171049368; zookeeper connection closed.
2014-11-29 21:13:02,417 INFO [regionserver60020] regionserver.HRegionServer
(HRegionServer.java:run(1061)) - regionserver60020 exiting
2014-11-29 21:13:02,494 ERROR [main] regionserver.HRegionServerCommandLine
(HRegionServerCommandLine.java:start(70)) - Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
at
org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
at
org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at
org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2467)
2014-11-29 21:13:02,715 INFO [Thread-10] regionserver.ShutdownHook
(ShutdownHook.java:run(111)) - Shutdown hook starting;
hbase.shutdown.hook=true;
fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@511a1546
2014-11-29 21:13:02,716 INFO [Thread-10] regionserver.ShutdownHook
(ShutdownHook.java:run(120)) - Starting fs shutdown hook thread.
2014-11-29 21:13:02,717 ERROR [Thread-125066] hdfs.DFSClient
(DFSClient.java:closeAllFilesBeingWritten(911)) - Failed to close inode 32621
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579753
could only be replicated to 0 nodes instead of minReplication (=1). There are
1 datanode(s) running and no node(s) are excluded in this operation.
On Mon, Dec 1, 2014 at 4:10 AM, Robert Kent <[email protected]> wrote:
> > Looks like an HDFS issue. Are you sure your HDFS is working fine?
>
> HDFS appears to be working correctly - HBase will process requests
> properly and everything appears to work correctly for hours/days, until the
> regionserver randomly falls over. If there were HDFS issues I would expect
> to see these during normal operation, but I don't.
>