RE: HBase Regionserver randomly dies

Robert Kent Mon, 01 Dec 2014 08:20:00 -0800

> From: Ted Yu [[email protected]]
> Sent: 01 December 2014 15:31
> To: [email protected]
> Subject: Re: HBase Regionserver randomly dies


> Can you check namenode log around the time 'Failed to close inode' error
> was thrown ?

> Thanks

Here are the errors from the logs:

2014-11-29 21:12:59,277 WARN  [Thread-125058] hdfs.DFSClient 
(DFSOutputStream.java:run(639)) - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
 could only be replicat
ed to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running 
and no node(s) are excluded in this operation.
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
[snip]
2014-11-29 21:12:59,306 WARN  [regionserver60020.logRoller] hdfs.DFSClient 
(DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,359 WARN  [regionserver60020.logRoller] wal.FSHLog 
(FSHLog.java:rollWriter(566)) - pre-sync failed
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,459 INFO  [regionserver60020.logRoller] wal.FSHLog 
(FSHLog.java:rollWriter(588)) - Rolled WAL 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295456389
 with entries=23544, filesize=121.9 M; new WAL 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
2014-11-29 21:12:59,488 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog 
(FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,489 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog 
(FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,490 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog 
(FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,491 FATAL [regionserver60020-WAL.AsyncSyncer1] wal.FSHLog 
(FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,506 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog 
(FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,507 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog 
(FSHLog.java:run(1255)) - Error while AsyncSyncer sync, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,524 WARN  [Thread-125063] hdfs.DFSClient 
(DFSOutputStream.java:run(639)) - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,525 WARN  [regionserver60020.logRoller] hdfs.DFSClient 
(DFSOutputStream.java:flushOrSync(2007)) - Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
[snip]
2014-11-29 21:12:59,538 WARN  [regionserver60020.logRoller] wal.FSHLog 
(FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure! error 
count=1
2014-11-29 21:12:59,539 INFO  [regionserver60020.logRoller] wal.FSHLog 
(FSHLog.java:rollWriter(588)) - Rolled WAL 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
 with entries=5, filesize=0; new WAL 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
[snip]
2014-11-29 21:12:59,736 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog 
(FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,743 WARN  [Thread-125064] hdfs.DFSClient 
(DFSOutputStream.java:run(639)) - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] 
wal.ProtobufLogWriter (ProtobufLogWriter.java:writeWALTrailer(157)) - Got 
IOException while writing trailer
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.FSHLog 
(FSHLog.java:cleanupCurrentWriter(776)) - Failed close of HLog writer
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,746 WARN  [regionserver60020.logRoller] wal.FSHLog 
(FSHLog.java:cleanupCurrentWriter(779)) - Riding over HLog close failure! error 
count=2
2014-11-29 21:12:59,748 INFO  [regionserver60020.logRoller] wal.FSHLog 
(FSHLog.java:rollWriter(588)) - Rolled WAL 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
 with entries=5, filesize=0; new WAL 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
2014-11-29 21:12:59,751 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog 
(FSHLog.java:run(1140)) - Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
[snip]
2014-11-29 21:12:59,767 FATAL [regionserver60020.logRoller] 
regionserver.HRegionServer (HRegionServer.java:abort(1865)) - ABORTING region 
server extras1.ci.local,60020,1417171049368: Failed log close in log roller
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: #1417295579737
[snip]
2014-11-29 21:12:59,768 FATAL [regionserver60020.logRoller] 
regionserver.HRegionServer (HRegionServer.java:abort(1873)) - RegionServer 
abort: loaded coprocessors are: 
[org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
2014-11-29 21:13:00,522 INFO  [regionserver60020.logRoller] 
regionserver.HRegionServer (HRegionServer.java:stop(1798)) - STOPPED: Failed 
log close in log roller
[snip]
2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases 
(Leases.java:close(147)) - regionserver60020 closing leases
2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases 
(Leases.java:close(150)) - regionserver60020 closed leases
2014-11-29 21:13:02,179 INFO  [regionserver60020] 
regionserver.ReplicationSource (ReplicationSource.java:terminate(860)) - 
Closing source Indexer_mhsaudit because: Region server is closing
2014-11-29 21:13:02,179 INFO  [regionserver60020] 
client.HConnectionManager$HConnectionImplementation 
(HConnectionManager.java:closeZooKeeperWatcher(1837)) - Closing zookeeper 
sessionid=0x149f0f13a300434
2014-11-29 21:13:02,180 INFO  [regionserver60020] zookeeper.ZooKeeper 
(ZooKeeper.java:close(684)) - Session: 0x149f0f13a300434 closed
2014-11-29 21:13:02,272 INFO  [regionserver60020-EventThread] 
zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
2014-11-29 21:13:02,356 INFO  [regionserver60020] zookeeper.ZooKeeper 
(ZooKeeper.java:close(684)) - Session: 0x149f0f13a300431 closed
2014-11-29 21:13:02,356 INFO  [regionserver60020-EventThread] 
zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
2014-11-29 21:13:02,356 INFO  [regionserver60020] regionserver.HRegionServer 
(HRegionServer.java:run(1058)) - stopping server 
extras1.ci.local,60020,1417171049368; zookeeper connection closed.
2014-11-29 21:13:02,417 INFO  [regionserver60020] regionserver.HRegionServer 
(HRegionServer.java:run(1061)) - regionserver60020 exiting
2014-11-29 21:13:02,494 ERROR [main] regionserver.HRegionServerCommandLine 
(HRegionServerCommandLine.java:start(70)) - Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
        at 
org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at 
org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2467)
2014-11-29 21:13:02,715 INFO  [Thread-10] regionserver.ShutdownHook 
(ShutdownHook.java:run(111)) - Shutdown hook starting; 
hbase.shutdown.hook=true; 
fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@511a1546
2014-11-29 21:13:02,716 INFO  [Thread-10] regionserver.ShutdownHook 
(ShutdownHook.java:run(120)) - Starting fs shutdown hook thread.
2014-11-29 21:13:02,717 ERROR [Thread-125066] hdfs.DFSClient 
(DFSClient.java:closeAllFilesBeingWritten(911)) - Failed to close inode 32621
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579753
 could only be replicated to 0 nodes instead of minReplication (=1).  There are 
1 datanode(s) running and no node(s) are excluded in this operation.


On Mon, Dec 1, 2014 at 4:10 AM, Robert Kent <[email protected]> wrote:
> > Looks like an HDFS issue. Are you sure your HDFS is working fine?
>
> HDFS appears to be working correctly - HBase will process requests
> properly and everything appears to work correctly for hours/days, until the
> regionserver randomly falls over.  If there were HDFS issues I would expect
> to see these during normal operation, but I don't.
>

RE: HBase Regionserver randomly dies

Reply via email to