Hi there,

While using the hadoop 0.20.9-yahoo distribution with hbase 0.20.4, I found that hadoop loses blocks under certain conditions, which in turn corrupts hbase tables.
I compared the namenode, datanode and hbase regionserver logs and figured out the reason. The regionserver 10.110.8.85 asks the namenode 10.110.8.83 to save a block; 10.110.8.84 gives multiple IPs, and the regionserver chooses 10.110.8.63 and saves the block there. After a while, the namenode asks for the block to be replicated to the 10.110.8.86 and 10.110.8.69 machines. A moment later, .86 and .69 received the replica but, strangely, 10.110.8.59 and 10.110.8.85 also received a replica of the block, even though they are not in the replication list.

Then chooseExcessReplicates asks to delete the excess replicas from .63 and .69, thinking there are too many replicas. Even though .63 held the original copy, the algorithm chooses which replica to delete based on the amount of free disk space. About five minutes later (21:26:39), addToInvalidates (not from chooseExcessReplicates) asks for the block to be deleted on .86, .85 and .59. I checked the code: this is only possible if the block is considered corrupt. In the end, this block doesn't exist anywhere in the cluster, and it is permanently lost.

namenode:

2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (10.110.8.63:50010, blk_5636039758999247483_31304886) is added to recentInvalidateSets
2010-05-18 21:21:43,995 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to delete blk_5636039758999247483_31304886 blk_4349310048904429157_31304519

The block was initially added to 10.110.8.63, and replicas then appeared on 10.110.8.63, .59, .69, .86 and .85. Subsequently, addToInvalidates removed all of them. My reading of the code is that the replica is treated as corrupt, so all copies get deleted.
2010-05-18 21:21:29,913 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.63:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to replicate blk_5636039758999247483_31304886 to datanode(s) 10.110.8.86:50010 10.110.8.69:50010
2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (10.110.8.63:50010, blk_5636039758999247483_31304886) is added to recentInvalidateSets
2010-05-18 21:21:43,995 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to delete blk_5636039758999247483_31304886 blk_4349310048904429157_31304519
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.63:50010
2010-05-18 21:26:45,953 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to delete blk_-1838286221287242082_31305179 blk_8446762564182513417_31305143 blk_5636039758999247483_31304886 blk_4628640249731313760_31305046 blk_7460947863067370701_31270225 blk_-4468681536500281247_31270225 blk_8453517711101429609_31303917 blk_9126133835045521966_31303972 blk_4623110280826973929_31305203 blk_-2581238696314957800_31305033 blk_7461125351290749755_31305052

2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to replicate blk_5636039758999247483_31304886 to datanode(s) 10.110.8.86:50010 10.110.8.69:50010
2010-05-18 21:21:33,156 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.69:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (10.110.8.69:50010, blk_5636039758999247483_31304886) is added to recentInvalidateSets
2010-05-18 21:21:59,005 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.69:50010 to delete blk_5636039758999247483_31304886
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.69:50010
2010-05-18 21:26:42,951 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.69:50010 to delete blk_-2124965527858346013_31270213 blk_-5027506345849158498_31270213 blk_5636039758999247483_31304886 blk_9148821113904458973_31305189 blk_4850797749721229572_31305072 blk_252039065084461924_31305031 blk_-8351836728009062091_31305208 blk_-7576696059515014894_31305194 blk_-2900250119736465962_31270214 blk_471700613578524871_31304950 blk_-190744003190006044_31305064 blk_7265057386742001625_31305073

2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to replicate blk_5636039758999247483_31304886 to datanode(s) 10.110.8.86:50010 10.110.8.69:50010
2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.86:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.86:50010
2010-05-18 21:26:42,951 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.86:50010 to delete blk_-6242136662924452584_31259201 blk_5636039758999247483_31304886 blk_4850797749721229572_31305072 blk_252039065084461924_31305031 blk_-1317144678443645904_31305204 blk_6050185755706975664_31270230 blk_2671416971885801868_31304948 blk_-5582352089328547938_31305022 blk_-3115115738671914626_31270210

2010-05-18 21:21:34,413 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.59:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.59:50010
2010-05-18 21:26:39,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.59:50010 to delete blk_5636039758999247483_31304886 blk_-4528512156635399625_31305212 blk_1439789418382469336_31305158 blk_8860574934531794641_31270219 blk_-8358193301564392132_31305029

2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.85:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.85:50010
2010-05-18 21:26:39,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.85:50010 to delete blk_-6242136662924452584_31259201 blk_5636039758999247483_31304886 blk_4628640249731313760_31305046 blk_4747588241975451642_31305123 blk_-6876078628884993825_31270230 blk_-4468681536500281247_31270225 blk_7325830193509411302_31270230 blk_8453517711101429609_31303917 blk_-6094734447689285387_31305127 blk_3353439739797003235_31305037 blk_-5027506345849158498_31270213 blk_1484161645992497144_31270225 blk_4464987648045469454_31305144 blk_7460947863067370701_31270225 blk_-1170815606945644545_31270230 blk_6050185755706975664_31270230 blk_-8358193301564392132_31305029 blk_2671416971885801868_31304948 blk_5593547375459437465_31286511 blk_-2581238696314957800_31305033 blk_4732635559915402193_31270230 blk_-2124965527858346013_31270213 blk_-5837992573431863412_31286612 blk_-432558447034944954_31270208 blk_-3407615138527189735_31305069 blk_8860574934531794641_31270219 blk_233110856487529716_31270229 blk_312750273180273303_31270228 blk_7461125351290749755_31305052 blk_-8902661185532055148_31304947 blk_-8555258258738129670_31270210 blk_252039065084461924_31305031 blk_9037118763503479133_31305120 blk_-8494656323754369174_31305105 blk_9126133835045521966_31303972 blk_-5582352089328547938_31305022 blk_-2900250119736465962_31270214 blk_-3115115738671914626_31270210 blk_7612090442234634555_31270225 blk_5876492007747505188_31270213 blk_471700613578524871_31304950 blk_-190744003190006044_31305064

datanode 10.110.8.63:

2010-05-18 21:21:46,058 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleting block blk_5636039758999247483_31304886 file /hadoop_data_dir/dfs/data/current/subdir23/blk_5636039758999247483

hbase region server 10.110.8.85:

In DFSClient.java,

    DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
    InetSocketAddress targetAddr = NetUtils.createSocketAddr(chosenNode.getName());
    return new DNAddrPair(chosenNode, targetAddr);

Why did this still pick 10.110.8.63, even though the namenode sent the command to delete the block at 21:21:43,995 and the datanode executed it at 21:21:46,058?

2010-05-18 21:21:46,188 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.110.8.63:50010 for file /hbase/.META./1028785192/info/656097411976846533 for block 5636039758999247483:java.io.IOException: Got error in response to OP_READ_BLOCK for file /hbase/.META./1028785192/info/656097411976846533 for block 5636039758999247483
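One possible explanation for the question above, offered as a guess rather than a diagnosis: if bestNode simply returns the first location that is not in deadNodes, and the client is working from a location list it obtained before the delete was issued, it has no way of knowing that .63 no longer holds the block. The sketch below is a stand-alone model of that selection rule using plain strings. The class name BestNodeSketch and the order of the cached list are hypothetical, and this is not the actual Hadoop code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-alone model of "first location not marked dead". */
final class BestNodeSketch {

    /** Returns the first candidate that is not in deadNodes, or null if none is left. */
    static String bestNode(List<String> nodes, Set<String> deadNodes) {
        for (String node : nodes) {
            if (!deadNodes.contains(node)) {
                return node;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed location list cached by the client before the delete; the order is a guess.
        List<String> cached = Arrays.asList(
            "10.110.8.63:50010", "10.110.8.69:50010", "10.110.8.59:50010");
        Set<String> dead = new HashSet<>();

        // Nothing has failed yet, so the stale first entry is chosen.
        System.out.println(bestNode(cached, dead));

        // Only after a failed read is the node marked dead and the next one tried.
        dead.add("10.110.8.63:50010");
        System.out.println(bestNode(cached, dead));
    }
}
```

Under these assumptions the client only moves past .63 after a read from it has already failed, which would be consistent with the WARN line above.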
2010-05-18 21:21:29,731 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/.META./1028785192/info/656097411976846533. blk_5636039758999247483_31304886
2010-05-18 21:21:29,913 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.63:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to replicate blk_5636039758999247483_31304886 to datanode(s) 10.110.8.86:50010 10.110.8.69:50010
2010-05-18 21:21:33,156 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.69:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:21:34,413 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.59:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.86:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (10.110.8.63:50010, blk_5636039758999247483_31304886) is added to recentInvalidateSets
2010-05-18 21:21:43,995 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to delete blk_5636039758999247483_31304886 blk_4349310048904429157_31304519
2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.85:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (10.110.8.69:50010, blk_5636039758999247483_31304886) is added to recentInvalidateSets
2010-05-18 21:21:59,005 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.69:50010 to delete blk_5636039758999247483_31304886
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.63:50010
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.69:50010
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.59:50010
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.86:50010
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.85:50010
2010-05-18 21:26:39,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.59:50010 to delete blk_5636039758999247483_31304886 blk_-4528512156635399625_31305212 blk_1439789418382469336_31305158 blk_8860574934531794641_31270219 blk_-8358193301564392132_31305029
2010-05-18 21:26:39,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.85:50010 to delete blk_-6242136662924452584_31259201 blk_5636039758999247483_31304886 blk_4628640249731313760_31305046 blk_4747588241975451642_31305123 blk_-6876078628884993825_31270230 blk_-4468681536500281247_31270225 blk_7325830193509411302_31270230 blk_8453517711101429609_31303917 blk_-6094734447689285387_31305127 blk_3353439739797003235_31305037 blk_-5027506345849158498_31270213 blk_1484161645992497144_31270225 blk_4464987648045469454_31305144 blk_7460947863067370701_31270225 blk_-1170815606945644545_31270230 blk_6050185755706975664_31270230 blk_-8358193301564392132_31305029 blk_2671416971885801868_31304948 blk_5593547375459437465_31286511 blk_-2581238696314957800_31305033 blk_4732635559915402193_31270230 blk_-2124965527858346013_31270213 blk_-5837992573431863412_31286612 blk_-432558447034944954_31270208 blk_-3407615138527189735_31305069 blk_8860574934531794641_31270219 blk_233110856487529716_31270229 blk_312750273180273303_31270228 blk_7461125351290749755_31305052 blk_-8902661185532055148_31304947 blk_-8555258258738129670_31270210 blk_252039065084461924_31305031 blk_9037118763503479133_31305120 blk_-8494656323754369174_31305105 blk_9126133835045521966_31303972 blk_-5582352089328547938_31305022 blk_-2900250119736465962_31270214 blk_-3115115738671914626_31270210 blk_7612090442234634555_31270225 blk_5876492007747505188_31270213 blk_471700613578524871_31304950 blk_-190744003190006044_31305064
2010-05-18 21:26:42,951 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.86:50010 to delete blk_-6242136662924452584_31259201 blk_5636039758999247483_31304886 blk_4850797749721229572_31305072 blk_252039065084461924_31305031 blk_-1317144678443645904_31305204 blk_6050185755706975664_31270230 blk_2671416971885801868_31304948 blk_-5582352089328547938_31305022 blk_-3115115738671914626_31270210
2010-05-18 21:26:42,951 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.69:50010 to delete blk_-2124965527858346013_31270213 blk_-5027506345849158498_31270213 blk_5636039758999247483_31304886 blk_9148821113904458973_31305189 blk_4850797749721229572_31305072 blk_252039065084461924_31305031 blk_-8351836728009062091_31305208 blk_-7576696059515014894_31305194 blk_-2900250119736465962_31270214 blk_471700613578524871_31304950 blk_-190744003190006044_31305064 blk_7265057386742001625_31305073
2010-05-18 21:26:45,953 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to delete blk_-1838286221287242082_31305179 blk_8446762564182513417_31305143 blk_5636039758999247483_31304886 blk_4628640249731313760_31305046 blk_7460947863067370701_31270225 blk_-4468681536500281247_31270225 blk_8453517711101429609_31303917 blk_9126133835045521966_31303972 blk_4623110280826973929_31305203 blk_-2581238696314957800_31305033 blk_7461125351290749755_31305052
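
For anyone who wants to check the timeline above mechanically, here is a minimal sketch that replays namenode StateChange lines for one block and tracks which datanodes still hold a replica. The class ReplicaReplay and its method are hypothetical helpers, not part of Hadoop. It assumes that an "ask ... to delete" line means the replica is gone, which the namenode log alone does not prove. The sample lines in main are abridged from the log above: timestamps, logger prefix and unrelated block IDs are removed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: replays namenode StateChange lines for a single block. */
final class ReplicaReplay {
    // "blockMap updated: <datanode> is added to <block>_<genstamp>"
    private static final Pattern ADDED = Pattern.compile(
        "addStoredBlock: blockMap updated: (\\S+) is added to (blk_-?\\d+)_");
    // "ask <datanode> to delete <block>_<genstamp> <block>_<genstamp> ..."
    private static final Pattern DELETE = Pattern.compile(
        "ask (\\S+) to delete (.*)$");

    /** Datanodes that reported the block and were not asked to delete it afterwards. */
    static Set<String> liveReplicas(List<String> lines, String blockId) {
        Set<String> live = new TreeSet<>();
        for (String line : lines) {
            Matcher added = ADDED.matcher(line);
            if (added.find()) {
                if (added.group(2).equals(blockId)) {
                    live.add(added.group(1));
                }
                continue;
            }
            Matcher delete = DELETE.matcher(line);
            // Assumption: a delete request is treated as the replica being removed.
            if (delete.find() && delete.group(2).contains(blockId + "_")) {
                live.remove(delete.group(1));
            }
        }
        return live;
    }

    public static void main(String[] args) {
        String block = "blk_5636039758999247483";
        // Abridged from the namenode log in this post, in chronological order.
        List<String> lines = Arrays.asList(
            "NameSystem.addStoredBlock: blockMap updated: 10.110.8.63:50010 is added to blk_5636039758999247483_31304886 size 441",
            "NameSystem.addStoredBlock: blockMap updated: 10.110.8.69:50010 is added to blk_5636039758999247483_31304886 size 441",
            "NameSystem.addStoredBlock: blockMap updated: 10.110.8.59:50010 is added to blk_5636039758999247483_31304886 size 441",
            "NameSystem.addStoredBlock: blockMap updated: 10.110.8.86:50010 is added to blk_5636039758999247483_31304886 size 441",
            "ask 10.110.8.63:50010 to delete blk_5636039758999247483_31304886",
            "NameSystem.addStoredBlock: blockMap updated: 10.110.8.85:50010 is added to blk_5636039758999247483_31304886 size 441",
            "ask 10.110.8.69:50010 to delete blk_5636039758999247483_31304886",
            "ask 10.110.8.59:50010 to delete blk_5636039758999247483_31304886",
            "ask 10.110.8.85:50010 to delete blk_5636039758999247483_31304886",
            "ask 10.110.8.86:50010 to delete blk_5636039758999247483_31304886");
        System.out.println("live replicas: " + liveReplicas(lines, block));
    }
}
```

Run against these lines, the set of live replicas ends up empty, which matches the observation that the block no longer exists anywhere in the cluster. Feeding it the full, unabridged log lines should work the same way, since the patterns only look at the tail of each entry.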