While I used hadoop 0.20.9-yahoo distribution and hbase 0.20.4
version,
I
found that the hadoop lose blocks under certain
situation, and thus corrupt hbase tables.
I compared namenode, datanode and hbase regionserver and figured out
the
reason.
The regionserver 10.110.8.85 asks namenode 10.110.8.83 to save a block,
10.110.8.84 gives
Multiple IP, regionserver choose 10.110.8.63 and save the block there.
After a while, namenode
Asks the bock to be replicated to 10.110.8..86 and 10.110.8..69
machines.
A moment late, .86. .69 received
The replication, but strangely, 10.110.8..59 10.110.8..85 also
received
replicaton of the block., even though it
Is not in the replication list.
Then the chooseExcessReplicates asks to delele excess from .63, .69 ,
thinking there are too
Many replica. Even though .63 was the original copy, the algorithm
choose
to delete block based
On the amount of empty disk. A moment later addToInvalidates ( not
from
chooseExcessReplicates ) asks
The block to be deleted on .86, .85, .59. I check the code, this can
only
be possible if
The block is corrupted.
In the end, this block doesn't exist anywhere in the cluster. And it is
permanently lost.
namenode:
2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst
em.chooseExcessReplicates: (10.110.8.63:50010,
blk_5636039758999247483_31304886)
is added to recentInvalidateSets
2010-05-18 21:21:43,995 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
ask
10.1
10.8.63:50010 to delete blk_5636039758999247483_31304886
blk_434931004890442915
7_31304519
the block was initially added to 10.110.8.63, then replicated to
10.110.8.63 59 69 86 85 . subsequently, replication process
addToInvalidates
removed all of them. the code review shows that the replicate is
corrupt,
and all get deleted.
2010-05-18 21:21:29,913 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst
em.addStoredBlock: blockMap updated: 10.110.8.63:50010 is added to
blk_563603975
8999247483_31304886 size 441
2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
ask
10.1
10.8.63:50010 to replicate blk_5636039758999247483_31304886 to
datanode(s)
10.11
0.8.86:50010 10.110.8.69:50010
2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst
em.chooseExcessReplicates: (10.110.8.63:50010,
blk_5636039758999247483_31304886)
is added to recentInvalidateSets
2010-05-18 21:21:43,995 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
ask
10.1
10.8.63:50010 to delete blk_5636039758999247483_31304886
blk_434931004890442915
7_31304519
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst
em.addToInvalidates: blk_5636039758999247483 is added to invalidSet of
10.110.8.
63:50010
2010-05-18 21:26:45,953 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
ask
10.1
10.8.63:50010 to delete blk_-1838286221287242082_31305179
blk_84467625641825134
17_31305143 blk_5636039758999247483_31304886
blk_4628640249731313760_31305046 bl
k_7460947863067370701_31270225 blk_-4468681536500281247_31270225
blk_84535177111
01429609_31303917 blk_9126133835045521966_31303972
blk_4623110280826973929_31305
203 blk_-2581238696314957800_31305033 blk_7461125351290749755_31305052
2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
ask
10.1
10.8.63:50010 to replicate blk_5636039758999247483_31304886 to
datanode(s)
10.11
0.8.86:50010 10.110.8.69:50010
2010-05-18 21:21:33,156 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst
em.addStoredBlock: blockMap updated: 10.110.8.69:50010 is added to
blk_563603975
8999247483_31304886 size 441
2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst
em.chooseExcessReplicates: (10.110.8.69:50010,
blk_5636039758999247483_31304886)
is added to recentInvalidateSets
2010-05-18 21:21:59,005 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
ask
10.1
10.8.69:50010 to delete blk_5636039758999247483_31304886
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst
em.addToInvalidates: blk_5636039758999247483 is added to invalidSet of
10.110.8.
69:50010
2010-05-18 21:26:42,951 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
ask
10.1
10.8.69:50010 to delete blk_-2124965527858346013_31270213
blk_-5027506345849158
498_31270213 blk_5636039758999247483_31304886
blk_9148821113904458973_31305189 b
lk_4850797749721229572_31305072 blk_252039065084461924_31305031
blk_-83518367280
09062091_31305208 blk_-7576696059515014894_31305194
blk_-2900250119736465962_312
70214 blk_471700613578524871_31304950 blk_-190744003190006044_31305064
blk_72650
57386742001625_31305073
2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
ask
10.
10.8.63:50010 to replicate blk_5636039758999247483_31304886 to
datanode(s)
10.1
0.8.86:50010 10.110.8.69:50010
2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSys
em.addStoredBlock: blockMap updated: 10.110.8.86:50010 is added to
blk_56360397
8999247483_31304886 size 441
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSys
em.addToInvalidates: blk_5636039758999247483 is added to invalidSet of
10.110.8
86:50010
2010-05-18 21:26:42,951 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
ask
10.
10.8.86:50010 to delete blk_-6242136662924452584_31259201
blk_5636039758999247
83_31304886 blk_4850797749721229572_31305072
blk_252039065084461924_31305031
bl
_-1317144678443645904_31305204 blk_6050185755706975664_31270230
blk_26714169718
5801868_31304948 blk_-5582352089328547938_31305022
blk_-3115115738671914626_312
0210
2010-05-18 21:21:34,413 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst
em.addStoredBlock: blockMap updated: 10.110.8.59:50010 is added to
blk_563603975
8999247483_31304886 size 441
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst
em.addToInvalidates: blk_5636039758999247483 is added to invalidSet of
10.110.8.
59:50010
2010-05-18 21:26:39,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
ask
10.1
10.8.59:50010 to delete blk_5636039758999247483_31304886
blk_-45285121566353996
25_31305212 blk_1439789418382469336_31305158
blk_8860574934531794641_31270219 bl
k_-8358193301564392132_31305029
2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst
em.addStoredBlock: blockMap updated: 10.110.8.85:50010 is added to
blk_563603975
8999247483_31304886 size 441
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst
em.addToInvalidates: blk_5636039758999247483 is added to invalidSet of
10.110.8.
85:50010
2010-05-18 21:26:39,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
ask
10.1
10.8.85:50010 to delete blk_-6242136662924452584_31259201
blk_56360397589992474
83_31304886 blk_4628640249731313760_31305046
blk_4747588241975451642_31305123 bl
k_-6876078628884993825_31270230 blk_-4468681536500281247_31270225
blk_7325830193
509411302_31270230 blk_8453517711101429609_31303917
blk_-6094734447689285387_313
05127 blk_3353439739797003235_31305037
blk_-5027506345849158498_31270213
blk_148
4161645992497144_31270225 blk_4464987648045469454_31305144
blk_74609478630673707
01_31270225 blk_-1170815606945644545_31270230
blk_6050185755706975664_31270230 b
lk_-8358193301564392132_31305029 blk_2671416971885801868_31304948
blk_5593547375
459437465_31286511 blk_-2581238696314957800_31305033
blk_4732635559915402193_312
70230 blk_-2124965527858346013_31270213
blk_-5837992573431863412_31286612
blk_-4
32558447034944954_31270208 blk_-3407615138527189735_31305069
blk_886057493453179
4641_31270219 blk_233110856487529716_31270229
blk_312750273180273303_31270228 bl
k_7461125351290749755_31305052 blk_-8902661185532055148_31304947
blk_-8555258258
738129670_31270210 blk_252039065084461924_31305031
blk_9037118763503479133_31305
120 blk_-8494656323754369174_31305105 blk_9126133835045521966_31303972
blk_-5582
352089328547938_31305022 blk_-2900250119736465962_31270214
blk_-3115115738671914
626_31270210 blk_7612090442234634555_31270225
blk_5876492007747505188_31270213 b
lk_471700613578524871_31304950 blk_-190744003190006044_31305064
datanode 10.110.8.63:
2010-05-18 21:21:46,058 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: De
leting block blk_5636039758999247483_31304886 file
/hadoop_data_dir/dfs/data/cur
rent/subdir23/blk_5636039758999247483
hbase region server 10.110.8.85:
DFSClient.java,
DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
InetSocketAddress targetAddr =
NetUtils.createSocketAddr(chosenNode.getName());
return new DNAddrPair(chosenNode, targetAddr);
still picked the 10.110.8.63 even though the command is sent from name
node
21:21:43,995
to delete the block, and it was executed at 21:21:46,058 . ?
2010-05-18 21:21:46,188 WARN org.apache.hadoop.hdfs.DFSClient: Failed
to
connect
to /10.110.8.63:50010 for file
/hbase/.META./1028785192/info/656097411976846533
for block 5636039758999247483:java.io.IOException: Got error in
response
to
OP_
READ_BLOCK for file /hbase/.META./1028785192/info/656097411976846533
for
block 5
636039758999247483