[ https://issues.apache.org/jira/browse/HDFS-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz Wo Nicholas Sze resolved HDFS-1336.
---------------------------------------
    Resolution: Not a Problem

I guess that this is not a problem anymore. Please feel free to reopen this if I am wrong. Resolving ...

> TruncateBlock does not update in-memory information correctly
> -------------------------------------------------------------
>
>                 Key: HDFS-1336
>                 URL: https://issues.apache.org/jira/browse/HDFS-1336
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 0.20-append
>            Reporter: Thanh Do
>
> - Component: data node
> - Version: 0.20-append
>
> - Summary: we found a case where, if a block is truncated during updateBlock,
>   the length recorded in ongoingCreates is not updated, which leads to a
>   failed append.
>
> - Setup:
>   # disks / datanode = 3
>   # failures = 2
>   failure type = crash
>   when/where failure happens = (see below)
>
> - Details:
> 1) The client writes to dn1-dn2-dn3. The write succeeds.
> 2) The client then tries to append. It first calls dn1.recoverBlock(), and
>    this recoverBlock succeeds.
> 3) Suppose the pipeline is dn3-dn2-dn1. The client sends a packet to dn3.
>    dn3 forwards the packet to dn2 and writes it to its own disk (i.e. dn3's
>    disk). Now *dn2 crashes*, so dn1 has not received this packet yet.
> 4) The client calls dn1.recoverBlock() again, this time with dn3-dn1 in the
>    pipeline. dn1 then calls dn3.startBlockRecovery() to terminate the writer
>    thread on dn3, get the *in-memory* metadata of the block, and verify that
>    info against the actual file on disk.
>    dn3 maintains an in-memory data structure called *ongoingCreates* that
>    records information about blocks currently being created. Once a block is
>    finalized, its info is removed from *ongoingCreates*.
>
>    Now suppose that, at the time dn3 receives the startBlockRecovery()
>    request from dn1, it has:
>    + finished writing the data to disk (so the block length on disk is 1024)
>    + set the visible in-memory length (so the in-memory length is also 1024)
>    but it *has not* finalized the block, so the block info is still in
>    *ongoingCreates*.
>    (Note: interrupting the writer thread means the finalization never
>    happens.)
>
>    Because of all of the above, dn3 reports the block to dn1 with length 1024.
>
> 5) Now dn1 calls its own startBlockRecovery() successfully (because its
>    on-disk file length and in-memory length match; both are 512 bytes).
>
> 6) dn1 now has a sync list (block_X_GS1 at dn1 with length 512, block_X_GS1
>    at dn3 with length 1024). It needs to make sure all datanodes in the
>    pipeline agree on the new GS and length. dn1 calls NN.nextGS() to get the
>    new GS2, forms the new block_X_GS2 with length 512, and calls updateBlock
>    on dn3 and on itself.
>
> 7) dn3, on receiving the updateBlock request from dn1:
>    + renames the block from block_X_GS1 ==> block_X_GS2
>    + truncates the block file from length 1024 to 512
>      But, and here is the key, it *does not update the length of the block
>      kept in ongoingCreates*.
>    + returns to dn1 successfully
>
> 8) Now dn1 calls its own updateBlock and *crashes*.
>
> 9) From the client's point of view, dn1.recoverBlock has failed. The client
>    retries dn1.recoverBlock six times and then declares dn1 bad.
>
> 10) The client now calls dn3.recoverBlock().
>
> 11) dn3 in turn calls its own startBlockRecovery() to
>     + interrupt the block writer thread, if any
>     + getBlockMetadataInfo (as part of forming the syncList and the later
>       updateBlock)
>     It first looks in ongoingCreates to see whether the block info is there,
>     and finds it (because the block was never finalized). Hence the in-memory
>     length is 1024 (even though truncateBlock was called earlier). It then
>     checks the in-memory length (1024) against the on-disk length (512), and
>     throws the *unmatched file length exception*.
>
> 12) From the client's point of view, recoverBlock fails because *all
>     datanodes are bad*. The client retries dn3.recoverBlock five more times,
>     gets the same exception each time, and so the append fails.
>
> Note:
> - To fix this, I think that when we truncate the file we also need to update
>   ongoingCreates (though I am not sure whether fixing it this way could
>   affect other workloads); a sketch of this idea appears below the quoted
>   report.
> - Interestingly, NN.leaseRecovery also fails because of exactly the same
>   exception at dn3.
> - Until the dead node restarts and NN.leaseRecovery is triggered again, the
>   NN is still the lease holder of the file.
>
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
> Haryadi Gunawi (hary...@eecs.berkeley.edu)
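
For illustration, here is a minimal, self-contained Java sketch of the bookkeeping problem described in steps 7 and 11 and of the fix direction suggested in the note. The names used here (OngoingCreatesSketch, ActiveBlockInfo, updateBlockBuggy, updateBlockFixed, startBlockRecovery) are assumptions made for this sketch; this is not the actual FSDataset code from branch-0.20-append.

// Illustrative sketch only -- NOT the real FSDataset code. It models the
// problem from steps 7 and 11: the on-disk block file is truncated during
// updateBlock, but the length cached in the in-memory "ongoingCreates" map
// keeps its old value, so the later recovery-time consistency check fails.
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

class OngoingCreatesSketch {

    /** Minimal stand-in for the per-block info kept while a block is being written. */
    static class ActiveBlockInfo {
        long visibleLength;   // in-memory length seen later by block recovery
        ActiveBlockInfo(long len) { this.visibleLength = len; }
    }

    /** Stand-in for the ongoingCreates map (block id -> in-flight block info). */
    private final Map<Long, ActiveBlockInfo> ongoingCreates =
        new HashMap<Long, ActiveBlockInfo>();

    /** Buggy behaviour from step 7: truncate the file but leave ongoingCreates stale. */
    void updateBlockBuggy(long blockId, File blockFile, long newLength) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(blockFile, "rw")) {
            raf.setLength(newLength);        // on-disk length becomes 512
        }
        // BUG: ongoingCreates.get(blockId).visibleLength still says 1024
    }

    /** Fix suggested in the note: keep the in-memory length in sync with the truncation. */
    void updateBlockFixed(long blockId, File blockFile, long newLength) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(blockFile, "rw")) {
            raf.setLength(newLength);
        }
        ActiveBlockInfo info = ongoingCreates.get(blockId);
        if (info != null) {                  // block not yet finalized
            info.visibleLength = newLength;  // in-memory length now matches disk
        }
    }

    /** Recovery-time check from step 11: in-memory length must match the file on disk. */
    long startBlockRecovery(long blockId, File blockFile) throws IOException {
        ActiveBlockInfo info = ongoingCreates.get(blockId);
        long memLength = (info != null) ? info.visibleLength : blockFile.length();
        if (memLength != blockFile.length()) {
            // This is the "unmatched file length" failure dn3 hits in step 11.
            throw new IOException("Block " + blockId + " length in memory " + memLength
                + " does not match on-disk length " + blockFile.length());
        }
        return memLength;
    }
}

The point of the fixed variant is simply that any code path that shortens the on-disk file of an unfinalized block also refreshes the length cached in ongoingCreates, so a later startBlockRecovery() sees consistent values.
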
--
This message was sent by Atlassian JIRA
(v6.2#6252)