Can you look at the DN in question and see whether the block was successfully finalized when the write finished? It doesn't sound like a successful write -- a finalized replica should have moved out of the bbw directory into current/.
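Something like the following sketch should tell you which directory the replica landed in. The paths and block ID are illustrative (substitute your actual dfs.data.dir and the block ID from the NN log or fsck output); the layout assumed here is the 0.20-append one, where finalized replicas live under current/ and in-flight ones under blocksBeingWritten/:

```shell
#!/bin/sh
# Report which DN subdirectory (current/ vs blocksBeingWritten/) holds the
# block and meta files for a given block ID. Layout assumed: 0.20-append.
locate_replica() {
  data_dir="$1"   # e.g. /data/1/dfs/data  (your dfs.data.dir -- illustrative)
  block_id="$2"   # e.g. blk_1234567890123456789 (from the NN log -- illustrative)
  for sub in current blocksBeingWritten; do
    # Matches both the block file (blk_N) and its meta file (blk_N_GS.meta)
    found=$(find "$data_dir/$sub" -name "${block_id}*" 2>/dev/null)
    [ -n "$found" ] && echo "$sub: $found"
  done
}

# Example usage (replace with your real data dir and block ID):
# locate_replica /data/1/dfs/data blk_1234567890123456789
```

If the block and meta files show up only under blocksBeingWritten/, the finalize never happened (or got rolled back) on that DN.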
-Todd

On Tue, Nov 22, 2011 at 3:16 AM, Uma Maheswara Rao G <mahesw...@huawei.com> wrote:
> Hi All,
>
> I have backported HDFS-1779 to our Hadoop version, which is based on the
> 0.20-append branch.
>
> We are running a load test, as usual. (We want to ensure the reliability of
> the system under heavy loads.)
> My cluster has 8 DataNodes and a NameNode.
> Each machine has 16 CPUs and 12 hard disks, each with 2TB capacity.
> Clients run alongside the DataNodes.
> Clients upload tar files, each containing 3-4 blocks, from 50 threads.
> The block size is 256MB; the replication factor is 3.
>
> Everything looks fine under a normal load.
> When the load is increased, a lot of errors occur.
> Many pipeline failures happen as well.
> All of this is fine, except for the strange case of a few blocks.
>
> Some blocks (around 30) are missing (the FSCK report shows this).
> When I try to read those files, the read fails saying there are no
> DataNodes for the block.
> Analysing the logs, we found that for these blocks pipeline recovery
> happened and the write succeeded to a single DataNode.
> The DataNode also reported the block to the NameNode in a blockReceived
> command.
> After some time (say, 30 minutes), the DataNode was restarted.
> In the BBW (blocksBeingWritten) report sent by the DN immediately after
> the restart, these finalized blocks are also included. (This shows that
> these blocks are in the blocksBeingWritten folder.)
> In many of the cases, the generation timestamp reported in the BBW report
> is the old timestamp.
>
> The NameNode rejects those blocks in the BBW report, saying the file is
> already closed.
> The NameNode also asks the DataNode to invalidate the blocks, and the
> DataNode does so.
> When deleting the blocks, it also prints the path from the
> blocksBeingWritten directory (and the previous generation timestamp).
>
> This looks very strange to me.
> Does this mean that the finalized block file and meta file (which were
> written in the current folder) are getting lost after the DN restart?
> That would explain why the NameNode does not receive these blocks'
> information in the block report sent from the DataNodes.
>
> Regards,
> Uma

--
Todd Lipcon
Software Engineer, Cloudera