Can you look at the DN in question and see whether the block was
successfully finalized when the write finished? It doesn't sound like a
successful write -- a successful finalize should have moved it out of the
bbw directory into current/
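For reference, here's a minimal shell sketch of how one might check that on the DN. It assumes the 0.20-append on-disk layout, where finalized replicas live under <dfs.data.dir>/current and in-flight ones under <dfs.data.dir>/blocksBeingWritten; the data dir and block id are stand-ins, demoed below against a throwaway directory rather than a real cluster:

```shell
# Sketch: locate a block's replica files on a DataNode's local disk.
# Assumes the 0.20-append layout (current/ vs. blocksBeingWritten/);
# data_dir and block_id are placeholders for your cluster's values.
locate_block() {
  data_dir="$1"
  block_id="$2"
  # Prints any blk_<id>* files (block + .meta) found in either directory.
  find "$data_dir/current" "$data_dir/blocksBeingWritten" \
       -name "blk_${block_id}*" 2>/dev/null
}

# Demo against a throwaway layout (substitute a real dfs.data.dir on the DN):
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/current" "$DATA_DIR/blocksBeingWritten"
touch "$DATA_DIR/blocksBeingWritten/blk_1234" \
      "$DATA_DIR/blocksBeingWritten/blk_1234_1001.meta"
locate_block "$DATA_DIR" 1234
```

If the paths it prints are under blocksBeingWritten/ rather than current/, the replica was never finalized.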

-Todd

On Tue, Nov 22, 2011 at 3:16 AM, Uma Maheswara Rao G
<mahesw...@huawei.com> wrote:
> Hi All,
>
>
>
> I have backported HDFS-1779 to our Hadoop version, which is based on the
> 0.20-Append branch.
>
> We are running a load test, as usual. (We want to ensure the reliability of
> the system under heavy load.)
> The cluster has 8 DataNodes and a Namenode.
> Each machine has 16 CPUs and 12 hard disks, each with 2TB capacity.
> Clients run alongside the Datanodes.
> 50 client threads upload tar files of 3-4 blocks each.
> The block size is 256MB and the replication factor is 3.
>
> Everything looks fine under normal load.
> When the load is increased, a lot of errors occur,
> including many pipeline failures.
> All of these are expected, except for the strange case of a few blocks.
>
> Some blocks (around 30) are missing, according to the FSCK report.
> When I try to read those files, the read fails, saying there are no
> Datanodes for the block.
> Analysing the logs, we found that for these blocks a pipeline recovery
> happened and the write succeeded on a single Datanode.
> The Datanode also reported the block to the Namenode via blockReceived.
> Some time later (say, 30 minutes), the Datanode was restarted.
> The BBW (BlocksBeingWritten) report sent by the DN immediately after the
> restart includes these finalized blocks, showing that they are still in
> the blocksBeingWritten folder.
> In many cases, the generation timestamp reported in the BBW report is the
> old one.
>
> The Namenode rejects those blocks in the BBW report, saying the file is
> already closed.
> The Namenode then asks the Datanode to invalidate the blocks, and the
> Datanode does so.
> When deleting the blocks, the Datanode also prints the path from the
> BlocksBeingWritten directory (with the previous generation timestamp).
>
> This looks very strange to me.
> Does it mean that the finalized block file & meta file (which were written
> to the current folder) are getting lost after the DN restart?
> That would explain why the Namenode does not receive these blocks'
> information in the BLOCK REPORT sent by the Datanodes.
>
>
>
>
>
> Regards,
>
> Uma
>
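On the generation timestamp point above: in the 0.20-era naming scheme the
genstamp is encoded in the meta file name as blk_<blockId>_<genStamp>.meta,
so it can be pulled straight out of a directory listing when comparing what
the DN has on disk against what it reported. A small sketch, assuming that
naming convention (the file name below is made up for illustration):

```shell
# Extract the generation stamp from a replica meta file name of the form
# blk_<blockId>_<genStamp>.meta (0.20-era layout; assumption, verify on
# your version).
genstamp_of() {
  # Strip the .meta suffix, then take the third underscore-separated field.
  basename "$1" .meta | awk -F_ '{print $3}'
}

genstamp_of blk_1234_1001.meta
```

Comparing that value per block against the genstamp in the NN log lines
would confirm whether the BBW report really carries the pre-recovery stamp.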



-- 
Todd Lipcon
Software Engineer, Cloudera
