Blocks that are under construction are not read if the file has more than 10 
blocks. Only complete blocks are read properly. 
--------------------------------------------------------------------------------------------------------------------------------

                 Key: HDFS-1950
                 URL: https://issues.apache.org/jira/browse/HDFS-1950
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs client, name-node
    Affects Versions: 0.20-append
            Reporter: ramkrishna.s.vasudevan
             Fix For: 0.20-append


Before going to the root cause, let's look at the read behavior for a file 
with more than 10 blocks in the append case. 
Logic: 
==== 
DFSInputStream has a prefetch size, dfs.read.prefetch.size, which by default 
covers 10 blocks. 
This prefetch size determines how many block locations the client fetches from 
the namenode at a time when reading a file. 
For example, assume that a file X with 22 blocks resides in HDFS: 
1. The reader fetches the first 10 blocks from the namenode and starts reading. 
2. After that, the reader fetches the next 10 blocks from the NN and continues 
   reading. 
3. Then the reader fetches the remaining 2 blocks from the NN and completes the 
   read. 
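
The batching in the example can be sketched as a small simulation. This is only 
an illustration, not DFSClient code: the class PrefetchSketch and the method 
batchSizes are hypothetical names, and the prefetch window is expressed as a 
block count for simplicity. 

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of a reader pulling block locations in batches.
public class PrefetchSketch {
    // Returns how many block locations each successive fetch obtains.
    // Assumes prefetchBlocks > 0.
    static List<Integer> batchSizes(int totalBlocks, int prefetchBlocks) {
        List<Integer> batches = new ArrayList<>();
        for (int start = 0; start < totalBlocks; start += prefetchBlocks) {
            batches.add(Math.min(prefetchBlocks, totalBlocks - start));
        }
        return batches;
    }

    public static void main(String[] args) {
        // File X with 22 blocks and a prefetch window of 10 blocks.
        System.out.println(batchSizes(22, 10)); // prints [10, 10, 2]
    }
}
```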
Cause: 
======= 
Let's see the cause of this issue now. 
The failing scenario is: "The writer wrote 10+ blocks and a partial block, then 
called sync. A reader trying to read the file will not get the last partial 
block." 

1. The client first gets the first 10 block locations from the NN. It then 
   checks whether the file is under construction and, if so, gets the size of 
   the last partial block from the datanode and reads the full file. 
2. However, when the number of blocks is more than 10, the last block is not in 
   the first fetch. It arrives in the second or a later fetch (the fetch 
   numbered (num of blocks / 10), rounded up). 
3. The problem is that DFSClient has no logic to get the size of the last 
   partial block (as it does in point 1) for any fetch other than the first, so 
   the reader cannot read all of the synced data. 
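
The effect of the missing logic can be modelled with a small sketch. Everything 
here is hypothetical (PartialBlockSketch, namenodeLengths, readableBytes and 
the fixAllFetches flag are illustration names, not DFSClient APIs), and it 
assumes the namenode reports no usable length for the block still being 
written. 

```java
import java.util.Arrays;

// Hypothetical model of the reported behavior; not taken from DFSClient.
public class PartialBlockSketch {
    // Block lengths as seen from the namenode. Assumption: the last,
    // under-construction block is reported with length 0.
    static long[] namenodeLengths(int completeBlocks, long blockSize) {
        long[] lengths = new long[completeBlocks + 1];
        Arrays.fill(lengths, 0, completeBlocks, blockSize);
        return lengths; // last entry stays 0
    }

    // Bytes visible to the reader. The last block's length is corrected from
    // the datanode only when that block is in the first fetch, unless
    // fixAllFetches is true (the behavior the issue asks for).
    static long readableBytes(long[] nnLengths, long syncedPartial,
                              int prefetchBlocks, boolean fixAllFetches) {
        long total = 0;
        int last = nnLengths.length - 1;
        for (int i = 0; i < nnLengths.length; i++) {
            boolean inFirstFetch = i < prefetchBlocks;
            if (i == last && (inFirstFetch || fixAllFetches)) {
                total += syncedPartial; // length obtained from the datanode
            } else {
                total += nnLengths[i];  // length as reported by the namenode
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // 12 complete blocks of 64 bytes plus 100 synced bytes in a partial block.
        long[] lengths = namenodeLengths(12, 64);
        System.out.println(readableBytes(lengths, 100, 10, false)); // prints 768
        System.out.println(readableBytes(lengths, 100, 10, true));  // prints 868
    }
}
```

With fewer than 10 blocks the partial block falls inside the first fetch, so 
both calls return the same value; the difference only appears once the last 
block lands in a later fetch. 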

Also, the InputStream.available API iterates using the size of the blocks from 
the first fetch only. Ideally, this size should be increased. 




--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
