shanyu zhao created HADOOP-15320:
------------------------------------

             Summary: Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake
                 Key: HADOOP-15320
                 URL: https://issues.apache.org/jira/browse/HADOOP-15320
             Project: Hadoop Common
          Issue Type: Bug
          Components: fs/adl, fs/azure
    Affects Versions: 3.0.0, 2.9.0, 2.7.3
            Reporter: shanyu zhao
            Assignee: shanyu zhao


hadoop-azure and hadoop-azure-datalake each have their own implementation of 
getFileBlockLocations(), which fabricates a list of artificial blocks based on 
a hard-coded block size, with every block reporting a single host named 
"localhost". Take a look at this code:

[https://github.com/apache/hadoop/blob/release-2.9.0-RC3/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeAzureFileSystem.java#L3485]
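Roughly, the customized implementation follows this pattern (a minimal sketch based on the description above, not the actual NativeAzureFileSystem code; the class and method names here are illustrative):
{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.BlockLocation;

// Minimal sketch of the faked-blocks pattern: carve the file length into
// fixed-size artificial blocks, each claiming the single host "localhost".
// (Illustrative only; not the actual NativeAzureFileSystem code.)
public class FakedBlockLocations {
  public static BlockLocation[] fakeBlockLocations(long start, long len,
      long blockSize) {
    String[] name = { "localhost" };
    String[] host = { "localhost" };
    List<BlockLocation> locations = new ArrayList<>();
    for (long offset = start; offset < start + len; offset += blockSize) {
      long length = Math.min(blockSize, start + len - offset);
      locations.add(new BlockLocation(name, host, offset, length));
    }
    // For a ~TB file this list holds thousands of artificial blocks,
    // which is what makes FileInputFormat.getSplits() slow downstream.
    return locations.toArray(new BlockLocation[0]);
  }
}
{code}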

This is an unnecessary mock-up for a "remote" file system to mimic HDFS. The 
problem with this mock is that for large (~TB) files we generate lots of 
artificial blocks, and FileInputFormat.getSplits() is slow at calculating 
splits from these blocks.

We can safely remove this customized getFileBlockLocations() implementation 
and fall back to the default FileSystem.getFileBlockLocations() implementation, 
which returns a single block for any file, with one host, "localhost". Note that 
this doesn't mean we will create far fewer splits, because the number of splits 
is still limited by the blockSize in FileInputFormat.computeSplitSize():
{code:java}
return Math.max(minSize, Math.min(goalSize, blockSize));
{code}
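For example, with hypothetical numbers (a 1 TB file and a 256 MB blockSize; the minSize and goalSize values below are illustrative, only the formula itself comes from FileInputFormat), the split count stays in the thousands either way:
{code:java}
// Hypothetical numbers illustrating computeSplitSize(); only the formula
// itself is from FileInputFormat.
long minSize   = 1L;                      // split minsize default
long blockSize = 256L * 1024 * 1024;      // 256 MB block size
long totalSize = 1L << 40;                // 1 TB file
long goalSize  = totalSize / 10;          // totalSize / (suggested numSplits)
long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));
// splitSize == 256 MB, so the 1 TB file is still divided into ~4096
// splits whether getFileBlockLocations() reported 1 block or thousands.
{code}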


