Ryan Blue created HADOOP-12810:
----------------------------------

             Summary: FileSystem#listLocatedStatus causes unnecessary RPC calls
                 Key: HADOOP-12810
                 URL: https://issues.apache.org/jira/browse/HADOOP-12810
             Project: Hadoop Common
          Issue Type: Bug
          Components: fs, fs/s3
    Affects Versions: 2.7.2
            Reporter: Ryan Blue
            Assignee: Ryan Blue

{{FileSystem#listLocatedStatus}} lists the files in a directory and then calls {{getFileBlockLocations(stat.getPath(), ...)}} for each file instead of {{getFileBlockLocations(stat, ...)}}. The overload that takes a path simply calls {{getFileStatus}} to fetch the file status again and then delegates to the overload that takes a file status, so every listed file incurs an unnecessary {{getFileStatus}} call.

This is particularly bad for S3, where {{getFileStatus}} is expensive. Avoiding the extra call improved input split calculation time for a data set in S3 by ~20x: from 10 minutes to 25 seconds.
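
To illustrate the call pattern described above, here is a minimal, self-contained sketch. It does not use Hadoop's classes: {{Status}} and {{CountingFs}} are hypothetical stand-ins, and the counter simply models each {{getFileStatus}} call as one remote round trip. The real {{FileSystem}} code is more involved; this only shows why passing the already-listed status avoids one lookup per file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ListLocatedStatusSketch {

    /** Hypothetical stand-in for a file status: just a path and a length. */
    static final class Status {
        final String path;
        final long length;

        Status(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Toy file system that counts status lookups; each one models a remote call. */
    static final class CountingFs {
        private final Map<String, Status> files = new LinkedHashMap<>();
        int getFileStatusCalls = 0;

        void add(String path, long length) {
            files.put(path, new Status(path, length));
        }

        /** Listing already returns a full status for every file. */
        List<Status> listStatus() {
            return new ArrayList<>(files.values());
        }

        Status getFileStatus(String path) {
            getFileStatusCalls++;
            return files.get(path);
        }

        /** Path overload: fetches the status again, then delegates. */
        String getFileBlockLocations(String path, long start, long len) {
            return getFileBlockLocations(getFileStatus(path), start, len);
        }

        /** Status overload: no extra lookup is needed. */
        String getFileBlockLocations(Status stat, long start, long len) {
            return stat.path + "[" + start + "," + (start + len) + ")";
        }
    }

    /** Pattern reported in this issue: pass the path, triggering a second lookup per file. */
    static int lookupsViaPath(CountingFs fs) {
        for (Status stat : fs.listStatus()) {
            fs.getFileBlockLocations(stat.path, 0, stat.length);
        }
        return fs.getFileStatusCalls;
    }

    /** Proposed pattern: reuse the status that the listing already returned. */
    static int lookupsViaStatus(CountingFs fs) {
        for (Status stat : fs.listStatus()) {
            fs.getFileBlockLocations(stat, 0, stat.length);
        }
        return fs.getFileStatusCalls;
    }

    /** Builds a toy file system with n files (paths and sizes are made up). */
    static CountingFs sampleFs(int n) {
        CountingFs fs = new CountingFs();
        for (int i = 0; i < n; i++) {
            fs.add("/data/part-" + i, 128L * (i + 1));
        }
        return fs;
    }

    public static void main(String[] args) {
        System.out.println("path overload:   " + lookupsViaPath(sampleFs(3)) + " extra lookups");
        System.out.println("status overload: " + lookupsViaStatus(sampleFs(3)) + " extra lookups");
    }
}
```

In this toy model the path overload costs one extra lookup per listed file while the status overload costs none, which is the difference the fix targets.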