[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423120#comment-15423120
 ] 

Abdullah Yousufi commented on HIVE-14165:
-----------------------------------------

It calls FileSystem.java#listStatus(Path p, PathFilter filter). And that's
correct: it verifies that there is at least one FileStatus under the current
path, and only then begins determining splits, primarily by calling
InputFormat#getSplits(JobConf job, int numSplits). But
FileInputFormat#getSplits(JobContext job) is going to call listStatus() anyway,
so the same path ends up being listed twice.

When I remove this listing, I get a 2x speedup on an S3 table with 500
partitions. Could FileInputFormat#getSplits(job) be modified to short-circuit
and throw a FileNotFoundException when the path does not exist or contains 0
files, so that Hive could catch it and continue?
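
To make the proposal concrete, here is a rough sketch of the catch-and-continue
pattern. It is not Hive or Hadoop code: it uses java.nio.file on a local
directory as a stand-in for the S3 listing, and the class and method names
(SplitListingSketch, getSplits, collectSplits) are hypothetical, chosen only to
mirror the roles described above.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitListingSketch {

    // Hypothetical stand-in for FileInputFormat#getSplits: lists the path
    // exactly once and reports "missing" or "empty" via FileNotFoundException.
    static List<Path> getSplits(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        } catch (NoSuchFileException e) {
            throw new FileNotFoundException("Path does not exist: " + dir);
        }
        if (files.isEmpty()) {
            throw new FileNotFoundException("No input files under: " + dir);
        }
        return files;
    }

    // Hypothetical caller playing Hive's role: no up-front listing per
    // partition; a missing or empty partition is skipped when the single
    // listing inside getSplits fails.
    static List<Path> collectSplits(List<Path> partitionDirs) throws IOException {
        List<Path> all = new ArrayList<>();
        for (Path dir : partitionDirs) {
            try {
                all.addAll(getSplits(dir));
            } catch (FileNotFoundException e) {
                // Nothing to read for this partition; continue with the next.
            }
        }
        return all;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("partitions");
        Path full = Files.createDirectory(root.resolve("p1"));
        Files.createFile(full.resolve("data-0"));
        Path empty = Files.createDirectory(root.resolve("p2"));
        Path missing = root.resolve("p3"); // never created

        List<Path> splits = collectSplits(Arrays.asList(full, empty, missing));
        System.out.println(splits.size() + " split(s) found");
    }
}
```

With one populated, one empty, and one non-existent partition directory, only
the populated one contributes, and each directory is listed at most once.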

> Enable faster S3 Split Computation
> ----------------------------------
>
>                 Key: HIVE-14165
>                 URL: https://issues.apache.org/jira/browse/HIVE-14165
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Abdullah Yousufi
>            Assignee: Abdullah Yousufi
>
> Split size computation may be improved by the optimizations for listFiles()
> in HADOOP-13208.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
