[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426846#comment-15426846 ]
Abdullah Yousufi commented on HIVE-14165:
-----------------------------------------

I believe that when Hive calls getSplits() it is actually using {{org.apache.hadoop.mapred.FileInputFormat}}. Also, is the updated listStatus faster in the non-recursive case as well? If not, I don't think it makes sense to pass the recursive flag as true: Hive is only interested in the files at the top level of the path, since it currently calls getSplits() for each partition.

However, if Hive were changed to call getSplits() on the root directory in the partitioned case, then listStatus(recursive) would make sense. I decided against that change because I was not sure how best to handle partition elimination. For example, if a query selects a single partition from a table, doing listStatus(recursive) on the root directory would be slower than just doing a listStatus on that single partition.

Also, Qubole mentions the following, which may be something to pursue in the future:
{quote}
"we modified split computation to invoke listing at the level of the parent directory. This call returns all files (and their sizes) in all subdirectories in blocks of 1000. Some subdirectories and files may not be of interest to the job/query, e.g. partition elimination may have eliminated some of them. We take advantage of the fact that file listing is in lexicographic order and perform a modified merge join of the list of files and list of directories of interest."
{quote}

When you mentioned earlier that Hadoop grabs 5000 objects at a time, does that include files in subdirectories?
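The merge join Qubole describes can be sketched in plain Java. This is a hypothetical illustration, not Qubole's or Hive's actual code: given the lexicographically ordered recursive file listing and a sorted list of partition directories that survived partition elimination, a single merge pass keeps only the files the query needs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeJoinListing {

    // Sketch of a merge join over two lexicographically sorted inputs:
    // sortedFiles  - every file path under the table root, as returned by
    //                a recursive listing in lexicographic order
    // sortedDirs   - the partition directories of interest (with trailing
    //                separators), also sorted
    static List<String> filesOfInterest(List<String> sortedFiles, List<String> sortedDirs) {
        List<String> result = new ArrayList<>();
        int d = 0;
        for (String file : sortedFiles) {
            // Skip directories that sort entirely before this file;
            // they cannot contain it or any later file in the listing.
            while (d < sortedDirs.size()
                    && !file.startsWith(sortedDirs.get(d))
                    && sortedDirs.get(d).compareTo(file) < 0) {
                d++;
            }
            if (d == sortedDirs.size()) {
                break; // no directories of interest remain
            }
            if (file.startsWith(sortedDirs.get(d))) {
                result.add(file); // file lives in a surviving partition
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "tbl/ds=2016-07-01/part-0000",
                "tbl/ds=2016-07-01/part-0001",
                "tbl/ds=2016-07-02/part-0000",
                "tbl/ds=2016-07-03/part-0000");
        // Pretend partition elimination kept only the first and last partition.
        List<String> dirs = Arrays.asList("tbl/ds=2016-07-01/", "tbl/ds=2016-07-03/");
        System.out.println(filesOfInterest(files, dirs));
    }
}
```

Because both lists are consumed in order, the join is linear in the size of the listing, which is what makes a single recursive listing of the parent directory competitive even when only some partitions are of interest.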
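For reference, the approach from the issue description below — dropping Hive's own pre-listing and instead catching the exception getSplits() throws for a missing or empty input path — might look roughly like this. This is a hedged sketch against the {{org.apache.hadoop.mapred}} API, not the actual HIVE-14165 patch; the helper name is invented.

```java
import java.io.IOException;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.InvalidInputException;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical helper: let FileInputFormat do the listing during split
// computation, and treat a nonexistent/unmatched input path as "no splits"
// instead of detecting it with a separate listStatus beforehand.
static InputSplit[] splitsOrEmpty(FileInputFormat<?, ?> format, JobConf job, int numSplits)
        throws IOException {
    try {
        return format.getSplits(job, numSplits);
    } catch (InvalidInputException e) {
        // The path did not exist or matched no files; previously Hive
        // discovered this via its own file listing before getSplits().
        return new InputSplit[0];
    }
}
```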
> Remove Hive file listing during split computation
> -------------------------------------------------
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
> Issue Type: Sub-task
> Affects Versions: 2.1.0
> Reporter: Abdullah Yousufi
> Assignee: Abdullah Yousufi
> Attachments: HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's
> FileInputFormat.java will list the files during split computation anyway to
> determine their size. One way to remove this is to catch the
> InvalidInputException thrown by FileInputFormat#getSplits() on the
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)