[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426846#comment-15426846
 ] 

Abdullah Yousufi commented on HIVE-14165:
-----------------------------------------

I believe that when Hive calls getSplits() it is actually using 
{code}org.apache.hadoop.mapred.FileInputFormat{code}. Also, is the updated 
listStatus faster in the non-recursive case as well? If it is not, I don't 
think it makes sense to pass the recursive flag as true, since Hive currently 
calls getSplits() for each partition and is therefore only interested in the 
files at the top level of each path.
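To illustrate the point about the recursive flag (a minimal sketch using plain java.nio rather than Hadoop's actual FileSystem/FileInputFormat API, with hypothetical method names): when each partition directory is listed separately, a top-level listing already returns every file of interest, and a recursive walk only adds work.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListingDepth {
    // Top-level listing: all that is needed when getSplits() is called
    // once per partition directory.
    static List<String> listTopLevel(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Recursive listing: also descends into subdirectories, which is
    // wasted effort if only the top-level files matter.
    static List<String> listRecursive(Path dir) throws IOException {
        try (Stream<Path> s = Files.walk(dir)) {
            return s.filter(Files::isRegularFile)
                    .map(p -> dir.relativize(p).toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path partition = Files.createTempDirectory("partition");
        Files.createFile(partition.resolve("000000_0"));
        Files.createDirectory(partition.resolve("sub"));
        Files.createFile(partition.resolve("sub").resolve("000001_0"));

        System.out.println(listTopLevel(partition));  // only the partition's own files
        System.out.println(listRecursive(partition)); // also files under sub/
    }
}
```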

However, if Hive were changed to call getSplits() on the root directory in the 
partitioned case, then listStatus(recursive) would make sense. I decided 
against that change because I was not sure how best to handle partition 
elimination. For example, if a query selects a single partition from a table, 
doing listStatus(recursive) on the root directory would be slower than doing a 
listStatus on just that partition.

Also, Qubole mentions the following, which may be something to pursue in the 
future.
{code}
"we modified split computation to invoke listing at the level of the parent 
directory. This call returns all files (and their sizes) in all subdirectories 
in blocks of 1000. Some subdirectories and files may not be of interest to 
job/query e.g. partition elimination may have eliminated some of them. We take 
advantage of the fact that file listing is in lexicographic order and perform a 
modified merge join of the list of files and list of directories of interest."
{code}
When you mentioned earlier that Hadoop grabs 5000 objects at a time, does that 
include files in subdirectories?
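The modified merge join Qubole describes can be sketched as a single pass over the lexicographically sorted file listing and a sorted list of directory prefixes that survived partition elimination (a hedged illustration of the technique as quoted above; the method and parameter names are hypothetical, not Qubole's or Hive's code):

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixMergeJoin {
    /**
     * One merge-style pass: sortedFiles is a lexicographically sorted
     * listing of all files under the parent directory; sortedPrefixes is a
     * sorted list of partition directories of interest, each ending in '/'.
     * Files under an eliminated partition are skipped without any extra
     * listing calls. Both inputs being sorted, each list is traversed once.
     */
    static List<String> filesOfInterest(List<String> sortedFiles,
                                        List<String> sortedPrefixes) {
        List<String> out = new ArrayList<>();
        int p = 0;
        for (String file : sortedFiles) {
            // Advance past prefixes whose whole range sorts before this file.
            while (p < sortedPrefixes.size()
                   && !file.startsWith(sortedPrefixes.get(p))
                   && sortedPrefixes.get(p).compareTo(file) < 0) {
                p++;
            }
            if (p == sortedPrefixes.size()) break; // no prefixes left
            if (file.startsWith(sortedPrefixes.get(p))) out.add(file);
        }
        return out;
    }
}
```

Because every string sharing a prefix forms a contiguous run in lexicographic order, the prefix pointer never needs to move backwards, so the join is linear in the two list lengths.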

> Remove Hive file listing during split computation
> -------------------------------------------------
>
>                 Key: HIVE-14165
>                 URL: https://issues.apache.org/jira/browse/HIVE-14165
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Abdullah Yousufi
>            Assignee: Abdullah Yousufi
>         Attachments: HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputException thrown by FileInputFormat#getSplits() on the Hive side 
> instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.
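The catch-instead-of-prelist idea from the description can be illustrated with plain java.nio as an analogy (this is not the Hive/Hadoop code; `splitsFor` is a hypothetical stand-in for getSplits(), and NoSuchFileException plays the role of InvalidInputException): rather than paying for an extra listing round trip just to verify the path has input, attempt the real operation and treat "no such input" as an empty result.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.NotDirectoryException;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class AttemptThenCatch {
    // Hypothetical stand-in for split computation: list the input path
    // directly, and catch the failure instead of pre-checking existence.
    // On an object store like S3 this saves one listing round trip per path.
    static List<String> splitsFor(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        } catch (NoSuchFileException | NotDirectoryException e) {
            // Analogous to catching InvalidInputException from getSplits().
            return Collections.emptyList();
        }
    }
}
```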



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)