[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767455#comment-15767455
 ] 

Sahil Takiar commented on HIVE-14165:
-------------------------------------

[~poeppt] I just attached an RB.

I agree we shouldn't make backwards-incompatible changes to Hive. Let me know 
what you think of the RB.

There are some alternatives to this approach though:

* The file listing could be done in the background by a dedicated thread
* The listing could be done eagerly rather than lazily, so that it does not 
block the fetch operator

Either alternative would offer a good speedup, but would still require the 
same number of metadata operations against S3.
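
A rough sketch of the background-listing idea, to make the alternative 
concrete. This is not what the RB does, and nothing here is Hive code: 
BackgroundLister and listAsync are hypothetical names, and java.nio.file 
stands in for the Hadoop FileSystem API so the example is self-contained. 
The listing starts eagerly on a dedicated thread, and the caller only blocks 
when it actually needs the result.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hive: lists a directory on a dedicated
// thread so the caller is not blocked while the listing is in progress.
public class BackgroundLister implements AutoCloseable {
  // Single dedicated daemon thread for listing work.
  private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
    Thread t = new Thread(r, "file-lister");
    t.setDaemon(true);
    return t;
  });

  // Starts the listing immediately and returns without waiting for it.
  // Assumption: a local directory stands in for an S3 partition path.
  public Future<List<Path>> listAsync(Path dir) {
    return executor.submit(() -> {
      try (Stream<Path> entries = Files.list(dir)) {
        return entries
            .filter(Files::isRegularFile)
            .sorted()
            .collect(Collectors.toList());
      }
    });
  }

  @Override
  public void close() {
    executor.shutdownNow();
  }
}
```

The caller would invoke listAsync() as early as possible and call get() on 
the returned Future only at the point where the file list is required. The 
number of listing calls is unchanged; only the waiting overlaps with other 
work.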

> Remove Hive file listing during split computation
> -------------------------------------------------
>
>                 Key: HIVE-14165
>                 URL: https://issues.apache.org/jira/browse/HIVE-14165
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Abdullah Yousufi
>            Assignee: Sahil Takiar
>         Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.04.patch, HIVE-14165.05.patch, HIVE-14165.06.patch, 
> HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
