[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765737#comment-15765737
 ] 

Sahil Takiar commented on HIVE-14165:
-------------------------------------

Assigning to myself as [~ayousufi] is no longer working on this issue.

I played around with this patch and found a similar speedup for a simple 
{{select * from s3_partitioned_table}} query where {{s3_partitioned_table}} has 
500 partitions all stored on S3 (each partition contains a CSV file of ~80 KB 
in size). Performance improves by about 2x.

The only problem I see with this patch is that it is technically a 
backwards-incompatible change. Hive allows any custom {{InputFormat}} to be 
registered for a table or for a partition. Before this patch, Hive guaranteed 
that the {{Path}} set in {{mapred.input.dir}} would always exist and would 
always contain files of non-zero length. After this patch, the given {{Path}} 
may not exist, or may be empty. This patch adds handling for 
{{FileInputFormat}}s, but because a user can register any custom 
{{InputFormat}} with a table, it's possible that some user queries will break.

I'm not sure how much of an issue this is, since technically the 
{{InputFormat}} API makes no claim about whether a given {{Path}} must exist 
or must be non-empty.
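
To make the concern concrete, here is a minimal sketch of the kind of defensive 
split computation a custom input format would need once the input path is no 
longer guaranteed to exist or to contain data. It is only an illustration: 
{{TolerantSplits}} and {{computeSplits}} are hypothetical names, and 
{{java.nio.file}} stands in for Hadoop's {{FileSystem}} API so that the example 
runs without any Hadoop dependency. It is not code from the patch.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; java.nio.file is an assumed stand-in for Hadoop's FileSystem. */
final class TolerantSplits {
    private TolerantSplits() {}

    /**
     * Returns one "split" (here simply the file's path) per non-empty regular
     * file directly under inputDir. A missing or empty directory yields an
     * empty list rather than an exception.
     */
    static List<Path> computeSplits(Path inputDir) throws IOException {
        List<Path> splits = new ArrayList<>();
        if (!Files.isDirectory(inputDir)) {
            // The path does not exist (or is not a directory): nothing to read.
            return splits;
        }
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir)) {
            for (Path file : files) {
                // Skip subdirectories and zero-length files.
                if (Files.isRegularFile(file) && Files.size(file) > 0) {
                    splits.add(file);
                }
            }
        }
        return splits;
    }
}
```

An input format that instead assumes the directory exists and holds at least 
one non-empty file would fail on a missing or empty partition, which is the 
breakage described above.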

This patch also still needs tests.

> Remove Hive file listing during split computation
> -------------------------------------------------
>
>                 Key: HIVE-14165
>                 URL: https://issues.apache.org/jira/browse/HIVE-14165
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Abdullah Yousufi
>            Assignee: Sahil Takiar
>         Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputException thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
