[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765737#comment-15765737 ]
Sahil Takiar commented on HIVE-14165:
-------------------------------------

Assigning to myself as [~ayousufi] is no longer working on this issue.

I played around with this patch and found a similar speedup for a simple {{select * from s3_partitioned_table}} query where {{s3_partitioned_table}} has 500 partitions all stored on S3 (each partition contains a CSV file of ~80 KB in size). Performance improves by about 2x.

The only problem I see with this patch is that it is technically a backwards-incompatible change. Hive allows any custom {{InputFormat}} to be registered for a table or for a partition. Before this patch, Hive guaranteed that the {{Path}} set in {{mapred.input.dir}} would always exist and would always contain files of non-zero length. After this patch, the given {{Path}} may not exist, or may be empty. This patch adds handling for {{FileInputFormat}}s, but since a user can register any custom {{InputFormat}} with a table, it's possible that some user queries may break. I'm not sure how much of an issue this is; technically, the {{InputFormat}} API makes no claim about whether a given {{Path}} should exist or be non-empty.

Also need to add some tests for this patch.

> Remove Hive file listing during split computation
> -------------------------------------------------
>
>                 Key: HIVE-14165
>                 URL: https://issues.apache.org/jira/browse/HIVE-14165
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Abdullah Yousufi
>            Assignee: Sahil Takiar
>         Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch,
> HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's
> FileInputFormat.java will list the files during split computation anyway to
> determine their size. One way to remove this is to catch the
> InvalidInputException thrown by FileInputFormat#getSplits() on the
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.
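
To make the compatibility concern concrete, here is a minimal, self-contained Java sketch of the "compute splits first, handle a missing directory afterwards" pattern. It is not the patch and does not use Hive or Hadoop classes: {{SplitSketch}}, {{getSplits}}, and {{splitsForPartitions}} are hypothetical names, and {{java.nio.file.NoSuchFileException}} only stands in for whatever exception a real {{InputFormat}} raises on a missing input path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch: none of these names are Hive or Hadoop APIs.
final class SplitSketch {

    // Stand-in for an InputFormat's split computation: one "split" per
    // non-empty regular file. The listing happens here, so the caller
    // does not need to list the directory beforehand.
    static List<Path> getSplits(Path inputDir) throws IOException {
        List<Path> splits = new ArrayList<>();
        // Files.list throws NoSuchFileException if inputDir does not exist.
        try (Stream<Path> files = Files.list(inputDir)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> p.toFile().length() > 0)
                 .sorted()
                 .forEach(splits::add);
        }
        return splits;
    }

    // Caller side: no up-front existence or emptiness check. A missing
    // partition directory is treated as contributing zero splits.
    static List<Path> splitsForPartitions(List<Path> partitionDirs) throws IOException {
        List<Path> all = new ArrayList<>();
        for (Path dir : partitionDirs) {
            try {
                all.addAll(getSplits(dir));
            } catch (NoSuchFileException e) {
                // Assumption: a missing directory is equivalent to an empty partition.
            }
        }
        return all;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("split-sketch");
        Path populated = Files.createDirectory(root.resolve("part=1"));
        Files.write(populated.resolve("data.csv"), "a,b\n1,2\n".getBytes());
        Path empty = Files.createDirectory(root.resolve("part=2"));
        Path missing = root.resolve("part=3"); // never created

        List<Path> splits = splitsForPartitions(Arrays.asList(populated, empty, missing));
        System.out.println(splits.size() + " split(s)");
    }
}
```

The point of the sketch is the contract change: {{getSplits}} is now handed directories that may be empty or absent, so any implementation that assumed otherwise would need its own handling for those cases.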