[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392338#comment-15392338 ]

Steve Loughran commented on HIVE-14165:
---------------------------------------

If you look at the cost of listing in S3, you'll see that Hadoop already grabs 
5000 objects at a time. What hurts is walking the directory tree, as each 
subdirectory has to be probed recursively.

s3a will soon have an O(files/1000) recursive list. If you can use 
listFiles(path, recursive=true), you will get that speed.
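
For anyone wiring this up, here is a minimal sketch of that call against the 
Hadoop FileSystem API; the bucket and path are hypothetical, and it assumes a 
Hadoop build with the s3a connector on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class RecursiveS3AList {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical bucket/path; any s3a:// URI works the same way.
        Path root = new Path("s3a://my-bucket/warehouse/my_table");
        FileSystem fs = root.getFileSystem(conf);
        // recursive=true asks for a flat enumeration of every file under
        // the path, so s3a can answer from paged object listings instead
        // of probing each subdirectory one by one.
        RemoteIterator<LocatedFileStatus> files = fs.listFiles(root, true);
        while (files.hasNext()) {
          LocatedFileStatus status = files.next();
          System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }
      }
    }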

> Enable faster S3 Split Computation by listing files in blocks
> -------------------------------------------------------------
>
>                 Key: HIVE-14165
>                 URL: https://issues.apache.org/jira/browse/HIVE-14165
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Abdullah Yousufi
>            Assignee: Abdullah Yousufi
>
> During split computation, when a large number of files must be listed from 
> S3, one can batch the listing at 1000 keys per API call instead of executing 
> one call per file. This would reduce the time required to enumerate the files 
> (a sketch of such batched listing follows the quoted description below).
> Qubole has this optimization in place as detailed here: 
> https://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/?nabe=5695374637924352:0
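
Below is a minimal sketch of what 1000-keys-per-call listing looks like 
against the raw S3 API, using the AWS SDK for Java v1. The client setup, 
bucket, and prefix are hypothetical; this illustrates the batching only, 
not Hive's actual split-computation code:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ListObjectsV2Request;
    import com.amazonaws.services.s3.model.ListObjectsV2Result;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class BatchedS3Listing {
      public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        ListObjectsV2Request request = new ListObjectsV2Request()
            .withBucketName("my-bucket")       // hypothetical bucket
            .withPrefix("warehouse/my_table/") // hypothetical table prefix
            .withMaxKeys(1000);                // 1000 keys per call, the S3 maximum
        ListObjectsV2Result result;
        do {
          // One API call returns up to 1000 object summaries at once,
          // rather than one round trip per file.
          result = s3.listObjectsV2(request);
          for (S3ObjectSummary summary : result.getObjectSummaries()) {
            System.out.println(summary.getKey() + " (" + summary.getSize() + " bytes)");
          }
          // Continue from where the previous page left off.
          request.setContinuationToken(result.getNextContinuationToken());
        } while (result.isTruncated());
      }
    }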



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
