[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392338#comment-15392338 ]
Steve Loughran commented on HIVE-14165:
---------------------------------------

If you look at the cost of listing in S3, you'll see that Hadoop already grabs 5000 objects at a time. What hurts is walking the directory tree, as each subdirectory has to be probed recursively. s3a will soon have an O(files/1000) recursive list; if you can use listFiles(path, recursive=true), you will get that speed.


> Enable faster S3 Split Computation by listing files in blocks
> --------------------------------------------------------------
>
>                 Key: HIVE-14165
>                 URL: https://issues.apache.org/jira/browse/HIVE-14165
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Abdullah Yousufi
>            Assignee: Abdullah Yousufi
>
> During split computation, when a large number of files must be listed from S3, Hive executes one API call per file. Listing 1000 files in each API call instead would reduce the time required for listing.
> Qubole has this optimization in place, as detailed here:
> https://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/?nabe=5695374637924352:0
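For illustration, here is a minimal sketch (not from the ticket) of the flat recursive listing the comment recommends, using the Hadoop FileSystem.listFiles(path, recursive=true) API; the bucket and table path are hypothetical placeholders:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class S3RecursiveListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical bucket/table path, for illustration only.
        Path root = new Path("s3a://my-bucket/warehouse/my_table");

        FileSystem fs = root.getFileSystem(conf);

        // listFiles(path, recursive=true) returns a RemoteIterator that pages
        // through the object-store listing in batches, instead of probing each
        // subdirectory with a separate round trip as a tree walk would.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
        long count = 0;
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            count++;
            // Split computation would consume status.getPath(),
            // status.getLen(), and status.getBlockLocations() here.
        }
        System.out.println("Listed " + count + " files");
    }
}
{code}

Because a single iterator drives the whole listing, the per-file and per-subdirectory calls that the ticket describes are avoided; the batching happens inside the filesystem client, as the comment notes.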