[ 
https://issues.apache.org/jira/browse/HIVE-22964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051924#comment-17051924
 ] 

Peter Vary commented on HIVE-22964:
-----------------------------------

Hi [~aditya-shah],

There are multiple places where similar parallelization happens. See for 
example HIVE-22832.

What do you think about reusing the HIVE_MOVE_FILES_THREAD_COUNT configuration 
value for this as well? I know this is not ideal, but I see this config reused 
multiple times where we want to parallelize the file access/checks.

Also, if there is an error when accessing one of the files, the original 
solution stops immediately, while the new solution will still try to access 
all of the files. This could be problematic for tables on S3 with a large 
number of files. (HIVE-22832 solves this as well.)
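
To illustrate the fail-fast behaviour I mean, here is a minimal sketch. It is 
not the code from HIVE-22832 or from the attached patch. The class, interface 
and method names are hypothetical, and threadCount is assumed to be whatever 
HIVE_MOVE_FILES_THREAD_COUNT resolves to in the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, for illustration only.
final class FailFastFileChecks {

  /** Hypothetical stand-in for a per-file operation (listing, validation, ...). */
  interface FileCheck {
    String check(String path) throws Exception;
  }

  /**
   * Runs the check for every path on a fixed-size pool and stops at the first
   * failure. Results are returned in completion order, not input order.
   */
  static List<String> checkAll(List<String> paths, int threadCount, FileCheck check)
      throws Exception {
    // Assumption: the caller passes the configured thread count; use at least 1.
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threadCount));
    ExecutorCompletionService<String> completion = new ExecutorCompletionService<>(pool);
    List<Future<String>> futures = new ArrayList<>();
    try {
      for (String path : paths) {
        futures.add(completion.submit(() -> check.check(path)));
      }
      List<String> results = new ArrayList<>();
      for (int i = 0; i < paths.size(); i++) {
        try {
          // take() returns tasks as they finish, so a failure is seen early.
          results.add(completion.take().get());
        } catch (ExecutionException e) {
          // First failure: cancel everything that has not completed yet.
          for (Future<String> f : futures) {
            f.cancel(true);
          }
          Throwable cause = e.getCause();
          throw (cause instanceof Exception) ? (Exception) cause : e;
        }
      }
      return results;
    } finally {
      // Drops queued tasks and interrupts running ones on every exit path.
      pool.shutdownNow();
    }
  }
}
```

Tasks that are already running when the first error arrives may still finish, 
but queued ones are dropped, so we do not touch every file on S3 after a 
failure.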

Thanks,

Peter

> MM table split computation is very slow
> ---------------------------------------
>
>                 Key: HIVE-22964
>                 URL: https://issues.apache.org/jira/browse/HIVE-22964
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Aditya Shah
>            Assignee: Aditya Shah
>            Priority: Major
>         Attachments: HIVE-22964.patch
>
>
> For MM tables we process the paths before calling inputFormat.getSplits(), 
> so we end up listing the whole table at once. This could be optimized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
