[ 
https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593575#comment-14593575
 ] 

Prasanth Jayachandran commented on HIVE-11043:
----------------------------------------------

Mostly looks good. 
A few questions/comments:
1) Can we use the same default for numSplits as MR, i.e. 1 instead of -1? That 
would make the ETL strategy the default even when there is only a single small 
file.
{code}
return generateSplitsInfo(conf, -1);
{code}
2) The condition should be numFiles <= context.minSplits, right? That would 
avoid choosing BI in the case of 1 small file.
3) I tried some queries, and the numSplits argument to getSplits() can be 0, in 
which case we end up using BI as the default even when there are only a small 
number of files.
4) Some more tests for these corner cases would be helpful.
5) Should we make this independently configurable instead of reusing the cache 
max size?
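
To make points 1-3 concrete, here is a minimal sketch of the selection logic 
being discussed. It is not the code from the patch: the class name, the 
choose() helper, its parameters, and the average-file-size threshold are all 
hypothetical, and clamping the requested split count to at least 1 is just one 
possible way to handle the -1 and 0 cases.

```java
// Illustrative sketch only; names and thresholds are assumptions, not Hive's API.
class SplitStrategySketch {
    enum Strategy { BI, ETL }

    /**
     * Hypothetical strategy chooser.
     *
     * @param numFiles        number of files in the directory
     * @param avgFileSize     average file size in bytes
     * @param maxSplitSize    assumed size threshold above which ETL is preferred
     * @param requestedSplits the numSplits hint, which may arrive as -1 or 0
     */
    static Strategy choose(int numFiles, long avgFileSize,
                           long maxSplitSize, int requestedSplits) {
        // Clamp so that -1 or 0 behaves like the MR default of 1 (points 1 and 3).
        int minSplits = Math.max(1, requestedSplits);

        // Large files: prefer ETL (assumed file-size rule).
        if (avgFileSize > maxSplitSize) {
            return Strategy.ETL;
        }
        // Few files: prefer ETL. Using <= keeps a single small file on ETL (point 2).
        if (numFiles <= minSplits) {
            return Strategy.ETL;
        }
        // Many small files: fall back to BI.
        return Strategy.BI;
    }

    public static void main(String[] args) {
        System.out.println(choose(1, 10L, 100L, 0));   // single small file, hint of 0
        System.out.println(choose(50, 10L, 100L, 1));  // many small files
    }
}
```

Without the clamp, a hint of 0 or -1 makes numFiles <= minSplits false for any 
non-empty directory, so a single small file would fall through to BI, which is 
the corner case that the tests in point 4 should cover.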

> ORC split strategies should adapt based on number of files
> ----------------------------------------------------------
>
>                 Key: HIVE-11043
>                 URL: https://issues.apache.org/jira/browse/HIVE-11043
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Gopal V
>             Fix For: 2.0.0
>
>         Attachments: HIVE-11043.1.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average 
> file size. It would be beneficial to choose a different strategy based on 
> number of files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
