[ https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593575#comment-14593575 ]
Prasanth Jayachandran commented on HIVE-11043:
----------------------------------------------

Mostly looks good. A few questions/comments:

1) Can we use the same default for numSplits as MR, i.e. 1 instead of -1? This will make the ETL strategy the default even in the presence of a single small file.
{code}
return generateSplitsInfo(conf, -1);
{code}
2) The condition should be numFiles <= context.minSplits, right? This will avoid choosing BI in the case of 1 small file.
3) I tried some queries, and the numSplits arg in getSplits() can become 0. In that case we will end up using BI as the default even though there are only a small number of files.
4) Some more tests for these corner cases would be helpful.
5) Should we make this independently configurable instead of using the cache max size?

> ORC split strategies should adapt based on number of files
> ----------------------------------------------------------
>
>                 Key: HIVE-11043
>                 URL: https://issues.apache.org/jira/browse/HIVE-11043
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Gopal V
>             Fix For: 2.0.0
>
>         Attachments: HIVE-11043.1.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average
> file size. It would be beneficial to choose a different strategy based on
> number of files as well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)