[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763302#comment-15763302 ]
Rui Li commented on HIVE-9153:
------------------------------

I guess no configuration is suitable for all cases :)
If I remember correctly, a smaller "mapreduce.input.fileinputformat.split.maxsize" means more map tasks, which is bad for performance when the data size is relatively big. So increasing it should help in most cases. Of course, users should adjust it according to the cluster deployment, executor resources, etc.

I'm not sure what you mean by performance test JIRAs. We have quite a few JIRAs to improve performance, and I think each such JIRA involves some simple performance test to verify the improvement. But I don't remember all of them.

> Perf enhancement on CombineHiveInputFormat and HiveInputFormat
> --------------------------------------------------------------
>
>                 Key: HIVE-9153
>                 URL: https://issues.apache.org/jira/browse/HIVE-9153
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Brock Noland
>            Assignee: Rui Li
>             Fix For: 1.1.0
>
>         Attachments: HIVE-9153.1-spark.patch, HIVE-9153.1-spark.patch,
> HIVE-9153.2.patch, HIVE-9153.3.patch, screenshot.PNG
>
>
> The default InputFormat is {{CombineHiveInputFormat}} and thus HOS uses this.
> However, Tez uses {{HiveInputFormat}}. Since tasks are relatively cheap in
> Spark, it might make sense for us to use {{HiveInputFormat}} as well. We
> should evaluate this on a query which has many input splits such as {{select
> count(\*) from store_sales where something is not null}}.
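
To make the split-size point in the comment concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which the number of map tasks is about the total input size divided by the maximum split size. The real combine logic also groups by node and rack and respects file boundaries, so actual counts will differ. The function name and the sizes used are hypothetical, chosen only for illustration.

```python
def estimate_split_count(total_bytes: int, max_split_bytes: int) -> int:
    """Approximate the number of input splits (and so map tasks).

    Simplified assumption: splits are packed up to max_split_bytes,
    ignoring file boundaries, locality grouping and minimum split sizes.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if max_split_bytes <= 0:
        raise ValueError("max_split_bytes must be positive")
    # Integer ceiling division avoids float rounding on large sizes.
    return (total_bytes + max_split_bytes - 1) // max_split_bytes


MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

# Hypothetical 1 TiB input with two candidate values for
# mapreduce.input.fileinputformat.split.maxsize.
small_setting = estimate_split_count(TIB, 256 * MIB)
large_setting = estimate_split_count(TIB, 1 * GIB)

print(f"256 MiB max split: ~{small_setting} map tasks")
print(f"1 GiB max split:   ~{large_setting} map tasks")
```

Under this model, raising the maximum split size fourfold cuts the estimated task count by the same factor, which is the effect described in the comment. Whether fewer, larger tasks are actually faster still depends on the cluster deployment and executor resources.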