[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252762#comment-14252762 ]
Rui Li commented on HIVE-9153:
------------------------------

Hi [~xuefuz] - if the Spark cluster is the same as the Hadoop cluster, i.e. each executor is also a datanode, the Spark task scheduler usually does a good job of making sure all mappers have some locality (provided, of course, that the mappers specify a preferred location). In that case, more mappers won't impact data locality.

bq. Is there a way to disable Spark's delayed schedule to try out?
The Spark task scheduler divides tasks into multiple lists according to locality level and, when an executor offers resources, attempts to launch tasks at the highest locality level first. It may also wait some time before scheduling tasks at a lower level. I don't think there's a switch to turn it off.
Actually I'm not 100% sure it's the delay scheduling that is causing the issue. If none of our tasks has a preferred location, the delay may happen at start-up (waiting for the allowed locality level to drop) but not during execution. I'll look more into this.

> Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
> ---------------------------------------------------------------------
>
>                 Key: HIVE-9153
>                 URL: https://issues.apache.org/jira/browse/HIVE-9153
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: spark-branch
>            Reporter: Brock Noland
>            Assignee: Rui Li
>         Attachments: screenshot.PNG
>
>
> The default InputFormat is {{CombineHiveInputFormat}} and thus HOS uses this.
> However, Tez uses {{HiveInputFormat}}. Since tasks are relatively cheap in
> Spark, it might make sense for us to use {{HiveInputFormat}} as well. We
> should evaluate this on a query which has many input splits such as {{select
> count(\*) from store_sales where something is not null}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
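
For illustration only: the scheduling behaviour described in the comment above (launch the most local task when an executor offers resources, and fall back to a lower locality level only after waiting some time) can be sketched with a toy model. This is a simplified two-level sketch, not Spark's actual scheduler code. The names {{Task}}, {{ToyDelayScheduler}}, {{offer}} and {{locality_wait}} are hypothetical, and {{locality_wait}} is only assumed to play a role loosely similar to Spark's {{spark.locality.wait}} setting.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    task_id: int
    preferred_host: Optional[str]  # None means no locality preference


class ToyDelayScheduler:
    """Simplified two-level delay scheduler: node-local first, then any host.

    Hypothetical model for illustration; Spark tracks more locality levels
    and more state than this.
    """

    def __init__(self, tasks: List[Task], locality_wait: float):
        self.pending = list(tasks)
        self.locality_wait = locality_wait
        self.last_local_launch = 0.0  # time of the last node-local launch

    def offer(self, host: str, now: float) -> Optional[Task]:
        """An executor on `host` offers a free slot at time `now`."""
        # First choice: a pending task whose preferred host matches the offer.
        for task in self.pending:
            if task.preferred_host == host:
                self.pending.remove(task)
                self.last_local_launch = now
                return task
        # Otherwise launch a non-local task only once the wait has expired.
        if self.pending and now - self.last_local_launch >= self.locality_wait:
            return self.pending.pop(0)
        # Leave the slot idle and keep waiting for a local offer.
        return None
```

In this model an offer from a host that no pending task prefers is declined until {{locality_wait}} has elapsed since the last local launch, so executors can sit idle for a while. With {{locality_wait=0.0}} the same offer is accepted immediately, which is the kind of experiment the quoted question asks about. Whether the delay is actually what causes the issue in HOS is still open, as noted in the comment.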