It looks like there is a case in TableReader.scala where Hive.get() is called without first being initialized via Hive.get(hiveConf). I am running in yarn-client mode (compiled with -Phive-provided and with hive-0.13.1a). This means the broadcasted HiveConf is not used; instead a default HiveConf object is created and used, which seems wrong. My understanding is that the HiveConf created on the driver should be used on all executors for correct behaviour. The query I am running is:
insert overwrite table X partition(month='2014-12') select colA, colB from Y where month='2014-12'

On the executor, it appears that no HiveContext is created, so there should have been a call to Hive.get(broadcastedHiveConf) somewhere that runs only on the executor (a sketch of what that might look like follows the stack trace below). Let me know if my analysis is correct and I can file a JIRA for this. Here is the executor-side stack trace:

[1] org.apache.hadoop.hive.ql.metadata.Hive.get (Hive.java:211)
[2] org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler (PlanUtils.java:810)
[3] org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler (PlanUtils.java:789)
[4] org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc (TableReader.scala:253)
[5] org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply (TableReader.scala:229)
[6] org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply (TableReader.scala:229)
[7] org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply (HadoopRDD.scala:172)
[8] org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply (HadoopRDD.scala:172)
[9] scala.Option.map (Option.scala:145)
[10] org.apache.spark.rdd.HadoopRDD.getJobConf (HadoopRDD.scala:172)
[11] org.apache.spark.rdd.HadoopRDD$$anon$1.<init> (HadoopRDD.scala:216)
[12] org.apache.spark.rdd.HadoopRDD.compute (HadoopRDD.scala:212)
[13] org.apache.spark.rdd.HadoopRDD.compute (HadoopRDD.scala:101)
[14] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:277)
[15] org.apache.spark.rdd.RDD.iterator (RDD.scala:244)
[16] org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:35)
[17] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:277)
[18] org.apache.spark.rdd.RDD.iterator (RDD.scala:244)
[19] org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:35)
[20] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:277)
[21] org.apache.spark.rdd.RDD.iterator (RDD.scala:244)
[22] org.apache.spark.rdd.UnionRDD.compute (UnionRDD.scala:87)
[23] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:277)
[24] org.apache.spark.rdd.RDD.iterator (RDD.scala:244)
[25] org.apache.spark.scheduler.ResultTask.runTask (ResultTask.scala:61)
[26] org.apache.spark.scheduler.Task.run (Task.scala:64)
[27] org.apache.spark.executor.Executor$TaskRunner.run (Executor.scala:203)
[28] java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1145)
[29] java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:615)
[30] java.lang.Thread.run (Thread.java:745)
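For illustration only, here is a minimal sketch of the kind of change the analysis suggests. This is not the actual Spark source: the hiveConf parameter on initializeLocalJobConfFunc and the wrapper object are assumptions; the point is simply that Hive.get(hiveConf) would need to run on the executor before PlanUtils internally calls the no-arg Hive.get().

    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.ql.metadata.Hive
    import org.apache.hadoop.hive.ql.plan.{PlanUtils, TableDesc}
    import org.apache.hadoop.mapred.JobConf

    object HadoopTableReaderSketch {
      // Hypothetical executor-side hook, modeled on
      // HadoopTableReader.initializeLocalJobConfFunc; the hiveConf argument
      // would be the value broadcast from the driver.
      def initializeLocalJobConfFunc(tableDesc: TableDesc, hiveConf: HiveConf)
                                    (jobConf: JobConf): Unit = {
        // Register the broadcasted HiveConf with the thread-local Hive object
        // *before* PlanUtils calls Hive.get() internally, so the executor does
        // not fall back to a freshly constructed default HiveConf.
        Hive.get(hiveConf)

        if (tableDesc != null) {
          PlanUtils.configureInputJobPropertiesForStorageHandler(tableDesc)
          // ... then copy the table properties into jobConf, set input paths, etc.
        }
      }
    }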