hive initialization on executors
I am facing the exception "Hive.get() called without a hive db setup" in the executor. I wanted to understand how the Hive object is initialized in the executor threads. I only see Hive.get(hiveconf) in two places in the Spark 1.3 code:

In HiveContext.scala - I don't think this is created on the executor.
In HiveMetastoreCatalog.scala - I am not sure if it is created on the executor.

Any information on how the Hive code is bootstrapped on the executor would be really helpful, and I can do the debugging from there. I have compiled spark-1.3 with -Phive-provided.

In case you are curious, the stacktrace is:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive.get() called without a hive db setup
    at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:841)
    at org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:776)
    at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:253)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:216)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive.get() called without a hive db setup
    at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:211)
    at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:797)
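For context, this is the kind of executor-side bootstrap I was expecting to find somewhere in the code. It is only a rough sketch of the idea (ExecutorHiveBootstrap and its placement are my own invention for illustration, not actual Spark code):

    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.ql.metadata.Hive
    import org.apache.hadoop.mapred.JobConf

    // Hypothetical helper: make sure a Hive instance backed by an explicit
    // HiveConf exists on the current executor thread before any Hive utility
    // (such as PlanUtils.configureInputJobPropertiesForStorageHandler) ends up
    // calling the no-arg Hive.get().
    object ExecutorHiveBootstrap {
      def ensureHive(jobConf: JobConf): Hive = {
        // Build a HiveConf from the task's JobConf; in a real setup the
        // configuration would come from the conf shipped by the driver.
        val hiveConf = new HiveConf(jobConf, classOf[HiveConf])
        // Hive.get(HiveConf) registers a thread-local Hive instance, so a later
        // no-arg Hive.get() on the same thread can find it instead of failing
        // with "Hive.get() called without a hive db setup".
        Hive.get(hiveConf)
      }
    }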
creating hive packages for spark
Hello Spark developers,

I want to understand the procedure for creating the org.spark-project.hive jars. Is this documented somewhere? I am having issues with -Phive-provided and my private hive13 jars, and want to check whether using Spark's own procedure helps.
Re: creating hive packages for spark
Yash, this is exactly what I wanted! Thanks a bunch.

On 27 April 2015 at 15:39, yash datta wrote:
> Hi,
>
> You can build the spark-project hive from here:
>
> https://github.com/pwendell/hive/tree/0.13.1-shaded-protobuf
>
> Hope this helps.
Re: hive initialization on executors
The problem was in my hive-13 branch. So ignore this.

On 27 April 2015 at 10:34, Manku Timma wrote:
> I am facing the exception "Hive.get() called without a hive db setup" in
> the executor. I wanted to understand how the Hive object is initialized in
> the executor threads. [...]
Hive.get() called without HiveConf being already set on a yarn executor
Looks like there is a case in TableReader.scala where Hive.get() is being called without it already having been set via Hive.get(hiveconf). I am running in yarn-client mode (compiled with -Phive-provided and with hive-0.13.1a). Basically this means the broadcast hiveconf is not getting used and a default HiveConf object is getting created and used instead -- which sounds wrong. My understanding is that the HiveConf created on the driver should be used on all executors for correct behaviour.

The query I am running is:

insert overwrite table X partition(month='2014-12') select colA, colB from Y where month='2014-12'

On the executor, the HiveContext is not created, so there should have been a call to Hive.get(broadcastedHiveConf) somewhere that runs only on the executor. Let me know if my analysis is correct and I can file a JIRA for this.

  [1] org.apache.hadoop.hive.ql.metadata.Hive.get (Hive.java:211)
  [2] org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler (PlanUtils.java:810)
  [3] org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler (PlanUtils.java:789)
  [4] org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc (TableReader.scala:253)
  [5] org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply (TableReader.scala:229)
  [6] org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply (TableReader.scala:229)
  [7] org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply (HadoopRDD.scala:172)
  [8] org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply (HadoopRDD.scala:172)
  [9] scala.Option.map (Option.scala:145)
  [10] org.apache.spark.rdd.HadoopRDD.getJobConf (HadoopRDD.scala:172)
  [11] org.apache.spark.rdd.HadoopRDD$$anon$1.<init> (HadoopRDD.scala:216)
  [12] org.apache.spark.rdd.HadoopRDD.compute (HadoopRDD.scala:212)
  [13] org.apache.spark.rdd.HadoopRDD.compute (HadoopRDD.scala:101)
  [14] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:277)
  [15] org.apache.spark.rdd.RDD.iterator (RDD.scala:244)
  [16] org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:35)
  [17] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:277)
  [18] org.apache.spark.rdd.RDD.iterator (RDD.scala:244)
  [19] org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:35)
  [20] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:277)
  [21] org.apache.spark.rdd.RDD.iterator (RDD.scala:244)
  [22] org.apache.spark.rdd.UnionRDD.compute (UnionRDD.scala:87)
  [23] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:277)
  [24] org.apache.spark.rdd.RDD.iterator (RDD.scala:244)
  [25] org.apache.spark.scheduler.ResultTask.runTask (ResultTask.scala:61)
  [26] org.apache.spark.scheduler.Task.run (Task.scala:64)
  [27] org.apache.spark.executor.Executor$TaskRunner.run (Executor.scala:203)
  [28] java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1145)
  [29] java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:615)
  [30] java.lang.Thread.run (Thread.java:745)
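To make the idea concrete, this is roughly the shape of the change I have in mind in HadoopTableReader.initializeLocalJobConfFunc. It is only a sketch that follows the frame names in the trace above; the body of the real function differs, and the Hive.get(...) line is the assumed addition:

    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.ql.exec.Utilities
    import org.apache.hadoop.hive.ql.metadata.Hive
    import org.apache.hadoop.hive.ql.plan.{PlanUtils, TableDesc}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

    // Sketch of the executor-side job-conf initialization: seed the
    // thread-local Hive with an explicit HiveConf before PlanUtils consults
    // the storage handler, so its internal no-arg Hive.get() does not fail.
    object HadoopTableReaderSketch {
      def initializeLocalJobConfFunc(path: String, tableDesc: TableDesc)(jobConf: JobConf): Unit = {
        FileInputFormat.setInputPaths(jobConf, path)
        if (tableDesc != null) {
          // Assumed fix: initialize Hive from the conf available on the executor
          // (ideally the broadcast HiveConf rather than one rebuilt from jobConf).
          Hive.get(new HiveConf(jobConf, classOf[HiveConf]))
          PlanUtils.configureInputJobPropertiesForStorageHandler(tableDesc)
          Utilities.copyTableJobPropertiesToConf(tableDesc, jobConf)
        }
      }
    }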
Re: Change for submitting to yarn in 1.3.1
sc.applicationId gives the yarn appid.

On 11 May 2015 at 08:13, Mridul Muralidharan wrote:
> We had a similar requirement, and as a stopgap, I currently use a
> suboptimal impl-specific workaround - parsing it out of the
> stdout/stderr (based on log config).
> A better means to get to this is indeed required!
>
> Regards,
> Mridul
>
> On Sun, May 10, 2015 at 7:33 PM, Ron's Yahoo! wrote:
> > Hi,
> >   I used to submit my Spark yarn applications by using the
> > org.apache.spark.yarn.deploy.Client api so I can get the application id
> > after I submit it. The following is the code that I have, but after
> > upgrading to 1.3.1, the yarn Client class was made into a private class.
> > Is there a particular reason why this Client class was made private?
> >   I know that there's a new SparkSubmit object that can be used, but
> > it's not clear to me how I can use it to get the application id after
> > submitting to the cluster.
> >   Thoughts?
> >
> > Thanks,
> > Ron
> >
> > class SparkLauncherServiceImpl extends SparkLauncherService {
> >
> >   override def runApp(conf: Configuration, appName: String, queue: String): ApplicationId = {
> >     val ws = SparkLauncherServiceImpl.getWorkspace()
> >     val params = Array("--class", //
> >       "com.xyz.sparkdb.service.impl.AssemblyServiceImpl", //
> >       "--name", appName, //
> >       "--queue", queue, //
> >       "--driver-memory", "1024m", //
> >       "--addJars", getListOfDependencyJars(s"$ws/ledp/le-sparkdb/target/dependency"), //
> >       "--jar", s"file:$ws/ledp/le-sparkdb/target/le-sparkdb-1.0.3-SNAPSHOT.jar")
> >     System.setProperty("SPARK_YARN_MODE", "true")
> >     System.setProperty("spark.driver.extraJavaOptions",
> >       "-XX:PermSize=128m -XX:MaxPermSize=128m -Dsun.io.serialization.extendedDebugInfo=true")
> >     val sparkConf = new SparkConf()
> >     val args = new ClientArguments(params, sparkConf)
> >     new Client(args, conf, sparkConf).runApp()
> >   }
> >
> >   private def getListOfDependencyJars(baseDir: String): String = {
> >     val files = new File(baseDir).listFiles().filter(!_.getName().startsWith("spark-assembly"))
> >     val prependedFiles = files.map(x => "file:" + x.getAbsolutePath())
> >     val result = ((prependedFiles.tail.foldLeft(new StringBuilder(prependedFiles.head))) {(acc, e) => acc.append(", ").append(e)}).toString()
> >     result
> >   }
> > }
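A minimal, self-contained example of that approach (the app name and master are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Rough sketch: instead of driving the (now private) yarn Client directly,
    // create the SparkContext in yarn-client mode and read the YARN application
    // id off the context itself.
    object AppIdExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("yarn-client")
          .setAppName("AppIdExample") // placeholder app name
        val sc = new SparkContext(conf)

        // On YARN this is the id assigned by the ResourceManager,
        // e.g. something of the form application_1431234567890_0001.
        val appId: String = sc.applicationId
        println(s"YARN application id: $appId")

        sc.stop()
      }
    }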