hive initialization on executors

2015-04-26 Thread Manku Timma
I am facing an exception, "Hive.get() called without a hive db setup", in the
executor. I want to understand how the Hive object is initialized in the
executor threads. I only see Hive.get(hiveconf) called in two places in the
Spark 1.3 code:

In HiveContext.scala - I don't think this is created on the executor.
In HiveMetastoreCatalog.scala - I am not sure whether it is created on the
executor.

Any information on how the Hive code is bootstrapped on the executor would be
really helpful, and I can take the debugging from there. I have compiled Spark
1.3 with -Phive-provided.
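
For context, my working assumption (from the Hive build I am running) is that
Hive.get(hiveconf) creates the client and caches it in a thread-local, and the
no-arg Hive.get() throws the above error when that was never done on the
current thread. Roughly, the call that seems to be missing on the executor
task threads would look something like this (the wrapper object below is mine,
purely for illustration):

  import org.apache.hadoop.hive.conf.HiveConf
  import org.apache.hadoop.hive.ql.metadata.Hive

  // Illustrative sketch, not Spark code: prime the per-thread Hive handle so
  // that later no-arg Hive.get() calls on this task thread can succeed.
  object HiveOnExecutor {
    def prime(hiveConf: HiveConf): Hive = Hive.get(hiveConf)
  }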

In case you are curious, the stack trace is:
java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive.get() called without
a hive db setup
  at
org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:841)
  at
org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:776)
  at
org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:253)
  at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
  at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
  at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
  at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
  at scala.Option.map(Option.scala:145)
  at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:216)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
  at org.apache.spark.scheduler.Task.run(Task.scala:64)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
  at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive.get()
called without a hive db setup
  at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:211)
  at
org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:797)


creating hive packages for spark

2015-04-27 Thread Manku Timma
Hello Spark developers,
I want to understand the procedure used to create the org.spark-project.hive
jars. Is this documented somewhere? I am having issues with -Phive-provided
and my private Hive 0.13 jars, and want to check whether using Spark's
procedure helps.


Re: creating hive packages for spark

2015-04-28 Thread Manku Timma
Yash,
This is exactly what I wanted! Thanks a bunch.

On 27 April 2015 at 15:39, yash datta  wrote:

> Hi,
>
> You can build the spark-project hive jars from here:
>
> https://github.com/pwendell/hive/tree/0.13.1-shaded-protobuf
>
> Hope this helps.
>
>
> On Mon, Apr 27, 2015 at 3:23 PM, Manku Timma 
> wrote:
>
>> Hello Spark developers,
>> I want to understand the procedure used to create the org.spark-project.hive
>> jars. Is this documented somewhere? I am having issues with -Phive-provided
>> and my private Hive 0.13 jars, and want to check whether using Spark's
>> procedure helps.
>>
>
>
>
> --
> When events unfold with calm and ease
> When the winds that blow are merely breeze
> Learn from nature, from birds and bees
> Live your life in love, and let joy not cease.
>


Re: hive initialization on executors

2015-04-29 Thread Manku Timma
The problem was in my hive-13 branch. So ignore this.

On 27 April 2015 at 10:34, Manku Timma  wrote:

> I am facing an exception, "Hive.get() called without a hive db setup", in
> the executor. I want to understand how the Hive object is initialized in the
> executor threads. I only see Hive.get(hiveconf) called in two places in the
> Spark 1.3 code:
>
> In HiveContext.scala - I don't think this is created on the executor.
> In HiveMetastoreCatalog.scala - I am not sure whether it is created on the
> executor.
>
> Any information on how the Hive code is bootstrapped on the executor would
> be really helpful, and I can take the debugging from there. I have compiled
> Spark 1.3 with -Phive-provided.
>


Hive.get() called without HiveConf being already set on a yarn executor

2015-05-05 Thread Manku Timma
It looks like there is a code path in TableReader.scala where Hive.get() is
called without it first being set via Hive.get(hiveconf). I am running in
yarn-client mode (compiled with -Phive-provided and with hive-0.13.1a).
Basically this means the broadcast hiveconf is not getting used, and the
default HiveConf object is getting created and used instead -- which sounds
wrong. My understanding is that the HiveConf created on the driver should be
used on all executors for correct behaviour. The query I am running is:

  insert overwrite table X partition(month='2014-12')
  select colA, colB from Y where month='2014-12'

On the executor, it appears that the HiveContext is not created, so there
should be a call to Hive.get(broadcastedHiveConf) somewhere that runs only on
the executor. Let me know if my analysis is correct and I can file a JIRA for
this.


  [1] org.apache.hadoop.hive.ql.metadata.Hive.get (Hive.java:211)
  [2]
org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler
(PlanUtils.java:810)
  [3]
org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler
(PlanUtils.java:789)
  [4]
org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc
(TableReader.scala:253)
  [5] org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply
(TableReader.scala:229)
  [6] org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply
(TableReader.scala:229)
  [7] org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply
(HadoopRDD.scala:172)
  [8] org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply
(HadoopRDD.scala:172)
  [9] scala.Option.map (Option.scala:145)
  [10] org.apache.spark.rdd.HadoopRDD.getJobConf (HadoopRDD.scala:172)
  [11] org.apache.spark.rdd.HadoopRDD$$anon$1.<init> (HadoopRDD.scala:216)
  [12] org.apache.spark.rdd.HadoopRDD.compute (HadoopRDD.scala:212)
  [13] org.apache.spark.rdd.HadoopRDD.compute (HadoopRDD.scala:101)
  [14] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:277)
  [15] org.apache.spark.rdd.RDD.iterator (RDD.scala:244)
  [16] org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
  [17] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:277)
  [18] org.apache.spark.rdd.RDD.iterator (RDD.scala:244)
  [19] org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
  [20] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:277)
  [21] org.apache.spark.rdd.RDD.iterator (RDD.scala:244)
  [22] org.apache.spark.rdd.UnionRDD.compute (UnionRDD.scala:87)
  [23] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:277)
  [24] org.apache.spark.rdd.RDD.iterator (RDD.scala:244)
  [25] org.apache.spark.scheduler.ResultTask.runTask (ResultTask.scala:61)
  [26] org.apache.spark.scheduler.Task.run (Task.scala:64)
  [27] org.apache.spark.executor.Executor$TaskRunner.run
(Executor.scala:203)
  [28] java.util.concurrent.ThreadPoolExecutor.runWorker
(ThreadPoolExecutor.java:1145)
  [29] java.util.concurrent.ThreadPoolExecutor$Worker.run
(ThreadPoolExecutor.java:615)
  [30] java.lang.Thread.run (Thread.java:745)
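
To make the suggestion concrete, here is a rough sketch of the kind of
executor-side hook I have in mind. This is not the actual HadoopTableReader
code; the object name, method shape and broadcastedHiveConf parameter are all
illustrative:

  import org.apache.hadoop.hive.conf.HiveConf
  import org.apache.hadoop.hive.ql.metadata.Hive
  import org.apache.hadoop.hive.ql.plan.{PlanUtils, TableDesc}

  // Sketch only: install the broadcast HiveConf into the task thread's Hive
  // thread-local *before* the storage-handler properties are configured, so
  // that the no-arg Hive.get() inside PlanUtils finds it.
  object ExecutorSideJobConfSetup {
    def configureFor(tableDesc: TableDesc, broadcastedHiveConf: HiveConf): Unit = {
      Hive.get(broadcastedHiveConf)  // prime the per-thread Hive handle
      PlanUtils.configureInputJobPropertiesForStorageHandler(tableDesc)
    }
  }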


Re: Change for submitting to yarn in 1.3.1

2015-05-10 Thread Manku Timma
sc.applicationId gives the YARN application id.
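
A minimal sketch of that, assuming the id is needed from inside the driver
program itself (the object and app name below are just placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  // Read the YARN application id from the running context instead of going
  // through the (now private) yarn Client class.
  object PrintAppId {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("print-app-id"))
      // On YARN this is the id of the form application_<clusterTimestamp>_<sequence>.
      println(sc.applicationId)
      sc.stop()
    }
  }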

On 11 May 2015 at 08:13, Mridul Muralidharan  wrote:

> We had a similar requirement, and as a stopgap, I currently use a
> suboptimal, implementation-specific workaround - parsing it out of the
> stdout/stderr (based on log config).
> A better means to get at this is indeed required!
>
> Regards,
> Mridul
>
> On Sun, May 10, 2015 at 7:33 PM, Ron's Yahoo!
>  wrote:
> > Hi,
> >   I used to submit my Spark yarn applications by using the
> > org.apache.spark.deploy.yarn.Client API so that I could get the application
> > id after I submitted it. The following is the code that I have, but after
> > upgrading to 1.3.1, the yarn Client class was made into a private class. Is
> > there a particular reason why this Client class was made private?
> >   I know that there's a new SparkSubmit object that can be used, but it's
> > not clear to me how I can use it to get the application id after submitting
> > to the cluster.
> >   Thoughts?
> >
> > Thanks,
> > Ron
> >
> > class SparkLauncherServiceImpl extends SparkLauncherService {
> >
> >   override def runApp(conf: Configuration, appName: String, queue: String): ApplicationId = {
> >     val ws = SparkLauncherServiceImpl.getWorkspace()
> >     val params = Array("--class", //
> >       "com.xyz.sparkdb.service.impl.AssemblyServiceImpl", //
> >       "--name", appName, //
> >       "--queue", queue, //
> >       "--driver-memory", "1024m", //
> >       "--addJars", getListOfDependencyJars(s"$ws/ledp/le-sparkdb/target/dependency"), //
> >       "--jar", s"file:$ws/ledp/le-sparkdb/target/le-sparkdb-1.0.3-SNAPSHOT.jar")
> >     System.setProperty("SPARK_YARN_MODE", "true")
> >     System.setProperty("spark.driver.extraJavaOptions",
> >       "-XX:PermSize=128m -XX:MaxPermSize=128m -Dsun.io.serialization.extendedDebugInfo=true")
> >     val sparkConf = new SparkConf()
> >     val args = new ClientArguments(params, sparkConf)
> >     new Client(args, conf, sparkConf).runApp()
> >   }
> >
> >   private def getListOfDependencyJars(baseDir: String): String = {
> >     val files = new File(baseDir).listFiles().filter(!_.getName().startsWith("spark-assembly"))
> >     val prependedFiles = files.map(x => "file:" + x.getAbsolutePath())
> >     val result = ((prependedFiles.tail.foldLeft(new StringBuilder(prependedFiles.head))) {
> >       (acc, e) => acc.append(", ").append(e)
> >     }).toString()
> >     result
> >   }
> > }
> >
>
>