[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250373#comment-14250373 ]
Brock Noland commented on HIVE-9127:
------------------------------------

bq. In looking into HIVE-9135, I was wondering if it is better to fix the root cause of HIVE-7431 instead of disabling the cache for Spark.

I think that would be awesome. I think we disabled it early on when we were just trying to get HOS working.

bq. If so, probably we don't need this workaround?

I think this "workaround" results in better code generally. In CombineHiveInputFormat we were looking up the partition information on each loop iteration, but with this fix we do it once before the loop, which is generally better. A sketch of the pattern follows.
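To make the hoisting pattern concrete, here is a minimal Java sketch. It is illustrative only: the names (SplitGenSketch, lookupPartitionInfo, buildSplit) are hypothetical stand-ins, not Hive's actual CombineHiveInputFormat code, and the Map<String, String> type is a simplification of the real path-to-partition mapping.

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the pattern described above; not Hive's actual code.
class SplitGenSketch {

  // Before the fix: the partition lookup runs on every loop iteration.
  static void perIterationLookup(List<String> paths) {
    for (String path : paths) {
      // Repeated work: rebuilds the same mapping once per path.
      Map<String, String> partInfo = lookupPartitionInfo(paths);
      buildSplit(path, partInfo.get(path));
    }
  }

  // After the fix: the lookup is hoisted out of the loop and done once.
  static void hoistedLookup(List<String> paths) {
    Map<String, String> partInfo = lookupPartitionInfo(paths);
    for (String path : paths) {
      buildSplit(path, partInfo.get(path));
    }
  }

  // Stand-in for resolving each input path to its partition information.
  static Map<String, String> lookupPartitionInfo(List<String> paths) {
    Map<String, String> partInfo = new HashMap<>();
    for (String path : paths) {
      partInfo.put(path, "partition-desc-for-" + path);
    }
    return partInfo;
  }

  // Stand-in for creating a combine split for one path.
  static void buildSplit(String path, String partitionDesc) {
  }
}
{code}

If the lookup itself touches all n input paths, hoisting it out of the loop turns quadratic total work into linear, which is consistent with the getSplits hot path shown in the stack trace below.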
> Improve CombineHiveInputFormat.getSplit performance
> ---------------------------------------------------
>
>                 Key: HIVE-9127
>                 URL: https://issues.apache.org/jira/browse/HIVE-9127
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: 0.14.0
>            Reporter: Brock Noland
>            Assignee: Brock Noland
>         Attachments: HIVE-9127.1-spark.patch.txt, HIVE-9127.2-spark.patch.txt, HIVE-9127.3.patch.txt
>
>
> In HIVE-7431 we disabled caching of Map/Reduce works because some tasks would fail. However, we should be able to cache these objects in RSC for split generation. See https://issues.apache.org/jira/browse/HIVE-9124?focusedCommentId=14248622&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14248622 for how this impacts performance.
> Caller ST:
> {noformat}
> ....
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:328)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:421)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:510)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.Option.getOrElse(Option.scala:120)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> 2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.Option.getOrElse(Option.scala:120)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:79)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:192)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.Option.getOrElse(Option.scala:120)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD.dependencies(RDD.scala:190)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:301)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:313)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:247)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:735)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1382)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1368)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at akka.dispatch.Mailbox.run(Mailbox.scala:220)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2014-12-16 14:36:22,204 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)