[ https://issues.apache.org/jira/browse/HIVE-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108843#comment-14108843 ]
Chengxiang Li commented on HIVE-7799:
-------------------------------------

The behavior depends on the implementation of {{ResultIterator.hasNext()}}. It is designed as a lazy iterator: it only calls {{processNextRecord()}} when the RowContainer is empty. However, as mentioned in previous comments, RowContainer does not support adding rows after it has already been read.

Here is what happens when different kinds of queries are executed:
# For a map-only job, the map output is written to a file directly, so no Collector is needed.
# For a MapReduce job with a GroupByOperator, {{HiveBaseFunctionResultList.collect()}} is triggered by {{closeRecordProcessor()}}, which is outside the lazy-computing logic, so the ResultIterator does not compute lazily in this case.
# For a MapReduce job without a GroupByOperator (such as cluster by queries), the ResultIterator computes lazily, and it clears the RowContainer each time before calling {{processNextRecord()}}.

When HiveBaseFunctionResultList is read and written in the same thread, the access sequence on RowContainer is ...clear()->addRow()->first()->clear()->addRow()->first()..., so it does not violate RowContainer's access rule. But when multiple threads read and write HiveBaseFunctionResultList, as the ScriptOperator does (which venki mentioned above), it would definitely hit this JIRA issue.

In my opinion, there are 2 solutions:
# Remove the ResultIterator lazy-computing feature, as patch1 does.
# Implement a RowContainer-like class that supports the current RowContainer features, is thread-safe, and supports adding rows after {{first()}} has already been called.

The second solution is quite complex, and supporting thread-safe access and write-after-read may degrade performance. Weighed against the performance gain from lazy computing, it is hard to say now whether it is worth it. So I suggest we take the first solution to fix this issue and leave the possible optimization to milestone 4.
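
To illustrate the single-threaded access sequence described above, here is a minimal Java sketch. It is not the Hive code: {{SketchRowContainer}}, {{LazyResultSketch}} and {{RowContainerAccessDemo}} are hypothetical names, rows are plain strings, and the container throws on a write after a read only so that misuse is visible in the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for RowContainer: rejects writes once a read has started.
class SketchRowContainer {
    private final List<String> rows = new ArrayList<>();
    private int readPos = -1; // -1: nothing read since the last clear()

    void clear() {
        rows.clear();
        readPos = -1;
    }

    void addRow(String row) {
        if (readPos >= 0) {
            // Assumption: fail fast in this sketch so that misuse is visible.
            throw new IllegalStateException("addRow() called after first()");
        }
        rows.add(row);
    }

    String first() {
        readPos = 0;
        return next();
    }

    String next() {
        if (readPos < 0 || readPos >= rows.size()) {
            return null;
        }
        return rows.get(readPos++);
    }
}

// Hypothetical lazy iterator: refills the container only when it is drained.
class LazyResultSketch implements Iterator<String> {
    private final Iterator<String> input;
    private final SketchRowContainer container = new SketchRowContainer();
    private String pending;
    final List<String> trace = new ArrayList<>(); // records container calls

    LazyResultSketch(Iterator<String> input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        if (pending != null) {
            return true;
        }
        pending = container.next(); // null when drained or never read
        while (pending == null && input.hasNext()) {
            container.clear();
            trace.add("clear");
            container.addRow(input.next()); // stands in for processNextRecord()
            trace.add("addRow");
            pending = container.first();
            trace.add("first");
        }
        return pending != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String row = pending;
        pending = null;
        return row;
    }
}

class RowContainerAccessDemo {
    public static void main(String[] args) {
        LazyResultSketch it = new LazyResultSketch(Arrays.asList("a", "b").iterator());
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        System.out.println(out);
        System.out.println(String.join("->", it.trace));
    }
}
```

In this sketch every {{addRow()}} is preceded by {{clear()}} on the same thread, so the write-after-read guard never fires. If a second thread called {{addRow()}} between {{first()}} and the next {{clear()}}, the guard would fire, which corresponds to the multi-threaded case discussed above.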
> TRANSFORM failed in transform_ppr1.q[Spark Branch]
> --------------------------------------------------
>
>                 Key: HIVE-7799
>                 URL: https://issues.apache.org/jira/browse/HIVE-7799
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Chengxiang Li
>            Assignee: Chengxiang Li
>              Labels: Spark-M1
>         Attachments: HIVE-7799.1-spark.patch, HIVE-7799.2-spark.patch, HIVE-7799.3-spark.patch
>
>
> Here is the exception:
> {noformat}
> 2014-08-20 01:14:36,594 ERROR executor.Executor (Logging.scala:logError(96)) - Exception in task 0.0 in stage 1.0 (TID 0)
> java.lang.NullPointerException
>         at org.apache.hadoop.hive.ql.exec.spark.HiveKVResultCache.next(HiveKVResultCache.java:113)
>         at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.next(HiveBaseFunctionResultList.java:124)
>         at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.next(HiveBaseFunctionResultList.java:82)
>         at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:42)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:54)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:722)
> {noformat}
> Basically, the cause is that RowContainer is misused(it's not allowed to write once someone read row from it), i'm trying to figure out whether it's a hive issue or just in hive on spark mode.

--
This message was sent by Atlassian JIRA
(v6.2#6252)