[ https://issues.apache.org/jira/browse/HIVE-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085889#comment-14085889 ]

Brock Noland commented on HIVE-7540:
------------------------------------

bq. the proposed workaround mentioned above seems to solve the problem of #2 
while imposing a per-row penalty. I'm concerned about this.

I too am concerned about this, as it's potentially an expensive workaround. 
Implementing the "expensive" workaround here simply unblocks us; it doesn't 
"hide" the problem. We can open another JIRA to discuss the final solution 
while staying unblocked here, which gives us more time to gather information. 
That's valuable IMO.
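
(For context, a rough sketch of the kind of per-row conversion being weighed 
here — this is not the actual patch; the names and the byte-array key 
representation are my assumptions. Copying each BytesWritable key into a plain 
byte array keeps the sampled rangeBounds serializable, at the cost of one copy 
per row.)

{code:scala}
import java.util.Arrays

import org.apache.hadoop.io.BytesWritable
import org.apache.spark.SparkContext._ // Spark 1.0 implicits that enable sortByKey
import org.apache.spark.rdd.RDD

import scala.reflect.ClassTag

object PerRowCopySketch {
  // Lexicographic unsigned-byte comparison, matching BytesWritable's ordering.
  implicit val bytesOrdering: Ordering[Array[Byte]] = new Ordering[Array[Byte]] {
    def compare(a: Array[Byte], b: Array[Byte]): Int = {
      var i = 0
      val len = math.min(a.length, b.length)
      while (i < len) {
        val d = (a(i) & 0xff) - (b(i) & 0xff)
        if (d != 0) return d
        i += 1
      }
      a.length - b.length
    }
  }

  // Copy every BytesWritable key into a plain byte array (which IS
  // java.io.Serializable) before sorting, so the rangeBounds sampled by
  // RangePartitioner can be shipped with the task. The per-row copy is the
  // penalty discussed above.
  def sortWithCopies[V: ClassTag](rdd: RDD[(BytesWritable, V)]): RDD[(Array[Byte], V)] =
    rdd.map { case (k, v) => (Arrays.copyOf(k.getBytes, k.getLength), v) }
      .sortByKey()
}
{code}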

> NotSerializableException encountered when using sortByKey transformation
> ------------------------------------------------------------------------
>
>                 Key: HIVE-7540
>                 URL: https://issues.apache.org/jira/browse/HIVE-7540
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>         Environment: Spark-1.0.1
>            Reporter: Rui Li
>
> This exception is thrown when sortByKey is used as the shuffle transformation 
> between MapWork and ReduceWork:
> {quote}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:772)
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:715)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:719)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:718)
>     at scala.collection.immutable.List.foreach(List.scala:318)
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:718)
>     at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:699)
> …
> {quote}
>  The root cause is that the RangePartitioner used by sortByKey contains 
> rangeBounds: Array[BytesWritable], and BytesWritable does not implement 
> java.io.Serializable, so Spark cannot serialize it.
> A workaround is to set the number of partitions to 1 when calling sortByKey, 
> in which case rangeBounds is just an empty array.
> NO PRECOMMIT TESTS. This is for the Spark branch only.
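
For reference, a minimal sketch that reproduces the failure and applies the 
single-partition workaround described above, assuming Spark 1.0.x APIs in 
local mode (the data, app name, and Kryo setting are illustrative, not taken 
from the report):

{code:scala}
import org.apache.hadoop.io.BytesWritable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // Spark 1.0 implicits that enable sortByKey

object Hive7540Repro {
  // BytesWritable is a WritableComparable; delegate the ordering to compareTo.
  implicit val bwOrdering: Ordering[BytesWritable] = new Ordering[BytesWritable] {
    def compare(a: BytesWritable, b: BytesWritable): Int = a.compareTo(b)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("HIVE-7540-repro")
      // Kryo lets the Writable *records* survive the shuffle; the task closure,
      // which carries the partitioner, is still Java-serialized.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    val pairs = sc.parallelize(1 to 100).map(i => (new BytesWritable(Array(i.toByte)), i))

    // Fails: with numPartitions > 1, RangePartitioner samples the keys into
    // rangeBounds: Array[BytesWritable], and serializing the task closure then
    // throws NotSerializableException:
    // pairs.sortByKey(true, 2).count()

    // Workaround from the description: a single partition leaves rangeBounds
    // as an empty array, so nothing non-serializable is shipped.
    pairs.sortByKey(true, 1).count()
    sc.stop()
  }
}
{code}

Note that spark.serializer only governs shuffle data; the task closure (and 
with it the partitioner's rangeBounds) always goes through Java serialization, 
which is why shrinking rangeBounds to an empty array sidesteps the error.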



--
This message was sent by Atlassian JIRA
(v6.2#6252)
