[ 
https://issues.apache.org/jira/browse/HIVE-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085875#comment-14085875
 ] 

Xuefu Zhang commented on HIVE-7540:
-----------------------------------

Here is what I have in mind:
1. Per-row serialization should use the Writable interface. Right now Hive (on 
Spark) uses Kryo, which is a workaround rather than a solution. 
[~hshreedharan] has a proposal for this, and in my opinion we should push 
for it.

2. It's acceptable to use Kryo to serialize non-per-row objects. In this 
particular case, RangePartitioner should allow Kryo if there is a problem with 
Writable, especially since Hive already sets serializer=kryo.

3. Java serialization should be avoided by all means. Obviously, Spark is 
closely tied to it, but Hive should not rely on it.

4. The proposed workaround mentioned above seems to solve the problem in #2 
while imposing a per-row penalty. I'm concerned about this.

In general, I like the idea of putting in a workaround to allow the project to 
proceed, but I'd also like us not to hide the real underlying problem just 
because an unacceptable workaround exists.
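
To make the trade-off concrete, here is a minimal Java sketch, using only the 
JDK, of why plain Java serialization rejects an object holding non-empty range 
bounds, why an empty bounds array gets through, and how the bounds could 
instead be written through Writable-style write/readFields calls. 
FakeBytesWritable, NaiveBounds, WritableAwareBounds and SerializationSketch 
are hypothetical stand-ins invented for illustration; they are not Hadoop, 
Spark or Hive classes, and this is not the fix proposed on this issue.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

// Hypothetical stand-in for a Writable such as BytesWritable: it knows how to
// write and read itself, but it does NOT implement java.io.Serializable.
class FakeBytesWritable {
    byte[] bytes = new byte[0];
    FakeBytesWritable() {}
    FakeBytesWritable(byte[] b) { bytes = b.clone(); }
    void write(DataOutput out) throws IOException {
        out.writeInt(bytes.length);
        out.write(bytes);
    }
    void readFields(DataInput in) throws IOException {
        bytes = new byte[in.readInt()];
        in.readFully(bytes);
    }
}

// Hypothetical stand-in for a partitioner that keeps its range bounds in a
// plain field and relies on default Java serialization.
class NaiveBounds implements Serializable {
    private static final long serialVersionUID = 1L;
    final FakeBytesWritable[] rangeBounds;
    NaiveBounds(FakeBytesWritable[] b) { rangeBounds = b; }
}

// Same data, but the bounds are transient and are written explicitly through
// their own write/readFields methods.
class WritableAwareBounds implements Serializable {
    private static final long serialVersionUID = 1L;
    transient FakeBytesWritable[] rangeBounds;
    WritableAwareBounds(FakeBytesWritable[] b) { rangeBounds = b; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeInt(rangeBounds.length);
        for (FakeBytesWritable w : rangeBounds) {
            w.write(out);
        }
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        rangeBounds = new FakeBytesWritable[in.readInt()];
        for (int i = 0; i < rangeBounds.length; i++) {
            rangeBounds[i] = new FakeBytesWritable();
            rangeBounds[i].readFields(in);
        }
    }
}

class SerializationSketch {
    static byte[] javaSerialize(Object o) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(o);
        }
        return buf.toByteArray();
    }

    static Object javaDeserialize(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    // True if default Java serialization rejects the object.
    static boolean failsToSerialize(Object o) throws IOException {
        try {
            javaSerialize(o);
            return false;
        } catch (NotSerializableException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        FakeBytesWritable[] bounds = { new FakeBytesWritable(new byte[] {1, 2, 3}) };
        System.out.println("non-empty bounds rejected: "
            + failsToSerialize(new NaiveBounds(bounds)));
        System.out.println("empty bounds rejected: "
            + failsToSerialize(new NaiveBounds(new FakeBytesWritable[0])));
        WritableAwareBounds copy = (WritableAwareBounds)
            javaDeserialize(javaSerialize(new WritableAwareBounds(bounds)));
        System.out.println("round trip kept bytes: "
            + Arrays.equals(copy.rangeBounds[0].bytes, bounds[0].bytes));
    }
}
```

Note that the explicit writeObject/readObject pair only runs when the holder 
object itself is serialized, not for every row, which is the distinction 
between points 2 and 4 above.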



> NotSerializableException encountered when using sortByKey transformation
> ------------------------------------------------------------------------
>
>                 Key: HIVE-7540
>                 URL: https://issues.apache.org/jira/browse/HIVE-7540
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>         Environment: Spark-1.0.1
>            Reporter: Rui Li
>
> This exception is thrown when sortByKey is used as the shuffle transformation 
> between MapWork and ReduceWork:
> {quote}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:772)
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:715)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:719)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:718)
>     at scala.collection.immutable.List.foreach(List.scala:318)
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:718)
>     at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:699)
> …
> {quote}
> The root cause is that the RangePartitioner used by sortByKey contains 
> rangeBounds: Array[BytesWritable], which Spark considers not serializable.
> A workaround for this issue is to set the number of partitions to 1 when 
> calling sortByKey, in which case rangeBounds is just an empty array.
> NO PRECOMMIT TESTS. This is for spark branch only.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
