Hi everyone,

I'm seeing the exception below coming out of Spark 1.0.1 when I call it from my application. I can't share that application's source, but the gist is that it uses Spark's Java APIs to read Avro files from HDFS, do some processing, and write Avro files back out. It does this by receiving a REST call and then spinning up a new JVM as the driver application that connects to Spark. I'm on CDH4.4.0 and have enabled both Kryo serialization and speculation. Spark is running in standalone mode on a 6-node cluster in AWS (not using Spark's EC2 scripts, though).
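In case it helps, the Kryo and speculation settings are just plain SparkConf properties set in the driver. A stripped-down sketch of roughly how the driver is configured against Spark 1.0.x (the master URL, app name, and class name below are placeholders, not the real ones from our application):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class AvroJobDriver {  // placeholder class name, not our real driver
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setMaster("spark://<standalone-master>:7077")   // standalone cluster
                .setAppName("avro-processing-job")               // placeholder app name
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .set("spark.speculation", "true");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... read Avro from HDFS, process, write Avro back out ...
            sc.stop();
        }
    }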
The stack traces below are reliably reproducible on every run of the job. The issue seems to be that, while deserializing a task result on the driver, Kryo blows up reading the ClassManifest. I've tried swapping in Kryo 2.23.1 in place of 2.21 (2.22 had some back-compat issues) but hit the same error. Any ideas on what can be done here? Thanks!

Andrew

In the driver (Kryo exception while deserializing a DirectTaskResult):

INFO | jvm 1 | 2014/07/30 20:52:52 | 20:52:52.667 [Result resolver thread-0] ERROR o.a.spark.scheduler.TaskResultGetter - Exception while getting task result
INFO | jvm 1 | 2014/07/30 20:52:52 | com.esotericsoftware.kryo.KryoException: Buffer underflow.
INFO | jvm 1 | 2014/07/30 20:52:52 | at com.esotericsoftware.kryo.io.Input.require(Input.java:156) ~[kryo-2.21.jar:na]
INFO | jvm 1 | 2014/07/30 20:52:52 | at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337) ~[kryo-2.21.jar:na]
INFO | jvm 1 | 2014/07/30 20:52:52 | at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:762) ~[kryo-2.21.jar:na]
INFO | jvm 1 | 2014/07/30 20:52:52 | at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:624) ~[kryo-2.21.jar:na]
INFO | jvm 1 | 2014/07/30 20:52:52 | at com.twitter.chill.ClassManifestSerializer.read(ClassManifestSerializer.scala:26) ~[chill_2.10-0.3.6.jar:0.3.6]
INFO | jvm 1 | 2014/07/30 20:52:52 | at com.twitter.chill.ClassManifestSerializer.read(ClassManifestSerializer.scala:19) ~[chill_2.10-0.3.6.jar:0.3.6]
INFO | jvm 1 | 2014/07/30 20:52:52 | at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) ~[kryo-2.21.jar:na]
INFO | jvm 1 | 2014/07/30 20:52:52 | at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:147) ~[spark-core_2.10-1.0.1.jar:1.0.1]
INFO | jvm 1 | 2014/07/30 20:52:52 | at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) ~[spark-core_2.10-1.0.1.jar:1.0.1]
INFO | jvm 1 | 2014/07/30 20:52:52 | at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:480) ~[spark-core_2.10-1.0.1.jar:1.0.1]
INFO | jvm 1 | 2014/07/30 20:52:52 | at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:316) ~[spark-core_2.10-1.0.1.jar:1.0.1]
INFO | jvm 1 | 2014/07/30 20:52:52 | at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) [spark-core_2.10-1.0.1.jar:1.0.1]
INFO | jvm 1 | 2014/07/30 20:52:52 | at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) [spark-core_2.10-1.0.1.jar:1.0.1]
INFO | jvm 1 | 2014/07/30 20:52:52 | at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) [spark-core_2.10-1.0.1.jar:1.0.1]
INFO | jvm 1 | 2014/07/30 20:52:52 | at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) [spark-core_2.10-1.0.1.jar:1.0.1]
INFO | jvm 1 | 2014/07/30 20:52:52 | at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) [spark-core_2.10-1.0.1.jar:1.0.1]
INFO | jvm 1 | 2014/07/30 20:52:52 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65]
INFO | jvm 1 | 2014/07/30 20:52:52 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65]
INFO | jvm 1 | 2014/07/30 20:52:52 | at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
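My (possibly wrong) reading of the error itself: Kryo throws "Buffer underflow." from Input.require when it's asked to read more bytes than the buffer it was handed actually contains, which suggests the serialized task result arriving at the driver is shorter than Kryo expects. A tiny standalone sketch, using only Kryo 2.21 and nothing from our application, that produces the same exception by truncating a serialized buffer:

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;
    import java.util.Arrays;

    public class KryoUnderflowDemo {  // hypothetical repro, not our application code
        public static void main(String[] args) {
            Kryo kryo = new Kryo();

            // Serialize a small object graph.
            Output output = new Output(1024);
            kryo.writeClassAndObject(output, new int[] {1, 2, 3, 4, 5});
            byte[] bytes = output.toBytes();

            // Hand Kryo a truncated copy of the buffer, analogous to a task
            // result whose bytes arrive short or corrupted.
            byte[] truncated = Arrays.copyOf(bytes, bytes.length / 2);
            Input input = new Input(truncated);
            kryo.readClassAndObject(input); // throws KryoException: Buffer underflow.
        }
    }

That at least matches the top frames above (Input.require -> readInt -> readReferenceOrNull), though I don't see why the result bytes would come up short in the first place.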
In the DAGScheduler (job gets aborted):

org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: Buffer underflow.
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

In an Executor (running tasks get killed):

14/07/29 22:57:38 INFO broadcast.HttpBroadcast: Started reading broadcast variable 0
14/07/29 22:57:39 INFO executor.Executor: Executor is trying to kill task 153
14/07/29 22:57:39 INFO executor.Executor: Executor is trying to kill task 147
14/07/29 22:57:39 INFO executor.Executor: Executor is trying to kill task 141
14/07/29 22:57:39 INFO executor.Executor: Executor is trying to kill task 135
14/07/29 22:57:39 INFO executor.Executor: Executor is trying to kill task 150
14/07/29 22:57:39 INFO executor.Executor: Executor is trying to kill task 144
14/07/29 22:57:39 INFO executor.Executor: Executor is trying to kill task 138
14/07/29 22:57:39 INFO storage.MemoryStore: ensureFreeSpace(241733) called with curMem=0, maxMem=30870601728
14/07/29 22:57:39 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 236.1 KB, free 28.8 GB)
14/07/29 22:57:39 INFO broadcast.HttpBroadcast: Reading broadcast variable 0 took 0.91790748 s
14/07/29 22:57:39 INFO storage.BlockManager: Found block broadcast_0 locally
14/07/29 22:57:39 INFO storage.BlockManager: Found block broadcast_0 locally
14/07/29 22:57:39 INFO storage.BlockManager: Found block broadcast_0 locally
14/07/29 22:57:39 INFO storage.BlockManager: Found block broadcast_0 locally
14/07/29 22:57:39 INFO storage.BlockManager: Found block broadcast_0 locally
14/07/29 22:57:39 INFO storage.BlockManager: Found block broadcast_0 locally
14/07/29 22:57:40 ERROR executor.Executor: Exception in task ID 135
org.apache.spark.TaskKilledException
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:174)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
14/07/29 22:57:40 ERROR executor.Executor: Exception in task ID 144
org.apache.spark.TaskKilledException
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:174)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
14/07/29 22:57:40 ERROR executor.Executor: Exception in task ID 150
org.apache.spark.TaskKilledException
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:174)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
14/07/29 22:57:40 ERROR executor.Executor: Exception in task ID 138
org.apache.spark.TaskKilledException
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:174)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
14/07/29 22:57:40 ERROR executor.Executor: Exception in task ID 141
org.apache.spark.TaskKilledException
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:174)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)