Hi, I have a test-running zeppelin connected to my spark cluster and cluster
setup like this:
Spark: spark-1.4.1-bin-hadoop2.6
Hadoop: hadoop-2.7.1
Zeppelin installed as: mvn clean install -Pspark-1.4 -Dhadoop.version=2.6.0
-Phadoop-2.6 -DskipTests
and ran a test code like this:
val textFile =
sc.textFile("hdfs://analytics-master:9000/user/kanizsa.lab/gutenberg")
print textFile.count()
val textFile = sc.textFile("README.md")
print textFile.count()
this code works well with scala interpreter but I'd encountered error when
executing similar code as %pyspark.
I'd also ran a code in pyspark command line shell but I works like a charm. no
error like below:
How to resolve this error? It seems something inside of zeppelin not compatible
with my spark cluster...
Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in
stage 5.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5.0 (TID
52, 10.32.10.97): java.io.InvalidClassException:
org.apache.spark.api.python.PythonRDD; local class incompatible: stream
classdesc serialVersionUID = 1521627685947625661, local class serialVersionUID
= -2629548733386970437
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:621)
at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred
while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.\n',
JavaObject id=o93), <traceback object at 0x1bde290>)