Hi Spark users,
I've been hitting a consistent error that I've been trying to reproduce and
narrow down. I'm running a PySpark application on Spark 1.2 that reads Avro
files from Hadoop, and I was consistently seeing the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
After some searching, I learned that this most likely means my Hadoop client
and server versions are mismatched. I had the following versions at the time:
* Hadoop: hadoop-2.0.0-cdh4.7.0
* Spark: spark-1.2.0-bin-cdh4.2.0
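For what it's worth, here is my reading of the two version numbers in the
message. The mapping below is my own summary of the Hadoop release lines, not
something taken from the Spark or Hadoop docs, so treat it as an assumption:

```python
# My own summary of the IPC protocol versions behind this error message;
# the numbers reflect my understanding of Hadoop release lines, so treat
# them as an assumption rather than an authoritative table.
IPC_VERSION = {
    "hadoop-1.x / 0.20.x client": 4,   # the "client version 4" in the error
    "hadoop-2.0.x (CDH4) server": 7,   # the "Server IPC version 7" in the error
}

# Read this way, the error says a Hadoop 1.x client jar ended up on the
# driver classpath and is trying to talk to a CDH4 (Hadoop 2) NameNode.
print(IPC_VERSION)
```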
In the past, this setup never gave me a problem with Spark 1.1.1 or Spark
1.0.2. I figured it was worth rebuilding Spark in case I was wrong about the
versions, so I ran this command on the v1.2.0 tag:
./make-distribution.sh -Dhadoop.version=2.0.0-cdh4.7.0
I then retried my previously mentioned application with this new build of
Spark. Same error.
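One way I've been sanity-checking which Hadoop version a build actually
bundles is to look inside the assembly jar. This is a best-effort sketch: it
assumes Hadoop 2.x builds embed a common-version-info.properties resource
with a version= line, and the jar path in the comment is just illustrative:

```python
import zipfile


def bundled_hadoop_version(jar_path):
    """Best-effort check of the Hadoop version compiled into an assembly jar.

    Assumes (as I understand it) that Hadoop 2.x builds embed a
    common-version-info.properties resource containing a "version=" line.
    Returns None when no such resource is found.
    """
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if name.endswith("common-version-info.properties"):
                for line in jar.read(name).decode("utf-8", "replace").splitlines():
                    if line.startswith("version="):
                        return line.split("=", 1)[1].strip()
    return None


# Illustrative usage (the path is an assumption about the dist layout):
# print(bundled_hadoop_version("dist/lib/spark-assembly-1.2.0-hadoop2.0.0-cdh4.7.0.jar"))
```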
To narrow down the problem some more, I figured I should try the Avro-loading
example that ships with Spark. I ran the command below (I know it uses a
deprecated way of passing jars to the driver classpath):
SPARK_CLASSPATH="/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar:$SPARK_CLASSPATH" \
  bin/spark-submit ./examples/src/main/python/avro_inputformat.py \
  "hdfs://localhost:8020/path/to/file.avro"
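(As an aside, I believe the non-deprecated equivalent of that invocation
would look roughly like the below, using the same jar paths; this is a sketch
of the spark-submit flags as I understand them, not something I've verified
changes the outcome:)

```shell
# Hedged sketch: same invocation with the non-deprecated flags instead of
# SPARK_CLASSPATH; jar paths are the same placeholders as above.
bin/spark-submit \
  --driver-class-path "/path/to/avro-mapred-1.7.4-hadoop2.jar:lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar" \
  --jars /path/to/avro-mapred-1.7.4-hadoop2.jar,lib/spark-examples-1.2.0-hadoop2.0.0-cdh4.7.0.jar \
  ./examples/src/main/python/avro_inputformat.py \
  "hdfs://localhost:8020/path/to/file.avro"
```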
I ended up with the same error. The full stacktrace is below.
Traceback (most recent call last):
  File "/git/spark/dist/./examples/src/main/python/avro_inputformat.py", line 77, in <module>
    conf=conf)
  File "/git/spark/dist/python/pyspark/context.py", line 503, in newAPIHadoopFile
    jconf, batchSize)
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/git/spark/dist/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
    at org.apache.hadoop.ipc.Client.call(Client.java:1113)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
    at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
    at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
    at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:372)
    at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:774)
    at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:514)
    at org.apache.spark.api.python.PythonRDD$.newAPIHadoopFile(PythonRDD.scala:469)
    at org.apache.spark.api.python.PythonRDD.newAPIHadoopFile(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:724)
I could see my avro-mapred jar possibly being the problem. However, it is
also built for Hadoop 2 and hasn't caused problems in the past, so I don't
think it is the likely culprit.
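If it would help, I can dump the driver's effective classpath and look for a
stray Hadoop 1.x jar. This is a small helper I've been using for that; the
py4j call in the comment is how I fetch the classpath on Spark 1.x, and the
sample string below is made up, not output from my cluster:

```python
def suspect_classpath_entries(classpath, keywords=("hadoop", "avro")):
    """Return classpath entries that could carry a mismatched client jar."""
    return [entry for entry in classpath.split(":")
            if any(keyword in entry.lower() for keyword in keywords)]


# On a live driver I fetch the classpath through the Py4J gateway, roughly:
#   cp = sc._jvm.java.lang.System.getProperty("java.class.path")
# The string below is a made-up example for illustration only.
example = ("/opt/jars/avro-mapred-1.7.4-hadoop2.jar:"
           "/opt/jars/guava-14.jar:"
           "/opt/jars/hadoop-core-1.0.4.jar")
print(suspect_classpath_entries(example))
```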
Any suggestions for debugging, or more direct pointers to what is probably
wrong, would be much appreciated.
Michael