Hi, I'm trying to build a Docker image for Zeppelin from which I'll be able to use a standalone Spark cluster. For this I understand that I need to include a Spark installation in the image and point to it with the environment variable SPARK_HOME. I think I've done this correctly, but it doesn't seem to work. I hope that someone on this list can see what I'm missing.
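As a sanity check that the Spark installation really ends up where SPARK_HOME will point, I run the image built below like this (a hypothetical verification step, not part of the build; zeppelin:latest is the tag from the second Dockerfile):

```shell
# Hypothetical sanity check: confirm the copied Spark distribution is
# reachable at the path SPARK_HOME will point to inside the image.
docker run --rm zeppelin:latest /opt/spark/bin/spark-submit --version
```

If the copy worked, this should report Spark 3.0.1.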
I have a base image for Zeppelin:

```Dockerfile for zeppelin:alpine
FROM alpine:3.8

ARG DIST_MIRROR=http://archive.apache.org/dist/zeppelin
ARG VERSION=0.8.2

ENV ZEPPELIN_HOME=/opt/zeppelin \
    JAVA_HOME=/usr/lib/jvm/java-1.8-openjdk \
    PATH=$PATH:/usr/lib/jvm/java-1.8-openjdk/jre/bin:/usr/lib/jvm/java-1.8-openjdk/bin

RUN apk add --no-cache bash curl jq openjdk8 py3-pip && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    mkdir -p ${ZEPPELIN_HOME} && \
    curl ${DIST_MIRROR}/zeppelin-${VERSION}/zeppelin-${VERSION}-bin-all.tgz | tar xvz -C ${ZEPPELIN_HOME} && \
    mv ${ZEPPELIN_HOME}/zeppelin-${VERSION}-bin-all/* ${ZEPPELIN_HOME} && \
    rm -rf ${ZEPPELIN_HOME}/zeppelin-${VERSION}-bin-all && \
    rm -rf *.tgz

EXPOSE 8080

VOLUME ${ZEPPELIN_HOME}/logs \
       ${ZEPPELIN_HOME}/notebook

WORKDIR ${ZEPPELIN_HOME}

CMD ./bin/zeppelin.sh run
```

From this base image I include Spark 3.0.1, taken from the same bitnami image that my Spark cluster is using:

```Dockerfile for zeppelin:latest
FROM docker.io/bitnami/spark:3.0.1-debian-10-r32 AS sparkimage

FROM zeppelin:alpine

COPY --from=sparkimage /opt/bitnami/spark /opt/spark

RUN cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh && \
    echo "export SPARK_HOME=/opt/spark" >> conf/zeppelin-env.sh && \
    echo "export PYTHONPATH=\$SPARK_HOME/python/" >> conf/zeppelin-env.sh && \
    echo "export PYTHONPATH=\$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:\$PYTHONPATH" >> conf/zeppelin-env.sh && \
    echo "export PYSPARK_PYTHON=python3" >> conf/zeppelin-env.sh && \
    echo "export PYSPARK_DRIVER_PYTHON=python3" >> conf/zeppelin-env.sh

RUN cp conf/zeppelin-site.xml.template conf/zeppelin-site.xml

# From 0.8.2 on, the Zeppelin server binds to 127.0.0.1 by default instead of
# 0.0.0.0. Set the zeppelin.server.addr property or the ZEPPELIN_ADDR env
# variable to change this.
ENV ZEPPELIN_ADDR="0.0.0.0"
```

Now I start zeppelin:latest and make no changes to the interpreters at all; that isn't needed to reproduce my issue. Later, once starting a pyspark interpreter works, I'd set spark.master to spark://spark-master:7077. I open a new notebook and run:

```example
%python
import pyspark
print(pyspark.version.__version__)
```

which prints

```output
3.0.1
```

This is exactly what I expect.
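One detail I'm not certain about: zeppelin-env.sh hardcodes py4j-0.10.9-src.zip, so I also check that this matches what the bitnami image actually ships (again a hypothetical check; the path comes from the COPY destination above):

```shell
# Hypothetical check: zeppelin-env.sh hardcodes the py4j zip name, so list
# what the copied Spark distribution actually bundles under python/lib.
docker run --rm zeppelin:latest ls /opt/spark/python/lib/
```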
Now comes the troublesome part:

```example
%pyspark
print(sc)
```

fails with

```output
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
    at org.apache.zeppelin.spark.BaseSparkScalaInterpreter.getUserJars(BaseSparkScalaInterpreter.scala:382)
    at org.apache.zeppelin.spark.SparkScala211Interpreter.open(SparkScala211Interpreter.scala:71)
    at org.apache.zeppelin.spark.NewSparkInterpreter.open(NewSparkInterpreter.java:102)
    at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:62)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
    at org.apache.zeppelin.spark.PySparkInterpreter.getSparkInterpreter(PySparkInterpreter.java:664)
    at org.apache.zeppelin.spark.PySparkInterpreter.createGatewayServerAndStartScript(PySparkInterpreter.java:260)
    at org.apache.zeppelin.spark.PySparkInterpreter.open(PySparkInterpreter.java:194)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:616)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:140)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```

What am I missing to get the %pyspark interpreter to work?
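In case it helps with diagnosis: the stack trace mentions SparkScala211Interpreter, so I could compare that with the Scala version the bundled Spark was built against (a hypothetical check; the jars path is my assumption based on the standard Spark layout):

```shell
# Hypothetical diagnostic: list the scala-library jar shipped with the Spark
# distribution that was copied into the image.
docker run --rm zeppelin:latest ls /opt/spark/jars/ | grep scala-library
```

If anything is unclear, don't hesitate to ask more.

===========
Patrik Iselind, IDD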