Miguel created ZEPPELIN-2719:
--------------------------------

             Summary: Can't get Spark interpreter to work with Cloudera's YARN cluster
                 Key: ZEPPELIN-2719
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-2719
             Project: Zeppelin
          Issue Type: Bug
          Components: Interpreters
    Affects Versions: 0.7.2
         Environment: OS: Ubuntu 14.04.5 LTS
                      JRE: 1.7.0_67
                      Cloudera CDH 5.9.1
                      Hadoop 2.6.0-cdh5.9.1 in HA mode
                      Spark 1.6.1 running in a YARN cluster in HA mode
                      Scala 2.10
                      Kerberos
            Reporter: Miguel
Hi,

I'm having problems getting the Spark interpreter to work. Every time I try to run it I get a connection refused error:

{noformat}
java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    [...]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
{noformat}

I've spent a few days trying to debug the issue and I'm running out of ideas, so any help is greatly appreciated.

I have built Zeppelin for my environment using:

{noformat}
mvn clean package -Pspark-1.6 -Dhadoop.version=2.6.0-cdh5.9.1 -Phadoop-2.6 -Pvendor-repo -Pscala-2.10 -Pbuild-distr -DskipTests
{noformat}

And I have the following configuration in zeppelin-env.sh:

{noformat}
export JAVA_HOME=/usr/lib/jvm/java-7-oracle-cloudera/jre
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
{noformat}

I read in a different issue that lowering the memory settings could help, so I added:

{noformat}
export ZEPPELIN_JAVA_OPTS=" -Dspark.executor.memory=1g -Dspark.cores.max=2"
export ZEPPELIN_MEM=" -Xms512m -Xmx1024m -XX:MaxPermSize=256m"
export ZEPPELIN_INTP_MEM=" -Xms512m -Xmx1024m -XX:MaxPermSize=256m"
export SPARK_SUBMIT_OPTIONS=" --driver-memory 512M --executor-memory 1G"
{noformat}

But it doesn't seem to change anything; I get the same error.
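For anyone trying to rule out a path problem in the zeppelin-env.sh settings above, here is a minimal sketch (not part of Zeppelin) that parses simple `export NAME=value` lines and reports which of the configured directories are missing on the Zeppelin host. The helper names `parse_exports` and `missing_paths` are hypothetical, and the parser assumes one plain assignment per line with no shell expansion.

```python
import os
import re

# Matches simple lines such as: export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
EXPORT_RE = re.compile(r'^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$')


def parse_exports(text):
    """Return {name: value} for plain `export NAME=value` lines.

    Surrounding quotes and whitespace are stripped; anything that needs
    real shell evaluation (variable expansion, command substitution) is
    NOT handled by this sketch.
    """
    result = {}
    for line in text.splitlines():
        match = EXPORT_RE.match(line)
        if match:
            value = match.group(2).strip().strip('"\'').strip()
            result[match.group(1)] = value
    return result


def missing_paths(exports, keys):
    """Return the keys whose value is unset or is not an existing path."""
    return [key for key in keys if not os.path.exists(exports.get(key, ""))]


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever zeppelin-env.sh actually lives.
    with open("conf/zeppelin-env.sh") as handle:
        env = parse_exports(handle.read())
    keys = ["JAVA_HOME", "SPARK_HOME", "HADOOP_HOME", "HADOOP_CONF_DIR"]
    print("missing or unset:", missing_paths(env, keys))
```

An empty result only shows that the directories exist; it says nothing about whether their contents match the cluster.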
The Spark interpreter is configured as follows:

{noformat}
master: yarn-client
spark.app.name: Zeppelin
spark.yarn.keytab: /opt/zeppelin/zeppelin.keytab
spark.yarn.principal: zeppelin@<REALM>
zeppelin.dep.additionalRemoteRepository: spark-packages,http://dl.bintray.com/spark-packages/maven,false;
zeppelin.dep.localrepo: local-repo
zeppelin.pyspark.python: python
zeppelin.spark.concurrentSQL: false
zeppelin.spark.importImplicit: true
zeppelin.spark.maxResult: 1000
zeppelin.spark.printREPLOutput: true
zeppelin.spark.sql.stacktrace: false
zeppelin.spark.useHiveContext: true
{noformat}

The zeppelin Kerberos principal and keytab should be OK; I'm using them with Livy and it works.

Here are the relevant lines from zeppelin-zeppelin-<hostname>.log:

{noformat}
INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:188) - Create interpreter instance spark for note 2CGW3RAGX
INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:221) - Interpreter org.apache.zeppelin.spark.SparkInterpreter 799822533 created
INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:221) - Interpreter org.apache.zeppelin.spark.SparkSqlInterpreter 1517165558 created
INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:221) - Interpreter org.apache.zeppelin.spark.DepInterpreter 1928192475 created
INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:221) - Interpreter org.apache.zeppelin.spark.PySparkInterpreter 1602694095 created
INFO [2017-07-04 08:12:20,051] ({pool-2-thread-2} SchedulerFactory.java[jobStarted]:131) - Job paragraph_1495010482434_695017792 started by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpretershared_session1222353445
INFO [2017-07-04 08:12:20,052] ({pool-2-thread-2} Paragraph.java[jobRun]:362) - run paragraph 20170517-084122_2115191800 using spark org.apache.zeppelin.interpreter.LazyOpenInterpreter@2fac52c5
INFO [2017-07-04 08:12:20,060] ({pool-2-thread-2} RemoteInterpreterManagedProcess.java[start]:126) - Run interpreter process [/opt/zeppelin/zeppelin/bin/interpreter.sh, -d, /opt/zeppelin/zeppelin/interpreter/spark, -p, 52698, -l, /opt/zeppelin/zeppelin/local-repo/2CJKGGV2U]
ERROR [2017-07-04 08:12:50,124] ({Thread-36} RemoteScheduler.java[getStatus]:256) - Can't get status information
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
    at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
    at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
    at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
    at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:92)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.getStatus(RemoteScheduler.java:254)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.run(RemoteScheduler.java:212)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
    at org.apache.thrift.transport.TSocket.open(TSocket.java:187)
    at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
    ... 8 more
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:182)
    ... 9 more
ERROR [2017-07-04 08:12:50,124] ({pool-2-thread-2} RemoteInterpreter.java[open]:268) - Failed to initialize interpreter: org.apache.zeppelin.spark.SparkInterpreter. Remove it from interpreterGroup
ERROR [2017-07-04 08:12:50,125] ({Thread-35} RemoteInterpreterEventPoller.java[run]:102) - Can't get RemoteInterpreterEvent
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
    at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
    at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
    at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
    at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:92)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterEventPoller.run(RemoteInterpreterEventPoller.java:100)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
    at org.apache.thrift.transport.TSocket.open(TSocket.java:187)
    at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
    ... 7 more
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:182)
    ... 8 more
ERROR [2017-07-04 08:12:50,125] ({pool-2-thread-2} RemoteInterpreter.java[open]:268) - Failed to initialize interpreter: org.apache.zeppelin.spark.SparkSqlInterpreter. Remove it from interpreterGroup
ERROR [2017-07-04 08:12:50,125] ({pool-2-thread-2} RemoteInterpreter.java[open]:268) - Failed to initialize interpreter: org.apache.zeppelin.spark.DepInterpreter. Remove it from interpreterGroup
ERROR [2017-07-04 08:12:50,126] ({pool-2-thread-2} RemoteInterpreter.java[open]:268) - Failed to initialize interpreter: org.apache.zeppelin.spark.PySparkInterpreter. Remove it from interpreterGroup
ERROR [2017-07-04 08:12:50,126] ({pool-2-thread-2} Job.java[run]:188) - Job failed
org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:434)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:106)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:387)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
    at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
    at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
    at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
    at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:92)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:432)
    ... 11 more
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
    at org.apache.thrift.transport.TSocket.open(TSocket.java:187)
    at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
    ... 18 more
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:182)
    ... 19 more
{noformat}

There's no zeppelin-interpreter-spark-zeppelin-hostname.log being created.

The only errors I can see in the YARN logs are these:

{noformat}
log4j:ERROR Could not read configuration file from URL [file:/opt/zeppelin/zeppelin/conf/log4j.properties].
java.io.FileNotFoundException: /opt/zeppelin/zeppelin/conf/log4j.properties (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:146)
    at java.io.FileInputStream.<init>(FileInputStream.java:101)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
    at org.apache.spark.Logging$class.initializeLogging(Logging.scala:121)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.initializeLogging(ApplicationMaster.scala:635)
    at org.apache.spark.Logging$class.initializeLogIfNecessary(Logging.scala:106)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.initializeLogIfNecessary(ApplicationMaster.scala:635)
    at org.apache.spark.Logging$class.log(Logging.scala:50)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.log(ApplicationMaster.scala:635)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:649)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
log4j:ERROR Ignoring configuration file [file:/opt/zeppelin/zeppelin/conf/log4j.properties].
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
[...]
17/07/04 08:17:03 ERROR ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
17/07/04 08:17:03 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Timed out waiting for SparkContext.)
17/07/04 08:17:03 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Timed out waiting for SparkContext.)
{noformat}
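The Zeppelin log above shows the interpreter process being launched with `-p, 52698`, followed about 30 seconds later by the Thrift client failing with "Connection refused". That pattern suggests nothing was listening on that port when the server tried to connect. A minimal sketch for checking this by hand follows; it assumes the interpreter runs on the same host as the Zeppelin server (hence `localhost`), and the helper names `interpreter_port` and `port_is_open` are hypothetical, not Zeppelin APIs.

```python
import re
import socket

# Picks the "-p, <port>" argument out of the
# "Run interpreter process [.../interpreter.sh, ..., -p, 52698, ...]" log line.
PORT_RE = re.compile(r'interpreter\.sh,.*?-p,\s*(\d+)')


def interpreter_port(log_line):
    """Return the port passed to interpreter.sh in a log line, or None."""
    match = PORT_RE.search(log_line)
    return int(match.group(1)) if match else None


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    line = ("Run interpreter process [/opt/zeppelin/zeppelin/bin/interpreter.sh, "
            "-d, /opt/zeppelin/zeppelin/interpreter/spark, -p, 52698, "
            "-l, /opt/zeppelin/zeppelin/local-repo/2CJKGGV2U]")
    port = interpreter_port(line)
    # Assumption: the interpreter process runs on this same machine.
    print(port, "open" if port_is_open("localhost", port) else "closed")
```

The probe only helps while the paragraph is still starting, since the port changes on every launch. If it stays closed the whole time, the interpreter process probably exited or never bound the port, so running the logged interpreter.sh command by hand may surface the underlying error.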