Miguel created ZEPPELIN-2719:
--------------------------------

             Summary: Can't get Spark interpreter to work with Cloudera's YARN cluster
                 Key: ZEPPELIN-2719
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-2719
             Project: Zeppelin
          Issue Type: Bug
          Components: Interpreters
    Affects Versions: 0.7.2
         Environment: OS: Ubuntu 14.04.5 LTS
JRE: 1.7.0_67
Cloudera CDH 5.9.1
Hadoop 2.6.0-cdh5.9.1 in HA mode
Spark 1.6.1 running in a YARN cluster in HA mode
Scala 2.10
Kerberos
            Reporter: Miguel


Hi,

I'm having problems getting the Spark interpreter to work. Every time I try to 
run it I get a connection refused error:
{noformat}
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        [...]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
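
For context, "Connection refused" means nothing was accepting TCP connections on the target host and port when the client tried to connect. A minimal sketch for probing that by hand is below. The helper name `port_is_open` is hypothetical, and the host and port in the usage lines are assumptions: the port is the value passed after `-p` in the server log further down, and it changes on every interpreter launch.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds, else False."""
    try:
        # create_connection raises OSError subclasses such as
        # ConnectionRefusedError or a timeout when the connect fails.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed values: run on the Zeppelin host while a paragraph is starting.
    print(port_is_open("localhost", 52698))
```

If this reports False for the whole start-up window, the interpreter process most likely never opened its listening socket, which points at the process launch rather than at the network.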

I've spent a few days trying to debug the issue and I'm at a point where I'm 
running out of ideas, so any help is greatly appreciated.

I have built Zeppelin for my environment using:
{noformat}
mvn clean package -Pspark-1.6 -Dhadoop.version=2.6.0-cdh5.9.1 -Phadoop-2.6 -Pvendor-repo -Pscala-2.10 -Pbuild-distr -DskipTests
{noformat}

And I have the following configuration in zeppelin-env.sh:
{noformat}
export JAVA_HOME=/usr/lib/jvm/java-7-oracle-cloudera/jre
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
{noformat}
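
A quick sanity check for these settings is to confirm that each variable points at a directory that actually exists on the Zeppelin host. The sketch below is only an illustration: the helper `missing_dirs` and the `REQUIRED_DIRS` list are hypothetical names, and it assumes it is run in an environment where zeppelin-env.sh has already been sourced.

```python
import os

# The four variables exported in zeppelin-env.sh above.
REQUIRED_DIRS = ["JAVA_HOME", "SPARK_HOME", "HADOOP_HOME", "HADOOP_CONF_DIR"]


def missing_dirs(env, names=REQUIRED_DIRS):
    """Return the variable names that are unset or not an existing directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]


if __name__ == "__main__":
    # Prints an empty list when every variable resolves to a directory.
    print(missing_dirs(os.environ))
```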

I read in a different issue that lowering the memory settings could help, so I 
added:
{noformat}
export ZEPPELIN_JAVA_OPTS=" -Dspark.executor.memory=1g -Dspark.cores.max=2"
export ZEPPELIN_MEM=" -Xms512m -Xmx1024m -XX:MaxPermSize=256m"
export ZEPPELIN_INTP_MEM=" -Xms512m -Xmx1024m -XX:MaxPermSize=256m"
export SPARK_SUBMIT_OPTIONS=" --driver-memory 512M --executor-memory 1G"
{noformat}

But it doesn't seem to change anything; I get the same error.

The Spark interpreter is configured as follows:
{noformat}
master: yarn-client
spark.app.name: Zeppelin
spark.yarn.keytab:      /opt/zeppelin/zeppelin.keytab
spark.yarn.principal:   zeppelin@<REALM>
zeppelin.dep.additionalRemoteRepository:        spark-packages,http://dl.bintray.com/spark-packages/maven,false;
zeppelin.dep.localrepo: local-repo
zeppelin.pyspark.python:        python
zeppelin.spark.concurrentSQL:   false
zeppelin.spark.importImplicit:  true
zeppelin.spark.maxResult:       1000
zeppelin.spark.printREPLOutput: true
zeppelin.spark.sql.stacktrace:  false
zeppelin.spark.useHiveContext:  true
{noformat}

The zeppelin Kerberos principal and keytab should be OK; I'm using them with 
Livy and they work there.

Here are the relevant lines from zeppelin-zeppelin-<hostname>.log:
{noformat}
 INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:188) - Create interpreter instance spark for note 2CGW3RAGX
 INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:221) - Interpreter org.apache.zeppelin.spark.SparkInterpreter 799822533 created
 INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:221) - Interpreter org.apache.zeppelin.spark.SparkSqlInterpreter 1517165558 created
 INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:221) - Interpreter org.apache.zeppelin.spark.DepInterpreter 1928192475 created
 INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:221) - Interpreter org.apache.zeppelin.spark.PySparkInterpreter 1602694095 created
 INFO [2017-07-04 08:12:20,051] ({pool-2-thread-2} SchedulerFactory.java[jobStarted]:131) - Job paragraph_1495010482434_695017792 started by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpretershared_session1222353445
 INFO [2017-07-04 08:12:20,052] ({pool-2-thread-2} Paragraph.java[jobRun]:362) - run paragraph 20170517-084122_2115191800 using spark org.apache.zeppelin.interpreter.LazyOpenInterpreter@2fac52c5
 INFO [2017-07-04 08:12:20,060] ({pool-2-thread-2} RemoteInterpreterManagedProcess.java[start]:126) - Run interpreter process [/opt/zeppelin/zeppelin/bin/interpreter.sh, -d, /opt/zeppelin/zeppelin/interpreter/spark, -p, 52698, -l, /opt/zeppelin/zeppelin/local-repo/2CJKGGV2U]
ERROR [2017-07-04 08:12:50,124] ({Thread-36} RemoteScheduler.java[getStatus]:256) - Can't get status information
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
        at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
        at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
        at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
        at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
        at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
        at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:92)
        at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.getStatus(RemoteScheduler.java:254)
        at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.run(RemoteScheduler.java:212)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
        at org.apache.thrift.transport.TSocket.open(TSocket.java:187)
        at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
        ... 8 more
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at org.apache.thrift.transport.TSocket.open(TSocket.java:182)
        ... 9 more
ERROR [2017-07-04 08:12:50,124] ({pool-2-thread-2} RemoteInterpreter.java[open]:268) - Failed to initialize interpreter: org.apache.zeppelin.spark.SparkInterpreter. Remove it from interpreterGroup
ERROR [2017-07-04 08:12:50,125] ({Thread-35} RemoteInterpreterEventPoller.java[run]:102) - Can't get RemoteInterpreterEvent
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
        at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
        at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
        at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
        at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
        at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
        at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:92)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterEventPoller.run(RemoteInterpreterEventPoller.java:100)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
        at org.apache.thrift.transport.TSocket.open(TSocket.java:187)
        at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
        ... 7 more
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at org.apache.thrift.transport.TSocket.open(TSocket.java:182)
        ... 8 more
ERROR [2017-07-04 08:12:50,125] ({pool-2-thread-2} RemoteInterpreter.java[open]:268) - Failed to initialize interpreter: org.apache.zeppelin.spark.SparkSqlInterpreter. Remove it from interpreterGroup
ERROR [2017-07-04 08:12:50,125] ({pool-2-thread-2} RemoteInterpreter.java[open]:268) - Failed to initialize interpreter: org.apache.zeppelin.spark.DepInterpreter. Remove it from interpreterGroup
ERROR [2017-07-04 08:12:50,126] ({pool-2-thread-2} RemoteInterpreter.java[open]:268) - Failed to initialize interpreter: org.apache.zeppelin.spark.PySparkInterpreter. Remove it from interpreterGroup
ERROR [2017-07-04 08:12:50,126] ({pool-2-thread-2} Job.java[run]:188) - Job failed
org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:434)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:106)
        at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:387)
        at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
        at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
        at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
        at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
        at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
        at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
        at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
        at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:92)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:432)
        ... 11 more
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
        at org.apache.thrift.transport.TSocket.open(TSocket.java:187)
        at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
        ... 18 more
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at org.apache.thrift.transport.TSocket.open(TSocket.java:182)
        ... 19 more
{noformat}
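
One detail worth pulling out of the log above is the gap between the "Run interpreter process" line and the first ERROR line. The sketch below computes it from the two timestamps. The names `STAMP` and `seconds_between` are hypothetical helpers, and the two sample lines are shortened copies of the entries above.

```python
import re
from datetime import datetime

# Matches the bracketed timestamp used in the Zeppelin server log.
STAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(first_line, second_line):
    """Return the seconds elapsed between the timestamps of two log lines."""
    first, second = (
        datetime.strptime(STAMP.search(line).group(1), STAMP_FORMAT)
        for line in (first_line, second_line)
    )
    return (second - first).total_seconds()


LAUNCH = " INFO [2017-07-04 08:12:20,060] ({pool-2-thread-2} RemoteInterpreterManagedProcess.java[start]:126) - Run interpreter process"
FIRST_ERROR = "ERROR [2017-07-04 08:12:50,124] ({Thread-36} RemoteScheduler.java[getStatus]:256) - Can't get status information"

if __name__ == "__main__":
    print(seconds_between(LAUNCH, FIRST_ERROR))
```

The gap is just over 30 seconds, which would be consistent with the server waiting on a connection timeout of about that length before giving up. That is an inference from the timestamps, not something the log states.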

No zeppelin-interpreter-spark-zeppelin-<hostname>.log file is being created.
The only errors I can see in the YARN logs are these:
{noformat}
log4j:ERROR Could not read configuration file from URL [file:/opt/zeppelin/zeppelin/conf/log4j.properties].
java.io.FileNotFoundException: /opt/zeppelin/zeppelin/conf/log4j.properties (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at java.io.FileInputStream.<init>(FileInputStream.java:101)
        at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
        at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
        at org.apache.spark.Logging$class.initializeLogging(Logging.scala:121)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.initializeLogging(ApplicationMaster.scala:635)
        at org.apache.spark.Logging$class.initializeLogIfNecessary(Logging.scala:106)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.initializeLogIfNecessary(ApplicationMaster.scala:635)
        at org.apache.spark.Logging$class.log(Logging.scala:50)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.log(ApplicationMaster.scala:635)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:649)
        at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
log4j:ERROR Ignoring configuration file [file:/opt/zeppelin/zeppelin/conf/log4j.properties].
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
[...]
17/07/04 08:17:03 ERROR ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
17/07/04 08:17:03 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Timed out waiting for SparkContext.)
17/07/04 08:17:03 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Timed out waiting for SparkContext.)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
