Ruslan Dautkhanov created ZEPPELIN-1984:
-------------------------------------------

             Summary: Zeppelin Server doesn't catch all exceptions when 
launching a new interpreter process
                 Key: ZEPPELIN-1984
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-1984
             Project: Zeppelin
          Issue Type: Bug
          Components: zeppelin-interpreter, zeppelin-server
    Affects Versions: 0.7.0
         Environment: Zeppelin server from a month old master snapshot
            Reporter: Ruslan Dautkhanov


We saw the exception stack below when the Zeppelin server tried to start a new 
interpreter process, for example the Spark interpreter. It was really hard to 
debug, and the only way to capture the real root cause was to add 
{code}
LOG="/tmp/interpreter.sh-$$.log"
date >> $LOG
set -x
exec >> $LOG
exec 2>&1
{code} to the $zeppelinhome/bin/interpreter.sh file,
so that all stdout and stderr from interpreter.sh would go to that file.
That revealed the real problem:
{noformat}
Exception in thread "main" org.apache.spark.SparkException: Keytab file: /home/<username>/.kt does not exist
        at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:555)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
...
{noformat}
while all other Zeppelin logs and the note output showed only a misleading 
"Connection refused" (see the stack below):

{noformat}
ERROR [2017-01-18 16:54:38,533] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:1645) - Error
org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:232)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:400)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:105)
        at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:316)
        at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
        at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
...
{noformat}

The issue might be that interpreter.sh is started and then exits right away:
https://github.com/apache/zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterManagedProcess.java#L121
This early exit does not get captured anywhere. The only sign you'll see on the 
Zeppelin side is "Connection refused", because Zeppelin can't connect to the new 
interpreter process. We saw several different root causes (the spark-submit 
error above about the missing keytab file is just one of them), and every 
time we had to add tracing to interpreter.sh to capture the real problem.
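
To illustrate the failure mode described above, here is a minimal sketch of 
detecting a launcher that dies right after start. It uses plain 
java.lang.ProcessBuilder as a stand-in and is not Zeppelin's actual launch 
code; the class name EarlyExitCheck, the method startOrFail and the grace 
period parameter are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Zeppelin.
class EarlyExitCheck {

    /**
     * Starts the command and waits up to graceMillis to see whether it exits.
     * Returns the running process if it is still alive after the grace period;
     * throws if it has already exited, so the caller never tries to connect
     * to a process that is gone.
     */
    static Process startOrFail(long graceMillis, String... command)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command)
                .inheritIO() // child writes to the parent's stdout/stderr, so no pipe can fill up
                .start();
        // waitFor(timeout, unit) returns true only if the process has exited.
        if (p.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
            throw new IOException("Launcher exited early with code " + p.exitValue()
                    + ": " + String.join(" ", command));
        }
        return p;
    }
}
```

With a check like this the caller gets an error that names the exit code 
instead of a later "Connection refused". A fixed grace period is only a 
heuristic: a launcher that fails after the period has elapsed would still 
go unnoticed.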

We think there are two possible ways to improve this:
1) Capture the fact that interpreter.sh bailed out (and don't try to connect in 
https://github.com/apache/zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterManagedProcess.java#L132
 as that will only produce the expected "Connection refused").
2) If point 1) isn't possible for some reason (although I don't see why that 
would be), at least capture the errors produced by interpreter.sh, so that the 
error stack in the Zeppelin log files and in the output of the paragraph that 
kicked off the interpreter start contains some meaningful information.
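
As a rough sketch of how the launcher's own output could end up in the error 
that Zeppelin reports, the example below redirects stdout and stderr of the 
child process to a temporary file and includes that text in the exception if 
the process exits during a grace period. It uses plain 
java.lang.ProcessBuilder as an assumption and does not reflect how 
RemoteInterpreterManagedProcess is really implemented; LauncherOutputCapture, 
launch and graceMillis are hypothetical names.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Zeppelin.
class LauncherOutputCapture {

    /**
     * Starts the command with stdout and stderr redirected to a temp file.
     * If the process exits within graceMillis, throws an IOException whose
     * message carries the exit code and everything the process printed.
     * Otherwise returns the still-running process.
     */
    static Process launch(long graceMillis, String... command)
            throws IOException, InterruptedException {
        File log = File.createTempFile("interpreter-launch-", ".log");
        Process p = new ProcessBuilder(command)
                .redirectErrorStream(true) // fold stderr into stdout
                .redirectOutput(log)       // a file cannot block the child the way an undrained pipe can
                .start();
        if (p.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
            String output = new String(Files.readAllBytes(log.toPath()), StandardCharsets.UTF_8);
            throw new IOException("Interpreter launcher exited with code " + p.exitValue()
                    + " before a connection was attempted. Output:\n" + output);
        }
        return p;
    }
}
```

An exception built this way would carry a message such as the spark-submit 
keytab error into the server log and the paragraph output, which is the 
information that was missing in our case.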



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
