Ruslan Dautkhanov created ZEPPELIN-1984:
-------------------------------------------
Summary: Zeppelin Server doesn't catch all exception when launching a new interpreter process
Key: ZEPPELIN-1984
URL: https://issues.apache.org/jira/browse/ZEPPELIN-1984
Project: Zeppelin
Issue Type: Bug
Components: zeppelin-interpreter, zeppelin-server
Affects Versions: 0.7.0
Environment: Zeppelin server from a month old master snapshot
Reporter: Ruslan Dautkhanov

We saw the exception stack below when the Zeppelin server tried to start a new interpreter process, for example the Spark interpreter. It was really hard to debug, and the only way to capture the real root cause was to add

{code}
# Send everything interpreter.sh prints (stdout, stderr, and the set -x trace) to a per-process log file
LOG="/tmp/interpreter.sh-$$.log"
date >> "$LOG"
set -x
exec >> "$LOG"
exec 2>&1
{code}

to the $zeppelinhome/bin/interpreter.sh file, so that all stdout and stderr from interpreter.sh would go to that file. That showed the real problem:

{noformat}
Exception in thread "main" org.apache.spark.SparkException: Keytab file: /home/<username>/.kt does not exist
	at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:555)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
	...
{noformat}

while all other Zeppelin logs and the note output showed a misleading "Connection refused"; see the stack below:

{noformat}
ERROR [2017-01-18 16:54:38,533] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:1645) - Error
org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:232)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:400)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:105)
	at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:316)
	at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
	at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	...
{noformat}

The issue might be that interpreter.sh is started and then exits right away (https://github.com/apache/zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterManagedProcess.java#L121), and this does not get captured anywhere. The only sign you'll see on the Zeppelin side is "Connection refused", as Zeppelin isn't able to connect to the new interpreter process. We saw different root causes (the spark-submit error above about the missing keytab file is just one of them), and every time we had to add tracing to interpreter.sh to capture the real problem.
We think there are two possible ways to improve that:

1) capture the fact that interpreter.sh bails out (and don't try to connect in https://github.com/apache/zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterManagedProcess.java#L132, as it'll produce the expected "Connection refused")
2) if point 1) isn't possible for some reason (although I don't know why that would be), at least capture the errors produced by interpreter.sh, so that the error stack in the Zeppelin log files and the output of the paragraph that kicked off the interpreter start would have some meaningful information.
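
To illustrate both points, here is a minimal sketch of a fail-fast launch. It is not Zeppelin's actual code: the class EarlyExitLauncher, the exception LaunchFailedException, and the graceMillis parameter are hypothetical names chosen for this example, and it uses only the standard java.lang.ProcessBuilder API. The idea is to give the launcher script a short grace period, and if it has already exited by then, report its exit code and output instead of attempting a connection that can only end in "Connection refused".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

/** Hypothetical helper (not part of Zeppelin): launches a script and fails fast if it exits early. */
public class EarlyExitLauncher {

    /** Hypothetical exception carrying the launcher's exit code and captured output. */
    public static class LaunchFailedException extends Exception {
        public LaunchFailedException(String message) {
            super(message);
        }
    }

    /**
     * Starts the command and waits up to graceMillis for it to exit.
     * Returns the live process if it is still running after the grace period;
     * throws LaunchFailedException with the captured output if it exited early.
     */
    public static Process launch(List<String> command, long graceMillis)
            throws IOException, InterruptedException, LaunchFailedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout so one stream holds everything
        Process process = builder.start();

        // waitFor(timeout, unit) returns true only if the process has exited within the timeout.
        if (process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
            String output;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                output = reader.lines().collect(Collectors.joining(System.lineSeparator()));
            }
            throw new LaunchFailedException(
                    "interpreter launcher exited with code " + process.exitValue()
                            + " before accepting connections; output:" + System.lineSeparator() + output);
        }
        return process; // still running, so the caller may now try to connect
    }
}
```

With something like this, a call such as EarlyExitLauncher.launch(Arrays.asList("sh", "-c", "echo 'Keytab file does not exist' >&2; exit 3"), 5000) would raise an exception whose message contains both the exit code and the keytab error, which could then be logged and shown in the paragraph output. The sketch leaves out things a real fix would need: it does not drain the output of a process that stays alive (a long-running interpreter would need a reader thread for that), and a launcher that fails after the grace period would still go unnoticed unless the connect loop also checks whether the process is alive.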