Andreas Weise created ZEPPELIN-3435:
---------------------------------------

             Summary: Interpreter timeout lifecycle leads to interpreter process orphans
                 Key: ZEPPELIN-3435
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3435
             Project: Zeppelin
          Issue Type: Bug
          Components: zeppelin-zengine
    Affects Versions: 0.8.0
            Reporter: Andreas Weise

We have configured our interpreters to time out after 60 minutes. From time to time an interpreter is not closed properly: the remote interpreter process stays alive. This behavior is non-deterministic.

When the timeout is reached, only the following is logged:

{noformat}
INFO [2018-04-27 13:06:44,329] ({Timer-0} TimeoutLifecycleManager.java[run]:49) - InterpreterGroup spark:shared_process is timeout.
INFO [2018-04-27 13:06:44,329] ({Timer-0} ManagedInterpreterGroup.java[close]:89) - Close InterpreterGroup: spark:shared_process
INFO [2018-04-27 13:06:44,329] ({Timer-0} ManagedInterpreterGroup.java[close]:100) - Close Session: 2D8VRV5M6 for interpreter setting: spark
WARN [2018-04-27 13:06:44,329] ({Timer-0} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.SparkInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.SparkSqlInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.DepInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.PySparkInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.IPySparkInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.SparkRInterpreter
INFO [2018-04-27 13:06:44,330] ({Timer-0} ManagedInterpreterGroup.java[close]:105) - Remove this InterpreterGroup: spark:shared_process as all the sessions are closed
{noformat}

In a *successful* shutdown we also see the log entries that follow the "Remove this InterpreterGroup" line, but they are missing when this bug occurs:

{noformat}
INFO [2018-04-27 13:11:20,485] ({Timer-0} ManagedInterpreterGroup.java[close]:105) - Remove this InterpreterGroup: spark_FKT_Reports:shared_process as all the sessions are closed
INFO [2018-04-27 13:11:20,485] ({Timer-0} ManagedInterpreterGroup.java[close]:108) - Kill RemoteInterpreterProcess
INFO [2018-04-27 13:11:20,485] ({Timer-0} RemoteInterpreterManagedProcess.java[stop]:220) - Kill interpreter process
ERROR [2018-04-27 13:11:20,692] ({Thread-71907} RemoteInterpreterEventPoller.java[run]:257) - Can not get RemoteInterpreterEvent because it is shutdown.
ERROR [2018-04-27 13:11:20,692] ({pool-30-thread-1} AppendOutputRunner.java[run]:68) - Wait for OutputBuffer queue interrupted: null
WARN [2018-04-27 13:11:22,991] ({Timer-0} RemoteInterpreterManagedProcess.java[stop]:230) - ignore the exception when shutting down
INFO [2018-04-27 13:11:22,993] ({Timer-0} RemoteInterpreterManagedProcess.java[stop]:238) - Remote process terminated
{noformat}

So when the bug occurs, line 108 of ManagedInterpreterGroup is never reached.

When a notebook is triggered after the timeout has occurred, an additional interpreter process is started and the first one stays alive indefinitely. Restarting the interpreter does not kill the first process either. The orphaned interpreter processes are killed only when Zeppelin itself is restarted.
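
To spot affected interpreter groups in an existing Zeppelin server log, the difference between the two excerpts above can be checked mechanically. The following is a minimal sketch and not part of Zeppelin. It assumes the log format shown in this report and treats a "Remove this InterpreterGroup" entry as suspicious when the next entry from the same thread is not "Kill RemoteInterpreterProcess". The function name `find_unkilled_groups` is hypothetical.

```python
import re

# Patterns derived from the log excerpts in this report (assumed format).
THREAD_RE = re.compile(r"\(\{([^}]+)\}")
REMOVE_RE = re.compile(
    r"Remove this InterpreterGroup: (\S+) as all the sessions are closed"
)
KILL_MARKER = "Kill RemoteInterpreterProcess"


def find_unkilled_groups(lines):
    """Return ids of interpreter groups whose removal was not followed by a
    'Kill RemoteInterpreterProcess' entry from the same thread."""
    pending = {}   # thread name -> group id still waiting for a kill entry
    suspects = []
    for line in lines:
        thread_match = THREAD_RE.search(line)
        if not thread_match:
            continue  # not a log entry in the expected format
        thread = thread_match.group(1)

        # The first entry a thread logs after a removal decides the outcome.
        group = pending.pop(thread, None)
        if group is not None and KILL_MARKER not in line:
            suspects.append(group)

        remove_match = REMOVE_RE.search(line)
        if remove_match:
            pending[thread] = remove_match.group(1)

    # A removal that is the thread's last entry never saw a kill entry.
    suspects.extend(pending.values())
    return suspects


if __name__ == "__main__":
    sample = [
        "INFO [2018-04-27 13:06:44,330] ({Timer-0} ManagedInterpreterGroup.java[close]:105) - Remove this InterpreterGroup: spark:shared_process as all the sessions are closed",
        "INFO [2018-04-27 13:11:20,485] ({Timer-0} ManagedInterpreterGroup.java[close]:105) - Remove this InterpreterGroup: spark_FKT_Reports:shared_process as all the sessions are closed",
        "INFO [2018-04-27 13:11:20,485] ({Timer-0} ManagedInterpreterGroup.java[close]:108) - Kill RemoteInterpreterProcess",
    ]
    print(find_unkilled_groups(sample))  # ['spark:shared_process']
```

A group reported by this check is only a candidate. Whether its process is really still running has to be confirmed on the host, for example with the operating system's process list.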