Andreas Weise created ZEPPELIN-3435:
---------------------------------------

             Summary: Interpreter timeout lifecycle leads to interpreter process orphans
                 Key: ZEPPELIN-3435
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3435
             Project: Zeppelin
          Issue Type: Bug
          Components: zeppelin-zengine
    Affects Versions: 0.8.0
            Reporter: Andreas Weise


We have configured our interpreters to time out after 60 minutes. From time to 
time, an interpreter is not closed properly and the remote interpreter process 
stays alive. This behavior is non-deterministic. 

When the timeout is reached, only the following is logged:
{noformat}
INFO [2018-04-27 13:06:44,329] ({Timer-0} TimeoutLifecycleManager.java[run]:49) 
- InterpreterGroup spark:shared_process is timeout.
INFO [2018-04-27 13:06:44,329] ({Timer-0} 
ManagedInterpreterGroup.java[close]:89) - Close InterpreterGroup: 
spark:shared_process
INFO [2018-04-27 13:06:44,329] ({Timer-0} 
ManagedInterpreterGroup.java[close]:100) - Close Session: 2D8VRV5M6 for 
interpreter setting: spark
WARN [2018-04-27 13:06:44,329] ({Timer-0} RemoteInterpreter.java[close]:199) - 
close is called when RemoterInterpreter is not opened for org.apache.zeppelin.
spark.SparkInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - 
close is called when RemoterInterpreter is not opened for org.apache.zeppelin.
spark.SparkSqlInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - 
close is called when RemoterInterpreter is not opened for org.apache.zeppelin.
spark.DepInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - 
close is called when RemoterInterpreter is not opened for org.apache.zeppelin.
spark.PySparkInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - 
close is called when RemoterInterpreter is not opened for org.apache.zeppelin.
spark.IPySparkInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - 
close is called when RemoterInterpreter is not opened for org.apache.zeppelin.
spark.SparkRInterpreter
INFO [2018-04-27 13:06:44,330] ({Timer-0} 
ManagedInterpreterGroup.java[close]:105) - Remove this InterpreterGroup: 
spark:shared_process as all the
sessions are closed
{noformat}
In a *successful* shutdown, the log continues with the entries below; everything 
after the first one is missing when this bug occurs:
{noformat}
INFO [2018-04-27 13:11:20,485] ({Timer-0} 
ManagedInterpreterGroup.java[close]:105) - Remove this InterpreterGroup: 
spark_FKT_Reports:shared_process as all the sessions are closed
INFO [2018-04-27 13:11:20,485] ({Timer-0} 
ManagedInterpreterGroup.java[close]:108) - Kill RemoteInterpreterProcess
INFO [2018-04-27 13:11:20,485] ({Timer-0} 
RemoteInterpreterManagedProcess.java[stop]:220) - Kill interpreter process
ERROR [2018-04-27 13:11:20,692] ({Thread-71907} 
RemoteInterpreterEventPoller.java[run]:257) - Can not get 
RemoteInterpreterEvent because it is shutdown.
ERROR [2018-04-27 13:11:20,692] ({pool-30-thread-1} 
AppendOutputRunner.java[run]:68) - Wait for OutputBuffer queue interrupted: null
WARN [2018-04-27 13:11:22,991] ({Timer-0} 
RemoteInterpreterManagedProcess.java[stop]:230) - ignore the exception when 
shutting down
INFO [2018-04-27 13:11:22,993] ({Timer-0} 
RemoteInterpreterManagedProcess.java[stop]:238) - Remote process terminated

{noformat}
So when the bug occurs, line 108 of ManagedInterpreterGroup (the "Kill RemoteInterpreterProcess" log statement) is never reached.
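
To spot affected interpreter groups in a server log, the difference between the two excerpts above can be checked mechanically. The following is only a diagnostic sketch, not part of Zeppelin: the function names are hypothetical, and it assumes that log entries start with a level and a bracketed timestamp, and that the kill entry is the next entry written by the same thread (as in the successful excerpt).

```python
import re

# Assumed entry format, taken from the excerpts above:
#   INFO [2018-04-27 13:06:44,330] ({Timer-0} File.java[method]:NN) - message
ENTRY_START = re.compile(r"^\s*(INFO|WARN|ERROR|DEBUG)\s+\[")
THREAD = re.compile(r"\(\{([^}]*)\}")
REMOVE_GROUP = re.compile(r"Remove this InterpreterGroup:\s*(\S+)")
KILL_MARKER = "Kill RemoteInterpreterProcess"


def join_entries(lines):
    """Merge wrapped continuation lines into single log entries."""
    entries = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if ENTRY_START.match(text) or not entries:
            entries.append(text)
        else:
            entries[-1] += " " + text
    return entries


def thread_of(entry):
    """Return the thread name of an entry, e.g. 'Timer-0', or None."""
    match = THREAD.search(entry)
    return match.group(1) if match else None


def groups_removed_without_kill(lines):
    """Return ids of InterpreterGroups whose removal entry is not followed
    by a kill entry from the same thread."""
    entries = join_entries(lines)
    suspects = []
    for i, entry in enumerate(entries):
        match = REMOVE_GROUP.search(entry)
        if not match:
            continue
        thread = thread_of(entry)
        # Find the next entry written by the same thread, if any.
        following = next(
            (e for e in entries[i + 1:] if thread_of(e) == thread), ""
        )
        if KILL_MARKER not in following:
            suspects.append(match.group(1))
    return suspects
```

Calling `groups_removed_without_kill(open(path))` on a log file returns the ids of groups that were removed without a kill entry; the path to the log is left to the caller. A removal that is the very last entry in the file is also reported, so a truncated log can produce a false positive.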

When a notebook is triggered after the timeout has occurred, an additional 
interpreter process is started and the first one stays alive forever.

Restarting the interpreter does not kill the first process either.

Only after restarting Zeppelin are all orphaned interpreter processes killed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
