Paul Brenner created ZEPPELIN-5323:
--------------------------------------

             Summary: Interpreter Recovery Does Not Preserve Running Spark Jobs
                 Key: ZEPPELIN-5323
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-5323
             Project: Zeppelin
          Issue Type: Bug
            Reporter: Paul Brenner
We are using Zeppelin 0.10 built from master on March 26th; the most recent commit was 85ed8e2e51e1ea10df38d4710216343efe218d60. We tried to enable interpreter recovery by adding the following to zeppelin-site.xml:

{code:xml}
<property>
  <name>zeppelin.recovery.storage.class</name>
  <value>org.apache.zeppelin.interpreter.recovery.FileSystemRecoveryStorage</value>
  <description>RecoveryStorage implementation based on hadoop FileSystem</description>
</property>
<property>
  <name>zeppelin.recovery.dir</name>
  <value>/user/zeppelin/recovery</value>
  <description>Location where recovery metadata is stored</description>
</property>
{code}

When we start up Zeppelin we get no errors. I can start a job running, and I see that {{/user/zeppelin/recovery/spark_paul.recovery}} lists {{spark_paul-anonymous-2G3KV92PG 10.16.41.212:34374}}, so that looks promising.

When we stop Zeppelin, the interpreter process keeps running, but I see the following happen to the Spark job:

{noformat}
21/04/08 13:42:09 INFO yarn.YarnAllocator: Canceling requests for 262 executor container(s) to have a new desired total 0 executors.
21/04/08 13:42:09 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. zeppelin-212.sec.placeiq.net:36733
21/04/08 13:42:09 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. zeppelin-212.sec.placeiq.net:36733
21/04/08 13:42:09 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
21/04/08 13:42:09 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
21/04/08 13:42:09 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
{noformat}
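For reference, the recovery entry above appears to be a space-separated pair of interpreter group ID and {{host:port}}; this is an assumption from the single line observed, not a documented format. A minimal shell sketch to pull the pieces apart (on the cluster you would first fetch the file with {{hdfs dfs -cat /user/zeppelin/recovery/spark_paul.recovery}}):

```shell
# Sketch: split one FileSystemRecoveryStorage entry into its parts.
# Assumes each line looks like "<interpreterGroupId> <host>:<port>",
# as observed in spark_paul.recovery above.
entry='spark_paul-anonymous-2G3KV92PG 10.16.41.212:34374'
group_id=${entry%% *}   # everything before the first space
addr=${entry#* }        # everything after the first space
host=${addr%:*}         # strip the trailing :port
port=${addr##*:}        # strip everything up to the last colon
echo "group=$group_id host=$host port=$port"
```

If recovery works as I understand it, on restart Zeppelin should reconnect to that address rather than spawning a new interpreter process.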
{noformat}
21/04/08 13:42:09 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://nameservice1/user/pbrenner/.sparkStaging/application_1617808481394_4478
21/04/08 13:42:09 INFO util.ShutdownHookManager: Shutdown hook called
{noformat}

Then, when we start Zeppelin back up, I see the following on the paragraph that was running:

{code:java}
java.lang.RuntimeException: Interpreter instance org.apache.zeppelin.spark.SparkInterpreter not created
	at org.apache.zeppelin.interpreter.remote.PooledRemoteClient.callRemoteFunction(PooledRemoteClient.java:114)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.callRemoteFunction(RemoteInterpreterProcess.java:99)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:281)
	at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:442)
	at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:71)
	at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
	at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
	at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:182)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}

It looks VERY close to working, but somehow Spark jobs are still getting shut down when we shut down Zeppelin. Any ideas?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)