Paul Brenner created ZEPPELIN-5323:
--------------------------------------

             Summary: Interpreter Recovery Does Not Preserve Running Spark Jobs
                 Key: ZEPPELIN-5323
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-5323
             Project: Zeppelin
          Issue Type: Bug
            Reporter: Paul Brenner


We are using Zeppelin 0.10 built from master on March 26th; the most recent 
commit at build time was 85ed8e2e51e1ea10df38d4710216343efe218d60. We tried to 
enable interpreter recovery by adding the following to zeppelin-site.xml:
<property>
  <name>zeppelin.recovery.storage.class</name>
  <value>org.apache.zeppelin.interpreter.recovery.FileSystemRecoveryStorage</value>
  <description>RecoveryStorage implementation based on hadoop FileSystem</description>
</property>
<property>
  <name>zeppelin.recovery.dir</name>
  <value>/user/zeppelin/recovery</value>
  <description>Location where recovery metadata is stored</description>
</property>
When we start up Zeppelin we get no errors. I can start a job running, and I see 
that {{/user/zeppelin/recovery/spark_paul.recovery}} lists 
{{spark_paul-anonymous-2G3KV92PG       10.16.41.212:34374}}, so that looks 
promising.
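
For reference, here is a minimal sketch (purely illustrative, not part of our 
setup) of reading that recovery file through the Hadoop FileSystem API to confirm 
what gets recorded. The path and the line format are the ones we observed above; 
the class name and everything else in the sketch are assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper that dumps the recovery metadata written for an
// interpreter setting when FileSystemRecoveryStorage is enabled.
public class RecoveryFileCheck {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml/hdfs-site.xml from the classpath, so the default
    // filesystem resolves to the same HDFS namespace Zeppelin writes to.
    Configuration conf = new Configuration();
    Path recoveryFile = new Path("/user/zeppelin/recovery/spark_paul.recovery");
    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(recoveryFile), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // Each line maps an interpreter group id to the host:port of its running
        // process, e.g. "spark_paul-anonymous-2G3KV92PG    10.16.41.212:34374".
        System.out.println(line);
      }
    }
  }
}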

 
When we stop Zeppelin, the interpreter process keeps running, but the following 
happens to the Spark job:
21/04/08 13:42:09 INFO yarn.YarnAllocator: Canceling requests for 262 executor container(s) to have a new desired total 0 executors.
21/04/08 13:42:09 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. zeppelin-212.sec.placeiq.net:36733
21/04/08 13:42:09 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. zeppelin-212.sec.placeiq.net:36733
21/04/08 13:42:09 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
21/04/08 13:42:09 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
21/04/08 13:42:09 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
21/04/08 13:42:09 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://nameservice1/user/pbrenner/.sparkStaging/application_1617808481394_4478
21/04/08 13:42:09 INFO util.ShutdownHookManager: Shutdown hook called
 
Then, when we start Zeppelin back up, I see the following on the paragraph that 
was running:
java.lang.RuntimeException: Interpreter instance org.apache.zeppelin.spark.SparkInterpreter not created
        at org.apache.zeppelin.interpreter.remote.PooledRemoteClient.callRemoteFunction(PooledRemoteClient.java:114)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.callRemoteFunction(RemoteInterpreterProcess.java:99)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:281)
        at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:442)
        at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:71)
        at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
        at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
        at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:182)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
 

It looks VERY close to working, but somehow Spark jobs are still getting shut 
down when we shut down Zeppelin. Any ideas?



