Keiji Yoshida created ZEPPELIN-3077:
---------------------------------------

             Summary: Cron scheduler is easy to get stuck when one of the cron 
jobs takes long time or gets stuck
                 Key: ZEPPELIN-3077
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3077
             Project: Zeppelin
          Issue Type: Improvement
            Reporter: Keiji Yoshida
            Assignee: Keiji Yoshida


The cron scheduler can easily get stuck when one of the cron jobs takes a long 
time or hangs.

I sometimes run into a problem where the cron scheduler suddenly stops 
working. According to a thread dump of ZeppelinServer, all of the 
DefaultQuartzScheduler_Worker threads were waiting for their jobs to complete, 
so there was no thread left to launch a new job.

Here are the contents of the thread dump:

{code}
"DefaultQuartzScheduler_Worker-10" #76 prio=5 os_prio=0 tid=0x00007fb41d3b4000 nid=0x1b521 sleeping[0x00007fb3daef1000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
        - locked <0x00000000c0a7dbf0> (a java.lang.Object)

   Locked ownable synchronizers:
        - None

...

"DefaultQuartzScheduler_Worker-1" #67 prio=5 os_prio=0 tid=0x00007fb41d3cc800 nid=0x1b518 waiting on condition [0x00007fb3da372000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
        - locked <0x00000000c0a7dd90> (a java.lang.Object)

   Locked ownable synchronizers:
        - None
{code}

I still need to find out why the jobs got stuck, but I think the following 
issues make the Zeppelin cron scheduler prone to stalling when one of the cron 
jobs takes a long time or gets stuck:

1. The cron scheduler always launches the next execution of a notebook, even 
if its previous execution is still running.
2. The cron worker thread always waits for the notebook to finish running, even 
when there is no need to do so (i.e. when "auto-restart interpreter on cron 
execution" is not set to "on").

Because of these issues, when a cron job that gets stuck or takes a long time 
is scheduled to run at short intervals, it can easily occupy all of the 
DefaultQuartzScheduler_Worker threads, leaving no free thread to run other new 
jobs.
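
To illustrate the exhaustion described above, here is a minimal sketch that 
uses a plain java.util.concurrent fixed pool as a stand-in for the Quartz 
worker pool. It is not Zeppelin or Quartz code: the class name, method name and 
parameters are made up for illustration, and the stand-in pool queues the extra 
job instead of making the scheduler wait for a free worker. The effect on the 
extra job is the same, though: it does not start while every worker is blocked.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo; a fixed pool stands in for the scheduler's worker threads. */
class WorkerStarvationSketch {

    /**
     * Occupies `stuckJobs` workers with jobs that never finish, then submits one
     * more job and reports whether it started within `timeoutMillis`.
     */
    static boolean extraJobRuns(int workers, int stuckJobs, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CountDownLatch neverReleased = new CountDownLatch(1); // simulates hung notebooks
        CountDownLatch extraStarted = new CountDownLatch(1);
        try {
            for (int i = 0; i < stuckJobs; i++) {
                pool.execute(() -> {
                    try {
                        neverReleased.await(); // blocks this worker until interrupted
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The "other new job" that needs a free worker.
            pool.execute(extraStarted::countDown);
            return extraStarted.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // interrupts the blocked workers so the demo can exit
        }
    }

    public static void main(String[] args) {
        // 10 workers, as the worker names in the thread dump suggest.
        System.out.println("all workers stuck:  " + extraJobRuns(10, 10, 500));
        System.out.println("one worker is free: " + extraJobRuns(10, 9, 500));
    }
}
```

With all workers occupied the extra job never starts within the timeout; with 
a single free worker it starts right away.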

Fixing these issues would let the Zeppelin cron scheduler keep working for as 
long as possible even when a job gets stuck or takes a long time.
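
A rough sketch of how the two fixes could look is below. This is only an 
illustration under my own assumptions, not the actual Notebook$CronJob code: 
the class CronGuardSketch, the method onTrigger, the autoRestart flag and the 
separate noteRunner pool are all hypothetical names. It skips a trigger when 
the previous run of the same note is still in progress, and it blocks the 
calling worker thread only when a follow-up step (such as restarting the 
interpreter) actually needs the run to be finished.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for the cron trigger handler; not Zeppelin's real code. */
class CronGuardSketch {
    // Ids of notes whose previous cron run has not finished yet.
    private final Set<String> running = ConcurrentHashMap.newKeySet();
    // Assumed separate pool that runs the notes, so scheduler workers stay free.
    private final ExecutorService noteRunner = Executors.newCachedThreadPool();

    /**
     * Called from a scheduler worker thread when a cron trigger fires.
     * Returns false if the trigger was skipped because the note is still running.
     */
    boolean onTrigger(String noteId, Runnable runAllParagraphs, boolean autoRestart) {
        // Fix 1: do not launch the next execution while the previous one runs.
        if (!running.add(noteId)) {
            return false;
        }
        CountDownLatch done = new CountDownLatch(1);
        noteRunner.execute(() -> {
            try {
                runAllParagraphs.run();
            } finally {
                running.remove(noteId);
                done.countDown();
            }
        });
        // Fix 2: wait for completion only when something must happen afterwards.
        if (autoRestart) {
            try {
                done.await();
                // The interpreter restart would go here (omitted in this sketch).
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return true;
    }

    void shutdown() {
        noteRunner.shutdownNow();
    }

    public static void main(String[] args) {
        CronGuardSketch sketch = new CronGuardSketch();
        CountDownLatch release = new CountDownLatch(1);
        Runnable slowNote = () -> {
            try {
                release.await(); // simulates a long-running note
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println(sketch.onTrigger("note-1", slowNote, false)); // true: started
        System.out.println(sketch.onTrigger("note-1", slowNote, false)); // false: skipped
        release.countDown();
        sketch.shutdown();
    }
}
```

Quartz's @DisallowConcurrentExecution annotation on the job class might be 
another way to address the first issue, depending on how the per-note jobs are 
registered; I have not verified that for Zeppelin.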



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
