Keiji Yoshida created ZEPPELIN-3077:
---------------------------------------

             Summary: Cron scheduler is easy to get stuck when one of the cron jobs takes long time or gets stuck
                 Key: ZEPPELIN-3077
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3077
             Project: Zeppelin
          Issue Type: Improvement
            Reporter: Keiji Yoshida
            Assignee: Keiji Yoshida

The cron scheduler gets stuck easily when one of the cron jobs takes a long time or hangs.

I sometimes see the cron scheduler stop working suddenly. According to a thread dump of ZeppelinServer, all of the DefaultQuartzScheduler_Worker threads were waiting for their jobs to complete, so no thread was left to launch a new job.

Here is an excerpt of the thread dump:

{code}
"DefaultQuartzScheduler_Worker-10" #76 prio=5 os_prio=0 tid=0x00007fb41d3b4000 nid=0x1b521 sleeping[0x00007fb3daef1000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
	- locked <0x00000000c0a7dbf0> (a java.lang.Object)

   Locked ownable synchronizers:
	- None

...

"DefaultQuartzScheduler_Worker-1" #67 prio=5 os_prio=0 tid=0x00007fb41d3cc800 nid=0x1b518 waiting on condition [0x00007fb3da372000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
	- locked <0x00000000c0a7dd90> (a java.lang.Object)

   Locked ownable synchronizers:
	- None
{code}

I still need to find out why the jobs got stuck, but I think the following issues make the Zeppelin cron scheduler prone to stalling when one of the cron jobs takes a long time or gets stuck:

1. The cron scheduler always launches the next execution of a notebook even if its previous execution is still running.
2. The cron worker thread always waits for the notebook to finish running even when there is no need to do so (i.e. when "auto-restart interpreter on cron execution" is not set to "on").

Because of these issues, when a cron job that gets stuck or takes a long time is scheduled to run at short intervals, it can easily occupy all of the DefaultQuartzScheduler_Worker threads, leaving no free thread to run other jobs.

Fixing these issues would let the Zeppelin cron scheduler keep working for as long as possible when a job gets stuck or takes a long time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
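
As a rough illustration of the first issue listed above, here is a minimal sketch of one way a cron trigger could be skipped while the previous run of the same notebook is still active. It uses only the JDK. `CronRunGuard` and its methods are hypothetical names invented for this example and are not part of Zeppelin or Quartz; how such a guard would be wired into `Notebook$CronJob.execute` is left open.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (an assumption for illustration, not Zeppelin code). */
final class CronRunGuard {
    // IDs of notes whose cron-triggered run has not finished yet.
    private final Set<String> running = ConcurrentHashMap.newKeySet();

    /** Returns true if a run for noteId may start, false if one is still active. */
    boolean tryStart(String noteId) {
        return running.add(noteId);
    }

    /** Marks the run as finished so that the next trigger can start. */
    void finish(String noteId) {
        running.remove(noteId);
    }

    /**
     * Runs the job unless a previous run of the same note is still active.
     * Returns true if the job ran, false if this firing was skipped.
     */
    boolean runIfIdle(String noteId, Runnable job) {
        if (!tryStart(noteId)) {
            // Skip this firing instead of tying up another worker thread.
            return false;
        }
        try {
            job.run();
            return true;
        } finally {
            finish(noteId);
        }
    }
}
```

With a guard like this, a notebook that hangs would hold at most one worker thread instead of gradually occupying the whole pool. Quartz 2.x also provides the `@DisallowConcurrentExecution` annotation for job classes, which may be worth evaluating as an alternative; whether it fits how Zeppelin registers its cron jobs is not something this report establishes.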