Keiji Yoshida created ZEPPELIN-3077:
---------------------------------------
Summary: Cron scheduler is easy to get stuck when one of the cron
jobs takes long time or gets stuck
Key: ZEPPELIN-3077
URL: https://issues.apache.org/jira/browse/ZEPPELIN-3077
Project: Zeppelin
Issue Type: Improvement
Reporter: Keiji Yoshida
Assignee: Keiji Yoshida
The cron scheduler easily gets stuck when one of the cron jobs takes a long
time or hangs.
I sometimes find that the cron scheduler suddenly stops working. According to
a thread dump of ZeppelinServer, all of the DefaultQuartzScheduler_Worker
threads were waiting for their jobs to complete, so there was no thread left
to launch a new job.
Here are the contents of the thread dump:
{code}
"DefaultQuartzScheduler_Worker-10" #76 prio=5 os_prio=0 tid=0x00007fb41d3b4000 nid=0x1b521 sleeping[0x00007fb3daef1000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
        - locked <0x00000000c0a7dbf0> (a java.lang.Object)
   Locked ownable synchronizers:
        - None
...
"DefaultQuartzScheduler_Worker-1" #67 prio=5 os_prio=0 tid=0x00007fb41d3cc800 nid=0x1b518 waiting on condition [0x00007fb3da372000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
        - locked <0x00000000c0a7dd90> (a java.lang.Object)
   Locked ownable synchronizers:
        - None
{code}
I still need to find out why the jobs got stuck, but I think the following
issues make the Zeppelin cron scheduler prone to stalling when one of the cron
jobs takes a long time or gets stuck:
1. The cron scheduler always launches the next execution of a notebook even
if its previous execution is still running.
2. The cron worker thread always waits for the notebook to finish running even
when there is no need to do so (i.e. when "auto-restart interpreter on cron
execution" is not set to "on").
Because of these issues, when a cron job that gets stuck or takes a long time
is scheduled to run at short intervals, it can easily occupy all of the
DefaultQuartzScheduler_Worker threads, leaving no free thread to run other
jobs.
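The effect can be reproduced outside Zeppelin with a rough, JDK-only sketch.
This is not Zeppelin or Quartz code: a fixed-size thread pool stands in for the
DefaultQuartzScheduler_Worker threads, and the class and method names
(WorkerStarvationDemo, otherJobStarts) are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: a fixed pool stands in for the Quartz worker threads. */
final class WorkerStarvationDemo {

    /**
     * Submits stuckRuns executions of a job that never finishes on its own,
     * then reports whether one further job gets a worker within 500 ms.
     */
    static boolean otherJobStarts(int workers, int stuckRuns) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CountDownLatch unblock = new CountDownLatch(1);
        // Only min(workers, stuckRuns) of the stuck runs can actually start.
        CountDownLatch occupied = new CountDownLatch(Math.min(workers, stuckRuns));
        CountDownLatch otherStarted = new CountDownLatch(1);
        try {
            for (int i = 0; i < stuckRuns; i++) {
                pool.execute(() -> {
                    occupied.countDown();
                    awaitQuietly(unblock); // simulates a stuck notebook run
                });
            }
            occupied.await(); // every worker the stuck job can take is now taken
            pool.execute(otherStarted::countDown); // an unrelated cron job
            return otherStarted.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            unblock.countDown(); // release the stuck runs so the pool can shut down
            pool.shutdown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Ten overlapping runs of one stuck job fill a pool of ten workers.
        System.out.println(otherJobStarts(10, 10));
        // With only three stuck runs, a worker is still free for another job.
        System.out.println(otherJobStarts(10, 3));
    }
}
```

In this sketch the unrelated job starts only while the number of overlapping
stuck runs is smaller than the pool size, which matches the thread dump above
where Worker-1 through Worker-10 are all inside CronJob.execute.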
Fixing these issues would let the Zeppelin cron scheduler keep working for as
long as possible even when a job gets stuck or takes a long time.
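One possible shape for such a fix is sketched below using only the JDK. It is
an illustration under assumptions, not the actual Zeppelin change: the class
CronRunGuard, the method trigger, and the waitForCompletion flag (standing in
for the "auto-restart interpreter on cron execution" setting) are all
hypothetical names, and the notebook run is modelled as a plain Runnable
handed to a separate executor.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch, not the Zeppelin implementation. */
final class CronRunGuard {
    private final Set<String> runningNotes = ConcurrentHashMap.newKeySet();
    private final ExecutorService noteExecutor;

    CronRunGuard(ExecutorService noteExecutor) {
        this.noteExecutor = noteExecutor;
    }

    /**
     * Called from a cron worker thread. Returns false when the run is skipped
     * because the previous run of the same note is still in progress.
     */
    boolean trigger(String noteId, Runnable noteRun, boolean waitForCompletion) {
        // Issue 1: do not launch a new run while the previous one is running.
        if (!runningNotes.add(noteId)) {
            return false;
        }
        Future<?> run;
        try {
            run = noteExecutor.submit(() -> {
                try {
                    noteRun.run();
                } finally {
                    runningNotes.remove(noteId);
                }
            });
        } catch (RuntimeException e) {
            runningNotes.remove(noteId); // submission was rejected
            throw e;
        }
        // Issue 2: block the worker only when a follow-up step needs the result.
        if (waitForCompletion) {
            try {
                run.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        }
        return true;
    }

    /** Small scenario; returns {firstStarted, secondStarted, thirdStarted}. */
    static boolean[] demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CronRunGuard guard = new CronRunGuard(pool);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // A long run of note-1; trigger returns at once without waiting.
            boolean first = guard.trigger("note-1", () -> awaitQuietly(release), false);
            // The next cron tick for note-1 overlaps the first run and is skipped.
            boolean second = guard.trigger("note-1", () -> { }, false);
            release.countDown();
            // The single-threaded pool runs this after the first run has ended.
            guard.trigger("note-2", () -> { }, true);
            boolean third = guard.trigger("note-1", () -> { }, true);
            return new boolean[] {first, second, third};
        } finally {
            pool.shutdown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With a guard like this, a stuck notebook would hold at most one slot in the
executor that runs notes, and a cron worker would return immediately unless it
has to wait. Whether skipping an overlapping run is acceptable, rather than
queueing it, is a design decision this sketch does not settle.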