GitHub user kjmrknsn opened a pull request:

    https://github.com/apache/zeppelin/pull/2687

    [ZEPPELIN-3077] Cron scheduler is easy to get stuck when one of the cron 
jobs takes long time or gets stuck

    ### What is this PR for?
    The cron scheduler easily gets stuck when one of the cron jobs takes a long time or hangs.
    
    I sometimes run into an issue where the cron scheduler suddenly stops working. According to a thread dump of ZeppelinServer, all of the DefaultQuartzScheduler_Worker threads were waiting for their jobs to complete, so no thread was left to launch a new job.
    
    Here are the contents of the thread dump:
    
    ```
    "DefaultQuartzScheduler_Worker-10" #76 prio=5 os_prio=0 tid=0x00007fb41d3b4000 nid=0x1b521 sleeping[0x00007fb3daef1000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
            at java.lang.Thread.sleep(Native Method)
            at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
            at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
            at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
            - locked <0x00000000c0a7dbf0> (a java.lang.Object)
    
       Locked ownable synchronizers:
            - None
    
    "DefaultQuartzScheduler_Worker-9" #75 prio=5 os_prio=0 tid=0x00007fb41d3b2000 nid=0x1b520 waiting on condition [0x00007fb3daff2000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
            at java.lang.Thread.sleep(Native Method)
            at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
            at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
            at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
            - locked <0x00000000c0a7a470> (a java.lang.Object)
    
       Locked ownable synchronizers:
            - None
    
    ...
    
    "DefaultQuartzScheduler_Worker-2" #68 prio=5 os_prio=0 tid=0x00007fb41d3c8800 nid=0x1b519 waiting on condition [0x00007fb3da473000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
            at java.lang.Thread.sleep(Native Method)
            at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
            at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
            at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
            - locked <0x00000000c0a7a7b0> (a java.lang.Object)
    
       Locked ownable synchronizers:
            - None
    
    "DefaultQuartzScheduler_Worker-1" #67 prio=5 os_prio=0 tid=0x00007fb41d3cc800 nid=0x1b518 waiting on condition [0x00007fb3da372000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
            at java.lang.Thread.sleep(Native Method)
            at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:889)
            at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
            at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
            - locked <0x00000000c0a7dd90> (a java.lang.Object)
    
       Locked ownable synchronizers:
            - None
    ```
    
    The thread dump above shows that all of the worker threads are stuck at https://github.com/apache/zeppelin/blob/v0.7.3/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Notebook.java#L889.
    
    One way to reproduce this kind of issue is to create a paragraph whose status is "READY" and set it to "disable run". The paragraph then stays "READY" permanently and `note.isTerminated()` never returns `true`.
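    
    As a rough illustration of this failure mode (not the actual Zeppelin or Quartz code), the sketch below fills a fixed-size thread pool with jobs that poll a flag which never becomes true, then checks whether one more job gets a chance to run. The class `WorkerPoolExhaustionSketch`, the method `newJobRuns`, and the `terminated` flag standing in for `note.isTerminated()` are all hypothetical names chosen for this example, and a plain `ExecutorService` stands in for the Quartz worker pool.
    
    ```java
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    import java.util.concurrent.atomic.AtomicBoolean;
    
    class WorkerPoolExhaustionSketch {
    
      /**
       * Occupies every worker with a job that polls a flag which is never set,
       * then submits one more job and reports whether it ran within the timeout.
       */
      static boolean newJobRuns(int workers, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        // Assumption: stands in for note.isTerminated(); nothing ever sets it to true.
        AtomicBoolean terminated = new AtomicBoolean(false);
        try {
          for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
              // Same shape as the polling loop described in the PR text.
              while (!terminated.get()) {
                try {
                  Thread.sleep(10);
                } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                  return; // lets shutdownNow() release the worker
                }
              }
            });
          }
          // All workers are busy, so this job can only wait in the queue.
          Future<?> next = pool.submit(() -> { });
          try {
            next.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
          } catch (TimeoutException e) {
            return false;
          } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
          }
        } finally {
          pool.shutdownNow();
        }
      }
    
      public static void main(String[] args) {
        System.out.println("new job ran: " + newJobRuns(10, 500));
      }
    }
    ```
    
    With every worker occupied by a polling job, the extra job should stay queued until the timeout expires, which mirrors the state captured in the thread dump.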
    
    To fix this issue, this PR makes the following two improvements:
    
    1. Remove the unnecessary `while (!note.isTerminated()) { ... }` block, because all of the paragraphs have already finished executing by the time `note.runAll()` returns.
    2. Skip the cron execution if there is a running or pending paragraph. This prevents the Zeppelin cron scheduler from getting stuck on a long-running paragraph whose execution time exceeds the cron interval.
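    
    The two changes above can be sketched roughly as follows. This is a simplified stand-in rather than the code in this PR: the class `CronSkipSketch`, the `Status` enum, and the methods `hasRunningOrPending` and `onCronTick` are hypothetical, and a `Runnable` takes the place of `note.runAll()`. Only the skip message text is taken from the PR description.
    
    ```java
    import java.util.Arrays;
    import java.util.List;
    
    class CronSkipSketch {
    
      /** Hypothetical paragraph states, modelled only as far as this example needs. */
      enum Status { READY, PENDING, RUNNING, FINISHED }
    
      /** Returns true when any paragraph is still running or waiting to run. */
      static boolean hasRunningOrPending(List<Status> paragraphs) {
        for (Status s : paragraphs) {
          if (s == Status.RUNNING || s == Status.PENDING) {
            return true;
          }
        }
        return false;
      }
    
      /**
       * One cron tick: skip when a previous run is still in progress; otherwise
       * run the note once and return, with no polling loop afterwards.
       */
      static String onCronTick(String noteId, List<Status> paragraphs, Runnable runAll) {
        if (hasRunningOrPending(paragraphs)) {
          return "execution of the cron job is skipped because there is a running "
              + "or pending paragraph (note id: " + noteId + ")";
        }
        runAll.run(); // assumption: stands in for note.runAll()
        return "started";
      }
    
      public static void main(String[] args) {
        Runnable runAll = () -> System.out.println("running all paragraphs");
        // A paragraph stuck in READY no longer blocks the worker thread.
        System.out.println(onCronTick("XXXXXXXXX",
            Arrays.asList(Status.READY, Status.FINISHED), runAll));
        // A still-running paragraph causes this tick to be skipped.
        System.out.println(onCronTick("XXXXXXXXX",
            Arrays.asList(Status.RUNNING, Status.READY), runAll));
      }
    }
    ```
    
    In this sketch the worker thread is released as soon as the tick returns, so a note that overruns its schedule costs at most one skipped tick instead of one blocked worker per tick.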
    
    ### What type of PR is it?
    [Improvement]
    
    ### Todos
    
    ### What is the Jira issue?
    https://issues.apache.org/jira/browse/ZEPPELIN-3077
    
    ### How should this be tested?
    * Tested manually.
        1. The cron scheduler does not get stuck if there is a paragraph whose status is "READY" and which is set to "disable run".
        2. The following message is printed to the log file when a cron job is launched while the previous cron job is still running.
            * `execution of the cron job is skipped because there is a running or pending paragraph (note id: XXXXXXXXX)`
    
    ### Screenshots (if appropriate)
    
    ### Questions:
    * Do the license files need an update? No.
    * Are there breaking changes for older versions? No.
    * Does this need documentation? No.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kjmrknsn/zeppelin ZEPPELIN-3077

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zeppelin/pull/2687.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2687
    
----
commit 79824833fb3f41581e41cfa9f07839f5466c5d33
Author: Keiji Yoshida <kjmrk...@gmail.com>
Date:   2017-11-27T10:48:47Z

    [ZEPPELIN-3077] Cron scheduler is easy to get stuck when one of the cron 
jobs takes long time or gets stuck

----

