[jira] [Resolved] (HADOOP-18217) shutdownhookmanager should not be multithreaded (deadlock possible)

Steve Loughran (Jira) Wed, 13 Jul 2022 04:45:18 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Steve Loughran resolved HADOOP-18217.
-------------------------------------
    Fix Version/s: 3.4.0
                   3.3.9
       Resolution: Fixed

committed to branch-3.3 and above

> shutdownhookmanager should not be multithreaded (deadlock possible)
> -------------------------------------------------------------------
>
>                 Key: HADOOP-18217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 2.10.1
>         Environment: linux, windows, any version
>            Reporter: Catherinot Remi
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.3.9
>
>         Attachments: wtf.java
>
>          Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> the ShutdownHookManager class uses an executor to run hooks to have a 
> "timeout" notion around them. It does this using a single threaded executor. 
> It can leads to deadlock leaving a never-shutting-down JVM with this 
> execution flow:
>  * JVM need to exit (only daemon threads remaining or someone called 
> System.exit)
>  * ShutdowHookManager kicks in
>  * SHMngr executor start running some hooks
>  * SHMngr executor thread kicks in and, as a side effect, run some code from 
> one of the hook that calls System.exit (as a side effect from an external lib 
> for example)
>  * the executor thread is waiting for a lock because another thread already 
> entered System.exit and has its internal lock, so the executor never returns.
>  * SHMngr never returns
>  * 1st call to System.exit never returns
>  * JVM stuck
>  
> using an executor with a single thread does "fake" timeouts (the task keeps 
> running, you can interrupt it but until it stumble upon some piece of code 
> that is interruptible (like an IO) it will keep running) especially since the 
> executor is a single threaded one. So it has this bug for example :
>  * caller submit 1st hook (bad one that would need 1 hour of runtime and that 
> cannot be interrupted)
>  * executor start 1st hook
>  * caller of the future 1st hook result timeout
>  * caller submit 2nd hook
>  * bug : 1 hook still running, 2nd hook triggers a timeout but never got the 
> chance to run anyway, so 1st faulty hook makes it impossible for any other 
> hook to have a chance to run, so running hooks in a single separate thread 
> does not allow to run other hooks in parallel to long ones.
>  
> If we really really want to timeout the JVM shutdown, even accepting maybe 
> dirty shutdown, it should rather handle the hooks inside the initial thread 
> (not spawning new one(s) so not triggering the deadlock described on the 1st 
> place) and if a timeout was configured, only spawn a single parallel daemon 
> thread that sleeps the timeout delay, and then use Runtime.halt (which bypass 
> the hook system so should not trigger the deadlock). If the normal 
> System.exit ends before the timeout delay everything is fine. If the 
> System.exit took to much time, the JVM is killed and so the reason why this 
> multithreaded shutdown hook implementation was created is satisfied (avoding 
> having hanging JVMs)
>  
> Had the bug with both oracle and open jdk builds, all in 1.8 major version. 
> hadoop 2.6 and 2.7 did not have the issue because they do not run hooks in 
> another thread
>  
> Another solution is of course to configure the timeout AND to have as many 
> threads as needed to run the hooks so to have at least some gain to offset 
> the pain of the dealock scenario
>  
> EDIT: added some logs and reproduced the problem. in fact it is located after 
> triggering all the hook entries and before shutting down the executor. 
> Current code, after running the hooks, creates a new Configuration object and 
> reads the configured timeout from it, applies this timeout to shutdown the 
> executor. I sometimes run with a classloader doing remote classloading, 
> Configuration loads its content using this classloader, so when shutting down 
> the JVM and some network error occurs the classloader fails to load the 
> ressources needed by Configuration. So the code crash before shutting down 
> the executor and ends up inside the thread's default uncaught throwable 
> handler, which was calling System.exit, so got stuck, so shutting down the 
> executor never returned, so does the JVM.
> So, forget about the halt stuff (even if it is a last ressort very robust 
> safety net). Still I'll do a small adjustement to the final executor shutdown 
> code to be slightly more robust to even the strangest exceptions/errors it 
> encounters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Resolved] (HADOOP-18217) shutdownhookmanager should not be multithreaded (deadlock possible)

Reply via email to