[ https://issues.apache.org/jira/browse/HADOOP-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran resolved HADOOP-18217. ------------------------------------- Fix Version/s: 3.4.0 3.3.9 Resolution: Fixed committed to branch-3.3 and above > shutdownhookmanager should not be multithreaded (deadlock possible) > ------------------------------------------------------------------- > > Key: HADOOP-18217 > URL: https://issues.apache.org/jira/browse/HADOOP-18217 > Project: Hadoop Common > Issue Type: Bug > Components: util > Affects Versions: 2.10.1 > Environment: linux, windows, any version > Reporter: Catherinot Remi > Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0, 3.3.9 > > Attachments: wtf.java > > Time Spent: 4h 40m > Remaining Estimate: 0h > > the ShutdownHookManager class uses an executor to run hooks to have a > "timeout" notion around them. It does this using a single threaded executor. > It can leads to deadlock leaving a never-shutting-down JVM with this > execution flow: > * JVM need to exit (only daemon threads remaining or someone called > System.exit) > * ShutdowHookManager kicks in > * SHMngr executor start running some hooks > * SHMngr executor thread kicks in and, as a side effect, run some code from > one of the hook that calls System.exit (as a side effect from an external lib > for example) > * the executor thread is waiting for a lock because another thread already > entered System.exit and has its internal lock, so the executor never returns. > * SHMngr never returns > * 1st call to System.exit never returns > * JVM stuck > > using an executor with a single thread does "fake" timeouts (the task keeps > running, you can interrupt it but until it stumble upon some piece of code > that is interruptible (like an IO) it will keep running) especially since the > executor is a single threaded one. So it has this bug for example : > * caller submit 1st hook (bad one that would need 1 hour of runtime and that > cannot be interrupted) > * executor start 1st hook > * caller of the future 1st hook result timeout > * caller submit 2nd hook > * bug : 1 hook still running, 2nd hook triggers a timeout but never got the > chance to run anyway, so 1st faulty hook makes it impossible for any other > hook to have a chance to run, so running hooks in a single separate thread > does not allow to run other hooks in parallel to long ones. > > If we really really want to timeout the JVM shutdown, even accepting maybe > dirty shutdown, it should rather handle the hooks inside the initial thread > (not spawning new one(s) so not triggering the deadlock described on the 1st > place) and if a timeout was configured, only spawn a single parallel daemon > thread that sleeps the timeout delay, and then use Runtime.halt (which bypass > the hook system so should not trigger the deadlock). If the normal > System.exit ends before the timeout delay everything is fine. If the > System.exit took to much time, the JVM is killed and so the reason why this > multithreaded shutdown hook implementation was created is satisfied (avoding > having hanging JVMs) > > Had the bug with both oracle and open jdk builds, all in 1.8 major version. > hadoop 2.6 and 2.7 did not have the issue because they do not run hooks in > another thread > > Another solution is of course to configure the timeout AND to have as many > threads as needed to run the hooks so to have at least some gain to offset > the pain of the dealock scenario > > EDIT: added some logs and reproduced the problem. in fact it is located after > triggering all the hook entries and before shutting down the executor. > Current code, after running the hooks, creates a new Configuration object and > reads the configured timeout from it, applies this timeout to shutdown the > executor. I sometimes run with a classloader doing remote classloading, > Configuration loads its content using this classloader, so when shutting down > the JVM and some network error occurs the classloader fails to load the > ressources needed by Configuration. So the code crash before shutting down > the executor and ends up inside the thread's default uncaught throwable > handler, which was calling System.exit, so got stuck, so shutting down the > executor never returned, so does the JVM. > So, forget about the halt stuff (even if it is a last ressort very robust > safety net). Still I'll do a small adjustement to the final executor shutdown > code to be slightly more robust to even the strangest exceptions/errors it > encounters. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-dev-h...@hadoop.apache.org