On Fri, 12 May 2023 15:24:19 GMT, Volker Simonis <simo...@openjdk.org> wrote:

> Since JDK13, executing commands in a sub-process defaults to the so called 
> `POSIX_SPAWN` launching mechanism (i.e. 
> `-Djdk.lang.Process.launchMechanism=POSIX_SPAWN`) on Linux. This works by 
> using `posix_spawn(3)` to firstly start a tiny helper program called 
> `jspawnhelper` in a subprocess. In a second step, `jspawnhelper` reads the 
> command data from the parent Java process over a Unix pipe and finally 
> executes (i.e. `execvp(3)`) the requested command.
> 
> In cases where the parent process terminates abnormally before `jspawnhelper` 
> has read all the expected data from the pipe, `jspawnhelper` will block 
> indefinitely on the reading end of the pipe. This is especially harmful if 
> the parent process had open sockets, because in that case, the forked 
> `jspawnhelper` process will inherit them and keep all the corresponding ports 
> open effectively preventing other processes to bind to them. Notice that this 
> is not an abstract scenario. We've observed this regularly in production with 
> services which couldn't be restarted after a crash after migrating to JDK 17.
> 
> The fix of the issue is rather trivial. `jspawnhelper` has to close its 
> writing end of the pipe which connects it with the parent Java process 
> *before* starting to read from that pipe such that reads from the pipe can 
> immediately return with EOF if the parent process terminates abnormally.
> 
> Also did some cleanup:
>  - in `jspawnhelper.c` correctly chek the result of `readFully()`
>  - in `ProcessImpl_md.c` use `restartableWrite()` instead of `write()` to 
> write the command data from the parent process to `jspawnhelper`
>  - in `childproc.{c,h}` remove remaining traces of `MODE_CLONE` because 
> that's long gone.

> > > Looks ok.
> > > Is it practical to write a test for this situation? Can I assume you've 
> > > validated the improvement as a remedy for the observed hangs?
> > 
> > 
> > I thought about a test but couldn't come up with a practical way to write 
> > it. The JVM has to exit in the time frame after the `posix_spawn()` and 
> > before `jspawnhelper` has read its data from the parent. In production this 
> > usually happens due to memory constraints on the particular host which let 
> > the OOM-killer kill the JVM process because it is the biggest memory 
> > consumer.
> 
> A regression test would be good.
> 
> This can be very simply a runtime switch that kills the parent process at 
> vulnerable times. See my example here:
> 
> [master...tstuefe:jdk:test-for-parent-premature-death](https://github.com/openjdk/jdk/compare/master...tstuefe:jdk:test-for-parent-premature-death)
> 
> Using that (KILLTEST=1), I was able to reproduce your problem with the 
> hanging child. Using this to build a jtreg test is not hard.
> 
> Cheers, Thomas

Do you really think it makes sense to add such test code to the native runtime 
libraries?

I'm not sure. What do others think?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13956#issuecomment-1548163835

Reply via email to