On Mon, 22 May 2023 10:22:16 GMT, Volker Simonis <simo...@openjdk.org> wrote:

>> Since JDK13, executing commands in a sub-process defaults to the so called 
>> `POSIX_SPAWN` launching mechanism (i.e. 
>> `-Djdk.lang.Process.launchMechanism=POSIX_SPAWN`) on Linux. This works by 
>> using `posix_spawn(3)` to firstly start a tiny helper program called 
>> `jspawnhelper` in a subprocess. In a second step, `jspawnhelper` reads the 
>> command data from the parent Java process over a Unix pipe and finally 
>> executes (i.e. `execvp(3)`) the requested command.
>> 
>> In cases where the parent process terminates abnormally before 
>> `jspawnhelper` has read all the expected data from the pipe, `jspawnhelper` 
>> will block indefinitely on the reading end of the pipe. This is especially 
>> harmful if the parent process had open sockets, because in that case, the 
>> forked `jspawnhelper` process will inherit them and keep all the 
>> corresponding ports open effectively preventing other processes to bind to 
>> them. Notice that this is not an abstract scenario. We've observed this 
>> regularly in production with services which couldn't be restarted after a 
>> crash after migrating to JDK 17.
>> 
>> The fix of the issue is rather trivial. `jspawnhelper` has to close its 
>> writing end of the pipe which connects it with the parent Java process 
>> *before* starting to read from that pipe such that reads from the pipe can 
>> immediately return with EOF if the parent process terminates abnormally.
>> 
>> Also did some cleanup:
>>  - in `jspawnhelper.c` correctly chek the result of `readFully()`
>>  - in `ProcessImpl_md.c` use `restartableWrite()` instead of `write()` to 
>> write the command data from the parent process to `jspawnhelper`
>>  - in `childproc.{c,h}` remove remaining traces of `MODE_CLONE` because 
>> that's long gone.
>
> Volker Simonis has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains five commits:
> 
>  - Merge branch 'master' into JDK-8307990
>  - Address comments from tstuefe and RogerRiggs
>  - REALLY adding the test :)
>  - Added test case
>  - 8307990: jspawnhelper must close its writing side of a pipe before reading 
> from it

Hey, chiming on this conversation, because I believe it will fix an issue we 
encountered since, upgrading from JDK 11 to 17.
In our application we spawn a lot of processes over time and noticed that more 
and more JVM threads get stuck in the native `ProcessImpl.forkAndExec`.
<details>
<summary>Thread Dump Excerpt</summary>

Thread dump contains multiple threads in this state.

"redacted" #954 prio=5 os_prio=0 cpu=59718.38ms elapsed=286286.30s 
tid=0x00007f62f80201e0 nid=0x3c16 runnable  [0x00007f609ec56000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.ProcessImpl.forkAndExec(java.base@17.0.7/Native Method)
        at java.lang.ProcessImpl.<init>(java.base@17.0.7/ProcessImpl.java:314)
        at java.lang.ProcessImpl.start(java.base@17.0.7/ProcessImpl.java:244)
        at 
java.lang.ProcessBuilder.start(java.base@17.0.7/ProcessBuilder.java:1110)
        at 
java.lang.ProcessBuilder.start(java.base@17.0.7/ProcessBuilder.java:1073)

</details>

After some investigation we saw that we have `jspawnhelper` processes 
corresponding to the number of the stuck threads.
`/proc/*/task/{id}/syscall` shows that all of these processes are currently 
reading from the pipe.

Our theory is that the write calls from the parent thread were interrupted 
resulting in an incomplete write. 
https://github.com/openjdk/jdk/blob/69f508a2ac344eb61cef7be985348873b8265171/src/java.base/unix/native/libjava/ProcessImpl_md.c#L556-L559
This would cause the `readFully` on the `jspawnhelper` side to wait for the 
remaining data.
As the number of bytes written is not checked on the writer side it would 
continue and wait for the alive ping from the `jspawnhelper` which results in a 
deadlock.
I would be interested in a reproducer e.g. using `gdb`, but to be honest I 
wouldn't know where to start.
But regardless of whether this fixes our issue I think incomplete writes should 
be handled according to specification.
I think the `restartableWrites` in this PR do that?

We also mitigated by falling back to `VFORK` as was also mentioned in [this 
comment](https://github.com/openjdk/jdk/pull/13956#issuecomment-1547777042). 

Maybe it would be worth it to mention our bug situation in the release notes as 
well?
Thank you and please let me know if I got something wrong or there is anything 
I could provide to help!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13956#issuecomment-1558743318

Reply via email to