On Thu, Jan 25, 2018 at 9:28 AM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Wed, Jan 24, 2018 at 12:13 PM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
>> On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan <p...@bowt.ie> wrote:
>>> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>>>
>>> 1. The problem of fork failure causing nbtsort.c to wait forever is a
>>> real problem. Sure enough, the coding pattern within
>>> _bt_leader_heapscan() can cause us to wait forever even with commit
>>> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
>>> consequence of the patch not using tuple queues (it uses the new
>>> tuplesort sharing thing instead).
>>
>> Just curious: does the attached also help?
>
> I can still reproduce the problem without the fix I described (which
> does work), using your patch instead.
>
> Offhand, I suspect that the way you set ParallelMessagePending may not
> always leave it set when it should be.

Here's a version that works, and a minimal repro test module thing.
Without 0003 applied, it hangs. With 0003 applied, it does this:

postgres=# call test_fork_failure();
CALL
postgres=# call test_fork_failure();
CALL
postgres=# call test_fork_failure();
ERROR:  lost connection to parallel worker
postgres=# call test_fork_failure();
ERROR:  lost connection to parallel worker

I won't be surprised if 0003 is judged to be a horrendous abuse of the
interrupt system, but these patches might at least be useful for
understanding the problem.

--
Thomas Munro
http://www.enterprisedb.com

0001-Chaos-monkey-fork-failure.patch
0002-A-simple-test-module-that-hangs-on-fork-failure.patch
0003-Pessimistic-fork-failure-detector.patch
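
For readers without the attachments to hand, the hang discussed in this
thread can be illustrated with a toy model. This is only a sketch and is
not taken from any of the patches: WorkerState, WaitResult, MAX_POLLS,
tick(), wait_naive() and wait_checked() are all hypothetical names, and
no PostgreSQL API is used. It assumes a leader that waits for every
worker to report completion, and it contrasts a wait loop that only
counts completions with one that also looks for workers that never
started.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical worker states for this sketch; not PostgreSQL types. */
typedef enum { WORKER_RUNNING, WORKER_DONE, WORKER_FORK_FAILED } WorkerState;

/* Outcome of a leader wait loop in this toy model. */
typedef enum { WAIT_OK, WAIT_LOST_WORKER, WAIT_GAVE_UP } WaitResult;

/* Stand-in for "forever", so that the hanging case still terminates. */
#define MAX_POLLS 1000

/* True once every worker has reported completion. */
bool
all_done(const WorkerState *workers, int nworkers)
{
    assert(nworkers > 0);
    for (int i = 0; i < nworkers; i++)
        if (workers[i] != WORKER_DONE)
            return false;
    return true;
}

/* One simulation step: the first still-running worker finishes. */
void
tick(WorkerState *workers, int nworkers)
{
    for (int i = 0; i < nworkers; i++)
    {
        if (workers[i] == WORKER_RUNNING)
        {
            workers[i] = WORKER_DONE;
            return;
        }
    }
}

/*
 * Naive leader: waits until all workers are done and never asks whether
 * a worker failed to start.  A fork failure leaves it polling until the
 * artificial MAX_POLLS limit, which models the hang.
 */
WaitResult
wait_naive(WorkerState *workers, int nworkers)
{
    for (int polls = 0; polls < MAX_POLLS; polls++)
    {
        if (all_done(workers, nworkers))
            return WAIT_OK;
        tick(workers, nworkers);
    }
    return WAIT_GAVE_UP;
}

/*
 * Checked leader: on every wakeup it first looks for a worker that
 * never started, and reports that instead of continuing to wait.
 */
WaitResult
wait_checked(WorkerState *workers, int nworkers)
{
    for (int polls = 0; polls < MAX_POLLS; polls++)
    {
        for (int i = 0; i < nworkers; i++)
            if (workers[i] == WORKER_FORK_FAILED)
                return WAIT_LOST_WORKER;
        if (all_done(workers, nworkers))
            return WAIT_OK;
        tick(workers, nworkers);
    }
    return WAIT_GAVE_UP;
}
```

In this model, WAIT_GAVE_UP plays the role of the hang seen without
0003, and WAIT_LOST_WORKER plays the role of the "lost connection to
parallel worker" error. When no worker fails to start, both loops
return WAIT_OK, which matches the calls above that simply print CALL.
How the real patches detect the failure is not shown here.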