On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > Right, but what if the worker dies due to something proc_exit(1) or > something like that before calling BarrierArriveAndWait. I think this > is part of the problem we have solved in > WaitForParallelWorkersToFinish such that if the worker exits abruptly > at any point due to some reason, the system should not hang.
I have used Thomas' chaos-monkey-fork-process.patch to verify: 1. The problem of fork failure causing nbtsort.c to wait forever is a real problem. Sure enough, the coding pattern within _bt_leader_heapscan() can cause us to wait forever even with commit 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a consequence of the patch not using tuple queues (it uses the new tuplesort sharing thing instead). 2. Simply adding a single call to WaitForParallelWorkersToFinish() within _bt_leader_heapscan() before waiting on our condition variable fixes the problem -- errors are reliably propagated, and we never end up waiting forever. 3. This short term fix works just as well with parallel_leader_participation=off. At this point, my preferred solution is for someone to go implement Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems like the logical person for the job). Once that's committed, I can post a new version of the patch that uses that new infrastructure -- I'll add a call to the new function, without changing anything else. Failing that, we could actually just use WaitForParallelWorkersToFinish(). I still don't want to use a barrier, mostly because it complicates parallel_leader_participation=off, something that Amit is in agreement with [2][3]. For now, I am waiting for feedback from Robert on next steps. [1] https://postgr.es/m/CAH2-Wzm6dF=g9lywthgcqzrc4dzbe-8tv28yvg0xj8q6e4+...@mail.gmail.com [2] https://postgr.es/m/CAA4eK1LEFd28p1kw2Fst9LzgBgfMbDEq9wPh9jWFC0ye6ce62A%40mail.gmail.com [3] https://postgr.es/m/caa4ek1+a0of4m231vbgpr_0ygg_bnmrgzlib7wqde-fybsy...@mail.gmail.com -- Peter Geoghegan