On Tue, Jan 16, 2018 at 6:24 AM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Fri, Jan 12, 2018 at 10:28 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> More comments:
>
> Attached patch has all open issues worked through, including those
> that I respond to or comment on below, as well as the other feedback
> from your previous e-mails. Note also that I fixed the issue that Amit
> raised,
>
I could still reproduce it. I think the way you have fixed it has a race condition. In _bt_parallel_scan_and_sort(), the value of brokenhotchain is set after you signal the leader that the worker is done (by incrementing workersFinished). The leader is then free to make its decision based on the current shared state, which can give it the wrong value. Similarly, I think the values of havedead and reltuples can also be wrong.

You neither seem to have fixed nor responded to the second problem mentioned in my email upthread [1]. To reiterate, the problem is that we can't assume that the workers we have launched will always start and finish. It is possible that the postmaster fails to start a worker due to a fork failure. In such conditions, tuplesort_leader_wait will hang indefinitely, because it waits for the workersFinished count to become equal to the number of launched workers (+1 if the leader participates), which will never happen. Am I missing something due to which this won't be a problem?

Now, one argument is that such a problem can also happen in a parallel query, so it is not the responsibility of this patch to solve it. However, we already have a patch to solve it (though there are review comments in the proposed patch that need to be addressed), and this patch adds a new code path with similar symptoms which can't be fixed by that already proposed patch.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BizMyxzFD6k81Deyar35YJ5qdpbRTUp9cQvo%2BniQom7Q%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com