On Wed, Sep 16, 2020 at 3:33 PM Robert Haas <robertmh...@gmail.com> wrote: > I don't mind a loop, but that one looks broken. We have to clear the > bit before we call the function that processes that type of barrier. > Otherwise, if we succeed in absorbing the barrier but a new instance > of the same barrier arrives meanwhile, we'll fail to realize that we > need to absorb the new one.
Here's a new version of the patch for allowing errors in barrier-handling functions and/or rejection of barriers by those functions. I think this responds to all of the previous review comments from Andres. Also, here is an 0002 which is a handy bit of test code that I wrote. It's not for commit, but it is useful for finding bugs. In addition to improving 0001 based on the review comments, I also tried to write a better commit message for it, but it might still be possible to do better there. It's a bit hard to explain the idea in the abstract. For ALTER SYSTEM READ ONLY, the idea is that a process with an XID -- and possibly a bunch of sub-XIDs, and possibly while idle-in-transaction -- can elect to FATAL rather than absorbing the barrier. I suspect for other barrier types we might have certain (hopefully short) stretches of code where a barrier of a particular type can't be absorbed because we're in the middle of doing something that relies on the previous value of whatever state is protected by the barrier. Holding off interrupts in those stretches of code would prevent the barrier from being absorbed, but would also prevent query cancel, backend termination, and absorption of other barrier types, so it seems possible that just allowing the barrier-absorption function for a barrier of that type to just refuse the barrier until after the backend exits the critical section of code will work out better. Just for kicks, I tried running 'make installcheck-parallel' while emitting placeholder barriers every 0.05 s after altering the barrier-absorption function to always return false, just to see how ugly that was. In round figures, it made it take 24 s vs. 21 s, so it's actually not that bad. However, it all depends on how many times you hit CHECK_FOR_INTERRUPTS() how quickly, so it's easy to imagine that the effect might be very non-uniform. That is, if you can get the code to be running a tight loop that does little real work but does CHECK_FOR_INTERRUPTS() while refusing to absorb outstanding type of barrier, it will probably suck. Therefore, I'm inclined to think that the fairly strong cautionary logic in the patch is reasonable, but perhaps it can be better worded somehow. Thoughts welcome. I have not rebased the remainder of the patch series over these two. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
0001-Allow-for-error-or-refusal-while-absorbing-barriers.patch
Description: Binary data
0002-Test-module-for-barriers.-NOT-FOR-COMMIT.patch
Description: Binary data