Hi,

On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> On 12/06/2025 4:57 pm, Andres Freund wrote:
> > The problem appears to be in that switch between "when submitted, by the
> > IO worker" and "then again by the backend". It's not concurrent access in
> > the sense of two processes writing to the same value, it's that when
> > switching from the worker updating ->distilled_result to the issuer
> > looking at that, the issuer didn't ensure that no outdated version of
> > ->distilled_result could be used.
> >
> > Basically, the problem is that the worker would
> >
> > 1) set ->distilled_result
> > 2) perform a write memory barrier
> > 3) set ->state to COMPLETED_SHARED
> >
> > and then the issuer of the IO would:
> >
> > 4) check ->state is COMPLETED_SHARED
> > 5) use ->distilled_result
> >
> > The problem is that there currently is no barrier between 4 & 5, which
> > means an outdated ->distilled_result could be used.
> >
> >
> > This also explains why the issue looked so weird - eventually, after
> > fprintfs, after a core dump, etc, the updated ->distilled_result would
> > "arrive" in the issuing process, and suddenly look correct.
>
> Thank you very much for the explanation.
> Everything seems so simple after the explanation that you can hardly
> believe you previously thought such behavior could only be caused by
> "black magic" or an "OS bug" :)
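To make the quoted steps 1)-5) concrete, here is a minimal illustrative
sketch. The struct, constant, and function names below are invented for the
example and do not reflect the actual AIO code; only pg_write_barrier() and
pg_read_barrier() (from port/atomics.h) are the real PostgreSQL primitives.

```
#include "postgres.h"
#include "port/atomics.h"

/* invented for illustration; not the real PgAioHandle layout */
typedef struct DemoIoHandle
{
    int     state;              /* stands in for ioh->state */
    int     distilled_result;   /* stands in for ioh->distilled_result */
} DemoIoHandle;

#define DEMO_COMPLETED_SHARED 1

/* IO worker side: steps 1)-3) */
static void
demo_worker_complete(DemoIoHandle *ioh, int result)
{
    ioh->distilled_result = result;         /* 1) publish the result */
    pg_write_barrier();                     /* 2) order result before state */
    ioh->state = DEMO_COMPLETED_SHARED;     /* 3) advertise completion */
}

/* issuer side: steps 4)-5) */
static bool
demo_issuer_consume(DemoIoHandle *ioh, int *result)
{
    if (ioh->state != DEMO_COMPLETED_SHARED)    /* 4) check completion */
        return false;

    /*
     * This is the barrier that was missing "between 4 & 5": without it the
     * issuer may use a distilled_result loaded (or kept in a register)
     * before the state check, i.e. an outdated value.
     */
    pg_read_barrier();

    *result = ioh->distilled_result;            /* 5) use the result */
    return true;
}
```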
Indeed. For a while I was really starting to doubt my sanity - not helped by
the fact that I'd lose most of the mental context between reproductions (the
problem did not re-occur from Saturday to Wednesday morning...). What finally
made it clearer was moving the failing assertion earlier (into
pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as
that took out a lot of the time in which the problem could occur.


> Certainly using an outdated result can explain such behavior.
> But in which particular place do we lose the read barrier between 4 and 5?
> I see `pgaio_io_wait`, which I expect is called by the backend to wait for
> completion of the IO.
> And it calls `pgaio_io_was_recycled` to get the state, which in turn
> enforces a read barrier:
> ```
> bool
> pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
>                       PgAioHandleState *state)
> {
>     *state = ioh->state;
>     pg_read_barrier();
>
>     return ioh->generation != ref_generation;
> }
> ```

Yes, I don't think that path has the issue. As far as I can tell, the problem
only ever happens if we end up reclaiming an IO while waiting for a free IO
handle. The known problematic path is

  pgaio_io_acquire() -> pgaio_io_wait_for_free() -> pgaio_io_reclaim()

In that path we don't use pgaio_io_was_recycled(). I couldn't find any other
path with the same issue [1].

The issue only happening while waiting for free IO handles is presumably the
reason why it happened in 027_stream_regress.pl: due to the small
shared_buffers used for the test, io_max_concurrency is 1, which means we
constantly need to wait for a free IO handle.

Greetings,

Andres Freund

[1] I think we are theoretically missing a memory barrier in the worker when
starting to process an IO, but the lwlock acquired to get the IO to process
provides that. We should probably add a comment about that.
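For illustration only, the shape of a fix in the problematic path above is
the same pattern as elsewhere: pair the state check with a read barrier
before looking at ->distilled_result, the way pgaio_io_was_recycled()
already does. A hypothetical, heavily simplified sketch, reusing the
DemoIoHandle type from the earlier example (the real
pgaio_io_wait_for_free()/pgaio_io_reclaim() code does much more than this):

```
/*
 * Hypothetical sketch of reclaiming a completed IO while waiting for a
 * free handle. It only shows where the read barrier is needed relative to
 * the state check; it is not the actual pgaio_io_reclaim() code.
 */
static void
sketch_reclaim_completed_io(DemoIoHandle *ioh)
{
    if (ioh->state == DEMO_COMPLETED_SHARED)
    {
        /*
         * Pairs with the worker's pg_write_barrier(): ensures the result
         * published before the state change is what we see below,
         * mirroring what pgaio_io_was_recycled() does.
         */
        pg_read_barrier();

        /* ... completion processing using ioh->distilled_result ... */
    }
}
```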