Hi,

On 2020-06-04 15:13:29 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > On 2020-06-04 14:50:40 -0400, Tom Lane wrote:
> >> 2. The computed completePasses value would go backwards. I bet
> >> that wouldn't matter too much either, or at least we could teach
> >> BgBufferSync to cope. (I notice the comments therein suggest that
> >> it is already designed to cope with completePasses wrapping around,
> >> so maybe nothing needs to be done.)
>
> > If we're not concerned about that, then we can remove the
> > atomic-inside-spinlock, I think. The only reason for that right now is
> > to avoid assuming a wrong pass number.
>
> Hmm. That might be a less-invasive path to a solution. I can take
> a look, if you don't want to.
First, I think it would be problematic:

	/*
	 * Find out where the freelist clock sweep currently is, and how many
	 * buffer allocations have happened since our last call.
	 */
	strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
...
	/*
	 * Compute strategy_delta = how many buffers have been scanned by the
	 * clock sweep since last time. If first time through, assume none. Then
	 * see if we are still ahead of the clock sweep, and if so, how many
	 * buffers we could scan before we'd catch up with it and "lap" it. Note:
	 * weird-looking coding of xxx_passes comparisons are to avoid bogus
	 * behavior when the passes counts wrap around.
	 */
	if (saved_info_valid)
	{
		int32		passes_delta = strategy_passes - prev_strategy_passes;

		strategy_delta = strategy_buf_id - prev_strategy_buf_id;
		strategy_delta += (long) passes_delta * NBuffers;

		Assert(strategy_delta >= 0);

ISTM that if we can get an out-of-sync strategy_passes and
strategy_buf_id we'll end up with a pretty wrong strategy_delta (e.g.
reading nextVictimBuffer just after a wraparound, but completePasses
from before the corresponding increment, would yield a large negative
delta). Which, I think, can cause us to reset bgwriter's position:

	else
	{
		/*
		 * We're behind, so skip forward to the strategy point and start
		 * cleaning from there.
		 */
#ifdef BGW_DEBUG
		elog(DEBUG2, "bgwriter behind: bgw %u-%u strategy %u-%u delta=%ld",
			 next_passes, next_to_clean,
			 strategy_passes, strategy_buf_id,
			 strategy_delta);
#endif
		next_to_clean = strategy_buf_id;
		next_passes = strategy_passes;
		bufs_to_lap = NBuffers;
	}

While I think that the whole logic in BgBufferSync doesn't make a whole
lot of sense, it does seem to me this has a fair potential to make it
worse. In a scenario with a decent cache hit ratio (leading to high
usagecounts) and a not-that-large NBuffers, we can end up doing quite a
few passes (as in many per second), so it might not be that hard to hit
this.

I am not immediately coming up with a cheap solution that doesn't do
the atomics-within-spinlock thing. The best I can come up with is using
a 64bit atomic, with the upper 32bit containing the number of passes,
and the lower 32bit containing the current buffer. The lower 32bit /
the buffer would be handled like it currently is, i.e. we "try" to keep
it below NBuffers, so % is only needed on the "cold" path. That'd just
add a 64->32 bit cast in the hot path, which shouldn't be measurable.
But it'd regress platforms without 64bit atomics substantially. A rough
sketch of what I mean is below.

We could obviously also just rewrite the BgBufferSync() logic, so it
doesn't care about things like "lapping", but that's not an easy
change.

So the best I can really suggest, unless we were to agree on atomics
being ok inside spinlocks, is probably to just replace the spinlock
with an lwlock. That'd perhaps cause a small slowdown for a few cases,
but it'd make workloads that e.g. use the freelist a lot (e.g. when
tables are dropped regularly) scale better.

Greetings,

Andres Freund
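
To make the packed-counter idea a bit more concrete, here's a rough,
untested sketch for freelist.c. The pg_atomic_uint64 field
victimAndCompletePasses is made up for illustration; it would replace
nextVictimBuffer and completePasses in BufferStrategyControl:

static inline uint32
ClockSweepTick(void)
{
	uint64		counter;
	uint32		victim;

	/*
	 * Hot path: bump the low 32bit half; the cast simply discards the
	 * pass count stored in the high half.
	 */
	counter = pg_atomic_fetch_add_u64(&StrategyControl->victimAndCompletePasses, 1);
	victim = (uint32) counter;

	if (victim >= NBuffers)
	{
		uint32		wrapped_victim = victim % NBuffers;

		/*
		 * Cold path: whoever landed exactly on the wraparound point folds
		 * the low half back into [0, NBuffers) and carries one into the
		 * passes half with a single CAS, so both halves always change
		 * together.  (As with the current coding, the low half could in
		 * theory overflow into the high half if this is delayed for a
		 * very long time, but that seems exceedingly unlikely.)
		 */
		if (wrapped_victim == 0)
		{
			uint64		expected = counter + 1;

			for (;;)
			{
				uint64		newval;

				newval = (((expected >> 32) + 1) << 32) |
					((uint32) expected % NBuffers);

				if (pg_atomic_compare_exchange_u64(&StrategyControl->victimAndCompletePasses,
												   &expected, newval))
					break;
				/* CAS failed: expected now holds the current value, retry */
			}
		}

		victim = wrapped_victim;
	}

	return victim;
}

StrategySyncStart() could then do a single pg_atomic_read_u64() and
split the result into passes and buffer, without needing the spinlock
for that at all.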
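
For comparison, the lwlock variant would essentially just swap the lock
calls in ClockSweepTick()'s wraparound handling. Sketch only, assuming
buffer_strategy_lock were turned into an LWLock in its own tranche
rather than the slock_t it is today; expected/originalVictim are as in
the existing code:

		if (victim == 0)
		{
			uint32		expected = originalVictim + 1;
			uint32		wrapped;
			bool		success = false;

			while (!success)
			{
				/*
				 * Looping on a CAS while holding an lwlock is acceptable,
				 * unlike doing so under a spinlock.
				 */
				LWLockAcquire(&StrategyControl->buffer_strategy_lock, LW_EXCLUSIVE);

				wrapped = expected % NBuffers;

				success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
														 &expected, wrapped);
				if (success)
					StrategyControl->completePasses++;

				LWLockRelease(&StrategyControl->buffer_strategy_lock);
			}
		}

The hot path would stay a plain 32bit fetch-add; any slowdown would be
limited to the wraparound handling and the paths that take the lock
today (freelist access and StrategySyncStart()).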