While reviewing an amcheck patch of Andrey Borodin's, I noticed that it had a problem that I tied back to btree_xlog_split()'s loose approach to locking buffers compared to the primary [1] (i.e. compared to _bt_split()). This created a problem for the proposed new check that is not unlike the problem that backwards scans running on standbys had with "concurrent" page deletions -- that was a legitimate bug that was fixed in commit 9a9db08a.
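To make the discrepancy concrete, here is a rough sketch of the tail end of btree_xlog_split() as I read it, with the page reconstruction, the non-leaf child page handling, and error handling all elided (variable names are approximate, not necessarily what nbtxlog.c actually uses):

/*
 * Both halves of the split get unlocked here, before we've touched the
 * old right sibling at all ...
 */
if (BufferIsValid(buf))
    UnlockReleaseBuffer(buf);       /* original page (left half) */
UnlockReleaseBuffer(rbuf);          /* new right half */

/*
 * ... so there is a window in which the old right sibling's left link
 * still points to the original page, skipping the new right half, while
 * no same-level page is locked.  _bt_split() never exposes that
 * intermediate state on the primary, because it holds all of the
 * same-level locks across the corresponding updates.
 */
if (spagenumber != P_NONE)
{
    Buffer      sbuf;

    if (XLogReadBufferForRedo(record, 2, &sbuf) == BLK_NEEDS_REDO)
    {
        Page            spage = BufferGetPage(sbuf);
        BTPageOpaque    spageop = (BTPageOpaque) PageGetSpecialPointer(spage);

        /* Make old right sibling point back at the new right half */
        spageop->btpo_prev = rightpagenumber;

        PageSetLSN(spage, lsn);
        MarkBufferDirty(sbuf);
    }
    if (BufferIsValid(sbuf))
        UnlockReleaseBuffer(sbuf);
}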
I'm starting to think that we should bite the bullet and not release any of the same-level locks within btree_xlog_split() until the very end, including the lock on the existing right sibling page whose left link we need to update. In other words, "couple" the locks in the manner of _bt_split(), though only for same-level pages (just like btree_xlog_unlink_page() after commit 9a9db08a). That would make it okay to commit Andrey's patch, but it also seems like a good idea on general principle. (Note that I'm not proposing cross-level lock coupling on replicas, which seems unnecessary. It isn't really possible anyway, because cross-level locks span multiple atomic actions/WAL records on the primary.)

Presumably the lock coupling on standbys will have some overhead, but that seems essentially the same as the overhead on the primary. The obvious case to test (to evaluate the overhead of being more conservative in btree_xlog_split()) is a workload that continually splits the rightmost page. That's not actually relevant, though, since there is no right sibling to update when we split the rightmost page.

My sense is that the current approach to locking taken within btree_xlog_split() is something of an accident, not something that was ever pursued as a deliberate optimization for the REDO routine. Commit 3bbf668d certainly creates that impression. But I might have missed something.

[1] postgr.es/m/CAH2-Wzm3=SLwu5=z8qG6UBpCemZW3dUNXWbX-cpXCgb=y3o...@mail.gmail.com

--
Peter Geoghegan