Hi,

On 2022-11-14 09:50:48 -0800, Peter Geoghegan wrote:
> On Mon, Nov 14, 2022 at 9:38 AM Andres Freund <and...@anarazel.de> wrote:
> > Anyway, I played a bit around with this. It's hard to hit, not because we
> > somehow won't choose such a horizon, but because we'll commonly prune the
> > earlier tuple version away due to xmax being old enough.
>
> That must be a bug, then. Since, as I said, I can't see how it could
> possibly be okay to freeze the xmin of a tuple in a HOT chain without
> also making sure that it has no earlier versions left behind.
Hard to imagine us having bugs in this code. Ahem.

I really wish I knew of a reasonably easy way to utilize coverage-guided
fuzzing on heap pruning / vacuuming.

I wonder if we ought to add an error check to heap_prepare_freeze_tuple()
against this scenario. We're working towards being more aggressive around
freezing, which will make it more likely to hit corner cases around this.

> If there are earlier versions that we have to go through to get to the
> frozen-xmin tuple (not just an LP_REDIRECT), we're going to break the HOT
> chain traversal logic in code like heap_hot_search_buffer in a rather
> obvious way.
>
> HOT chain traversal logic code will interpret the frozen xmin from the
> tuple as FrozenTransactionId (not as its raw xmin). So traversal is
> just broken in this scenario.

Which'd still be fine if the whole chain were already "fully dead".

One way I think this can happen is <= PG 13's HEAPTUPLE_DEAD handling in
lazy_scan_heap().

I now suspect that the seemingly-odd "We will advance past RECENTLY_DEAD
tuples just in case there's a DEAD one after them;" logic in
heap_prune_chain() might be required for correctness. Which IIRC we'd been
talking about getting rid of elsewhere?

<tinkers>

At least as long as correctness requires not ending up in endless loops -
indeed. We end up with lazy_scan_prune() endlessly retrying. Without a
chance to interrupt. Shouldn't there at least be a CFI somewhere?

The attached isolationtester spec has a commented-out test for this.

I think the problem partially is that the proposed verify_heapam() code is
too "aggressive" in considering things to be part of the same HOT chain -
which then means we have to be very careful about erroring out. The
attached isolationtester test triggers:

"unfrozen tuple was updated to produce a tuple at offset %u which is frozen"
"updated version at offset 3 is also the updated version of tuple at offset %u"

despite there, afaict, not being any corruption.
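[Editor's note: a minimal standalone sketch of the traversal breakage quoted
above. This is not server code - the struct, names, and XID values are made-up
stand-ins - but it mirrors the prior-xmax validation that chain traversal in
code like heap_hot_search_buffer() relies on: each chain member's raw xmin
must match the previous member's xmax, and once xmin has been overwritten
with FrozenTransactionId that match can no longer succeed.]

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
#define FrozenTransactionId ((TransactionId) 2)

/* made-up miniature tuple header, just xmin/xmax */
typedef struct
{
	TransactionId xmin;			/* inserting XID, raw */
	TransactionId xmax;			/* updating XID, or 0 */
} FakeTuple;

/* mimics the prior-xmax check done while walking a HOT chain */
static bool
chain_link_valid(const FakeTuple *tup, TransactionId priorXmax)
{
	return tup->xmin == priorXmax;
}

int
main(void)
{
	FakeTuple	oldtup = {.xmin = 1000, .xmax = 1001};
	FakeTuple	newtup = {.xmin = 1001, .xmax = 0};

	/* intact chain: successor's xmin matches predecessor's xmax */
	assert(chain_link_valid(&newtup, oldtup.xmax));

	/* now "freeze" the successor's xmin, as in the scenario above */
	newtup.xmin = FrozenTransactionId;

	/* traversal can no longer connect oldtup -> newtup */
	assert(!chain_link_valid(&newtup, oldtup.xmax));

	printf("frozen xmin breaks the prior-xmax check\n");
	return 0;
}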
Worth noting that this happens regardless of hot/non-hot updates being used
(uncomment s3ci to see).

> > It *is* possible to
> > hit, if the horizon increases between the two tuple version checks (e.g. if
> > there's another tuple in between that we check the visibility of).
>
> I suppose that that's the detail that "protects" us, then -- that
> would explain the apparent lack of problems in the field. Your
> sequence requires 3 sessions, not just 2.

One important protection right now is that vacuumlazy.c uses a more
pessimistic horizon than pruneheap.c. Even if visibility determinations
within pruning recompute the horizon, vacuumlazy.c won't freeze based on
the advanced horizon.

I don't quite know where we'd best put a comment with a warning about this
fact.

Greetings,

Andres Freund
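[Editor's note: an illustrative standalone sketch of the protection described
above - again not server code, and the XIDs are invented. Pruning may judge
removability against a horizon that has advanced mid-scan, while freezing in
vacuumlazy.c only uses the OldestXmin computed once at the start, so a tuple
can be prunable under the advanced horizon yet never freeze-eligible under
the pessimistic one.]

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

int
main(void)
{
	TransactionId vacuum_oldest_xmin = 1000;	/* computed once up front */
	TransactionId pruning_horizon = 1000;		/* may be recomputed */
	TransactionId tuple_xmax = 1005;

	/* another session's commit lets the recomputed horizon advance */
	pruning_horizon = 1010;

	bool		prunable = tuple_xmax < pruning_horizon;
	bool		freeze_eligible = tuple_xmax < vacuum_oldest_xmin;

	/* removable under the advanced horizon, but not under OldestXmin */
	assert(prunable);
	assert(!freeze_eligible);

	printf("prunable=%d freeze_eligible=%d\n", prunable, freeze_eligible);
	return 0;
}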
diff --git c/src/test/isolation/expected/skewer.out i/src/test/isolation/expected/skewer.out
new file mode 100644
index 00000000000..e69de29bb2d
diff --git c/src/test/isolation/specs/skewer.spec i/src/test/isolation/specs/skewer.spec
new file mode 100644
index 00000000000..f9fa6ea8aa1
--- /dev/null
+++ i/src/test/isolation/specs/skewer.spec
@@ -0,0 +1,74 @@
+setup
+{
+    DROP TABLE IF EXISTS t;
+    DROP EXTENSION IF EXISTS amcheck;
+    DROP EXTENSION IF EXISTS pageinspect;
+
+    CREATE EXTENSION amcheck;
+    CREATE EXTENSION pageinspect;
+    CREATE TABLE t(key int, d int);
+}
+
+session s1
+step s1b  {BEGIN; SELECT txid_current();}
+step s1u1 {UPDATE t SET d = d + 1 WHERE key = 1; }
+step s1i3 {INSERT INTO t VALUES (3, 1);}
+step s1c  {COMMIT;}
+
+session s2
+step s2b  {BEGIN; SELECT txid_current();}
+step s2c  {COMMIT;}
+
+session s3
+step s3b  {BEGIN; SELECT txid_current();}
+step s3i1 {INSERT INTO t VALUES (1, 1);}
+step s3i2 {INSERT INTO t VALUES (2, 1);}
+step s3c  {COMMIT;}
+step s3a  {ABORT;}
+step s3u1 {UPDATE t SET d = d + 1 WHERE key = 1; }
+step s3u2 {UPDATE t SET d = d + 1 WHERE key = 2; }
+
+step s3ci { CREATE INDEX ON t(d)}
+
+session s4
+step s4show { SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, mask.raw_flags, mask.combined_flags, t_hoff, t_bits FROM heap_page_items(get_raw_page('t', 0)), heap_tuple_infomask_flags(t_infomask, t_infomask2) AS mask;}
+step s4verify { SELECT * FROM verify_heapam('t'); SELECT pg_backend_pid();}
+step s4vac { VACUUM (VERBOSE) t; }
+step s4vacfreeze { VACUUM (VERBOSE, FREEZE) t; }
+
+session s5
+step s5pin { BEGIN; DECLARE foo CURSOR FOR SELECT * FROM t; FETCH FROM foo;}
+
+### triggers endless loop without recent_dead logic in heap_prune_chain()
+#permutation s1b s2b s3b
+#    s3i1 s3u1 s3c s3u1 s1u1 s1c
+#    s4show
+#    s4verify
+#    s4vac
+#    s4show
+#    s4verify
+#    s2c
+
+### triggers:
+### updated version at offset 3 is also the updated version of tuple at offset 1
+### unfrozen tuple was updated to produce a tuple at offset 3 which is frozen
+permutation
+    #s3ci
+    s1b s2b s3b
+    s3i1 s3i2 s3c
+    s3b s3u1 s3a
+    s4show
+    s4vac
+    s4show
+    s3b s3u2 s3a
+    s4vac
+    s4show
+    s4vac
+    s4verify
+    s4show
+    s1i3
+    s1c
+    s4vacfreeze
+    s4show
+    s4verify
+    s2c