Hi,

On 2020-04-01 15:11:52 -0500, Kevin Grittner wrote:
> On Wed, Apr 1, 2020 at 2:43 PM Andres Freund <and...@anarazel.de> wrote:
> > The thing that makes me really worried is that the contents of the time
> > mapping seem very wrong. I've reproduced query results in a REPEATABLE
> > READ transaction changing (pruned without triggering an error).
>
> That is a very big problem. On the sort-of bright side (ironic in light of
> the fact that I'm a big proponent of using serializable transactions), none
> of the uses that I have personally seen of this feature use anything other
> than the default READ COMMITTED isolation level. That might help explain
> the lack of complaints for those using the feature. But yeah, I REALLY
> want to see a solid fix for that!
I don't think it's dependent on RR - it's just a bit easier to verify
that the query results are wrong that way.

> > And I've
> > reproduced rows not getting removed for much longer than they
> > should, according to old_snapshot_threshold.
> >
> > I suspect one reason for users not noticing either is that
> >
> > a) it's plausible that users of the feature would mostly have
> > long-running queries/transactions querying immutable or insert only
> > data. Those would not notice that, on other tables, rows are getting
> > removed, where access would not trigger the required error.
> >
> > b) I observe long-ish phases where no cleanup is happening (due to
> > oldSnapshotControl->head_timestamp getting updated more often than
> > correct). But if old_snapshot_threshold is small enough in relation to
> > the time the generated bloat becomes problematic, there will still be
> > occasions to actually perform cleanup.
>
> Keep in mind that the real goal of this feature is not to eagerly _see_
> "snapshot too old" errors, but to prevent accidental debilitating bloat due
> to one misbehaving user connection.

I don't think it's an "intentional" inaccuracy issue leading to this. The
map contents are just wrong; in particular the head_timestamp most of the
time is so new that TransactionIdLimitedForOldSnapshots() never even
consults the mapping.

When filling a new bucket, MaintainOldSnapshotThreshold() unconditionally
updates oldSnapshotControl->head_timestamp to be the current minute, which
means it'll take old_snapshot_threshold minutes till
TransactionIdLimitedForOldSnapshots() even looks at the mapping again. As
far as I can tell, with a large old_snapshot_threshold, it can take a very
long time to get to a head_timestamp that's old enough for
TransactionIdLimitedForOldSnapshots() to do anything.

Look at this trace of a pgbench run with old_snapshot_threshold enabled,
showing some of the debugging output added in the patch upthread.
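To make the pattern easier to see before wading through the trace, here's a
toy model of the maintenance logic - explicitly not the real snapmgr.c code,
just a stand-alone simplification of the behaviour described above, with
made-up names (MAP_ENTRIES, THRESHOLD, maintain(), mapping_usable()):

/*
 * Toy model of the old_snapshot_threshold time->xid mapping, only meant to
 * illustrate the head_timestamp behaviour described above.  NOT the real
 * snapmgr.c code; names and details are simplified.
 */
#include <stdio.h>

#define MAP_ENTRIES 20   /* stand-in for OLD_SNAPSHOT_TIME_MAP_ENTRIES */
#define THRESHOLD   10   /* old_snapshot_threshold, in minutes */

static int head_timestamp;  /* minute the mapping claims to start at */
static int count_used;      /* buckets currently in use */

/*
 * Once-per-minute maintenance, mimicking the problematic part: whenever new
 * buckets are appended, the head timestamp jumps all the way to "now".
 */
static void
maintain(int now)
{
    if (count_used == 0)
    {
        head_timestamp = now;
        count_used = 1;
    }
    else if (now < head_timestamp + count_used)
    {
        /* falls into an existing bucket; only its xmin would be updated */
    }
    else
    {
        int advance = now - head_timestamp;

        head_timestamp = now;   /* <-- the unconditional update */
        if (advance >= MAP_ENTRIES)
            count_used = 1;     /* everything thrown away, start over */
        else if (count_used + advance > MAP_ENTRIES)
            count_used = MAP_ENTRIES;
        else
            count_used += advance;
    }
}

/*
 * The lookup side only consults the mapping once the head timestamp is at
 * least THRESHOLD minutes in the past (cf. TransactionIdLimitedForOldSnapshots()).
 */
static int
mapping_usable(int now)
{
    return count_used > 0 && now - THRESHOLD >= head_timestamp;
}

int
main(void)
{
    int minute;

    /* one snapshot per minute, i.e. a continuous workload */
    for (minute = 0; minute <= 40; minute++)
    {
        maintain(minute);
        printf("minute %2d: head %2d, used %2d, limiting possible: %s\n",
               minute, head_timestamp, count_used,
               mapping_usable(minute) ? "yes" : "no");
    }
    return 0;
}

Running this, "limiting possible" first flips to yes around minute 25, and
only for a short window; the simulated head timestamps (0, 1, 3, 7, 15, ...)
line up with the head values in the trace below.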
This is with a threshold of 10min, in a freshly started database:

> 2020-04-01 13:49:00.000 PDT [1268502][2/43571:2068881994] WARNING: head 0
> min: filling 1 buckets starting at 0 for whenTaken 1 min, with xmin 2068881994
> 2020-04-01 13:49:00.000 PDT [1268502][2/43571:2068881994] WARNING: old
> snapshot mapping at "after update" with head ts: 1, current entries: 2 max
> entries: 20, offset: 0
> entry 0 (ring 0): min 1: xid 2068447214
> entry 1 (ring 1): min 2: xid 2068881994
>
> 2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING: old snapshot
> mapping at "before update" with head ts: 1, current entries: 2 max entries:
> 20, offset: 0
> entry 0 (ring 0): min 1: xid 2068447214
> entry 1 (ring 1): min 2: xid 2068881994
>
> 2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING: head 1 min:
> updating existing bucket 1 for whenTaken 2 min, with xmin 2069199511
> 2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING: old snapshot
> mapping at "after update" with head ts: 1, current entries: 2 max entries:
> 20, offset: 0
> entry 0 (ring 0): min 1: xid 2068447214
> entry 1 (ring 1): min 2: xid 2069199511
>
> 2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING: old
> snapshot mapping at "before update" with head ts: 1, current entries: 2 max
> entries: 20, offset: 0
> entry 0 (ring 0): min 1: xid 2068447214
> entry 1 (ring 1): min 2: xid 2069199511
>
> 2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING: head 1
> min: filling 2 buckets starting at 0 for whenTaken 3 min, with xmin 2069516499
> 2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING: old
> snapshot mapping at "after update" with head ts: 3, current entries: 4 max
> entries: 20, offset: 0
> entry 0 (ring 0): min 3: xid 2068447214
> entry 1 (ring 1): min 4: xid 2069199511
> entry 2 (ring 2): min 5: xid 2069516499
> entry 3 (ring 3): min 6: xid 2069516499
> ...
> 2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING: old
> snapshot mapping at "before update" with head ts: 7, current entries: 8 max
> entries: 20, offset: 0
> entry 0 (ring 0): min 7: xid 2068447214
> entry 1 (ring 1): min 8: xid 2071112480
> entry 2 (ring 2): min 9: xid 2071434473
> entry 3 (ring 3): min 10: xid 2071755177
> entry 4 (ring 4): min 11: xid 2072075827
> entry 5 (ring 5): min 12: xid 2072395700
> entry 6 (ring 6): min 13: xid 2072715464
> entry 7 (ring 7): min 14: xid 2073035816

Before the mapping change the database had been running for 15 minutes. But
the mapping starts only at 7 minutes from start.
And then it is updated to:

> 2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING: head 7
> min: filling 8 buckets starting at 0 for whenTaken 15 min, with xmin
> 2075918093
> 2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING: old
> snapshot mapping at "after update" with head ts: 15, current entries: 16 max
> entries: 20, offset: 0
> entry 0 (ring 0): min 15: xid 2068447214
> entry 1 (ring 1): min 16: xid 2071112480
> entry 2 (ring 2): min 17: xid 2071434473
> entry 3 (ring 3): min 18: xid 2071755177
> entry 4 (ring 4): min 19: xid 2072075827
> entry 5 (ring 5): min 20: xid 2072395700
> entry 6 (ring 6): min 21: xid 2072715464
> entry 7 (ring 7): min 22: xid 2073035816
> entry 8 (ring 8): min 23: xid 2075918093
> entry 9 (ring 9): min 24: xid 2075918093
> entry 10 (ring 10): min 25: xid 2075918093
> entry 11 (ring 11): min 26: xid 2075918093
> entry 12 (ring 12): min 27: xid 2075918093
> entry 13 (ring 13): min 28: xid 2075918093
> entry 14 (ring 14): min 29: xid 2075918093
> entry 15 (ring 15): min 30: xid 2075918093

I.e. the whole mapping will now be considered having started in that moment.
And we expand the size of the mapping by 8 at the same time, filling the new
buckets with the same xid - despite there being a continuous workload.

After a few more minutes we get:

> 2020-04-01 14:07:00.000 PDT [1268503][3/1473617:2077202085] WARNING: head 15
> min: updating existing bucket 4 for whenTaken 19 min, with xmin 2077202085
> 2020-04-01 14:07:00.000 PDT [1268503][3/1473617:2077202085] WARNING: old
> snapshot mapping at "after update" with head ts: 15, current entries: 16 max
> entries: 20, offset: 0
> entry 0 (ring 0): min 15: xid 2068447214
> entry 1 (ring 1): min 16: xid 2076238895
> entry 2 (ring 2): min 17: xid 2076559154
> entry 3 (ring 3): min 18: xid 2076880731
> entry 4 (ring 4): min 19: xid 2077202085
> entry 5 (ring 5): min 20: xid 2072395700
> entry 6 (ring 6): min 21: xid 2072715464
> entry 7 (ring 7): min 22: xid 2073035816
> entry 8 (ring 8): min 23: xid 2075918093
> entry 9 (ring 9): min 24: xid 2075918093
> entry 10 (ring 10): min 25: xid 2075918093
> entry 11 (ring 11): min 26: xid 2075918093
> entry 12 (ring 12): min 27: xid 2075918093
> entry 13 (ring 13): min 28: xid 2075918093
> entry 14 (ring 14): min 29: xid 2075918093
> entry 15 (ring 15): min 30: xid 2075918093

Note how the xids are not monotonically ordered. And how
TransactionIdLimitedForOldSnapshots() still won't be able to make use of the
mapping, as the head timestamp is only 4 minutes old (whenTaken == 19 min,
head == 15 min).

Greetings,

Andres Freund
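PS: To spell the last point out with the numbers from the trace (just
back-of-the-envelope arithmetic, mirroring the lookup-side check as I
understand it): the lookup side effectively compares
whenTaken - old_snapshot_threshold against the head timestamp, i.e.
19 min - 10 min = 9 min, which is well before the head at 15 min. So nothing
can be limited before minute 25 at the earliest, even though the map already
contains entries far older than the threshold.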