Hi,

On 2020-04-01 12:02:18 -0400, Robert Haas wrote:
> On Wed, Apr 1, 2020 at 11:09 AM Andres Freund <and...@anarazel.de> wrote:
> > There's really no reason at all to have bins of one minute. As it's a
> > PGC_POSTMASTER GUC, it should just have divided time into bins of
> > (old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
> > of a week there's no need to keep 10k bins, and the minimum threshold of
> > 1 minute obviously is problematic.
>
> I am very doubtful that this approach would have been adequate. It
> would mean that, with old_snapshot_threshold set to a week, the
> threshold for declaring a snapshot "old" would jump forward 16.8 hours
> at a time. It's hard for me to make a coherent argument right now as
> to exactly what problems that would create, but it's not very
> granular, and a lot of bloat-related things really benefit from more
> granularity. I also don't really see what the problem with keeping a
> bucket per minute in memory is, even for a week. It's only 60 * 24 * 7
> = ~10k buckets, isn't it? That's not really insane for an in-memory
> data structure. I agree that the code that does that maintenance being
> buggy is a problem to whatever extent that is the case, but writing
> the code to have fewer buckets wouldn't necessarily have made it any
> less buggy.
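To put concrete numbers on the scaling issue before replying - a quick
throwaway sketch I wrote just for this mail (not code from the patch;
the candidate bin widths are only examples):

/*
 * How many fixed-width time bins various old_snapshot_threshold values
 * would need, for a few candidate bin widths.  No single fixed width
 * works for both sub-second and multi-day thresholds.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	const int64_t	thresholds_ms[] = {100, 60 * 1000, 86400 * 1000LL,
									   7 * 86400 * 1000LL};
	const char	   *names[] = {"100 ms", "1 minute", "1 day", "1 week"};
	const int64_t	bin_widths_ms[] = {1, 1000, 60 * 1000};

	for (int t = 0; t < 4; t++)
	{
		printf("threshold %-8s:", names[t]);
		for (int b = 0; b < 3; b++)
			printf("  %10lld bins of %5lld ms",
				   (long long) (thresholds_ms[t] / bin_widths_ms[b]),
				   (long long) bin_widths_ms[b]);
		printf("\n");
	}

	return 0;
}

One-minute bins do give ~10k buckets for a week, as you say, but zero
usable resolution below a minute; bins fine enough for a 100 ms
threshold put us at hundreds of millions of buckets for a week.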
My issue isn't really that it's too many buckets right now, but that it
doesn't scale down to smaller thresholds. I think to be able to develop
this reasonably, it'd need to be able to support thresholds in the
sub-second range. And I don't see how you can have the same binning for
such small thresholds and for multi-day thresholds - we'd quickly go to
millions of buckets for longer thresholds. I really think we'd need to
support millisecond resolution to make this properly testable.


> > GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
> > basis to detect conflicts seems dangerous to me. Isn't that ignoring
> > inserts that are already in progress?
>
> How so?

Because it returns the end of the reserved WAL, not how far we've
actually inserted. I.e. there can be in-progress, but not finished,
modifications that will have an LSN < GetXLogInsertRecPtr(). But the
whenTaken timestamp could be such that these in-progress modifications
should already trigger an error (since the transaction limiting happens
before the WAL logging). I am not 100% sure, but I suspect that could
lead to errors not being thrown that should be, because
TestForOldSnapshot() will not see these in-progress modifications as
conflicting.

Hm, also, shouldn't the
    && PageGetLSN(page) > (snapshot)->lsn)
in TestForOldSnapshot() be a >=?


> > It currently silently causes wrong query results. There's no
> > infrastructure to detect / protect against that (*).
>
> Sure, and what if you break more stuff ripping it out? Ripping this
> volume of code out in a supposedly-stable branch is totally insane
> almost no matter how broken the feature is.

For the back branches I was just thinking of forcing the GUC to be off
(either by disallowing it to be set to on, or just warning when it's
set to true, but not propagating the value).

> I have no objection to the idea that *if* the feature is hopelessly
> broken, it should be removed.

I would be a lot less inclined to go that way if old_snapshot_threshold

a) weren't explicitly about removing still-needed data - in contrast to
   a lot of other features, where the effects of bugs are temporary,
   here they can be much larger.

b) were a previously working feature - but as far as I can tell, it
   never really was one.

c) had tests verifying that my fixes actually do the right thing. As it
   stands, I'd not just have to fix the bugs, I'd also have to develop
   a test framework that can test this.

While I wish I had been more forceful, and had reviewed more of the
code to point out more of the quality issues, I did argue hard against
the feature going in, on account of it being architecturally bad and
impactful. Which I think it has proven to be, several times over by
now. And now I'm kind of on the hook to fix it, it seems?

> I also think, and we've had this disagreement before, that you're far
> too willing to say "well, that's wrong so we need to hit it with a
> nuke." I complained when you added those error checks to vacuum in
> back-branches, and since that release went out people are regularly
> tripping those checks and taking prolonged outages for a problem that
> wasn't making them unhappy before. I know that in theory those people
> are better off because their database was always corrupted and now
> they know. But for some of them, those prolonged outages are worse
> than the problem they had before.

I think this is somewhat revisionist.
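For anybody not following that history closely: the checks in question
are roughly of the following shape (paraphrased from memory, not the
exact committed code). They fire when a tuple claims an xmin older than
the relfrozenxid vacuum is freezing up to, i.e. when the table is
already corrupted:

/*
 * Paraphrased sketch of the freeze-time sanity check.  An xmin older
 * than relfrozenxid means something already corrupted the table;
 * continuing would mean either resurrecting dead tuples or silently
 * losing live ones.
 */
if (TransactionIdIsNormal(xid) &&
	TransactionIdPrecedes(xid, relfrozenxid))
	ereport(ERROR,
			(errcode(ERRCODE_DATA_CORRUPTED),
			 errmsg_internal("found xmin %u from before relfrozenxid %u",
							 xid, relfrozenxid)));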
Sure, the errors were added after something like the tenth data
corruption bug around freezing that we didn't find for a long time
because of the lack of errors being thrown. But the error checks
weren't primarily added to find further bugs - they were added to
prevent data loss due to the bug being fixed, of which we had field
reports.

I'd asked over *weeks* for reviews of the bug fixes. Not a single
person expressed concerns about throwing new errors at that time.

First version of the patches with the errors:
https://postgr.es/m/20171114030341.movhteyakqeqx5pm%40alap3.anarazel.de
I pushed them over a month later:
https://postgr.es/m/20171215023059.oeyindn57oeis5um%40alap3.anarazel.de

There also wasn't (and isn't) a way to just report back that we can't
currently freeze the individual page, without doing major surgery. And
even if there were, what are we supposed to do other than throw an
error? We need to remove tuples below relfrozenxid, or we corrupt the
table.

As I've asked before, when you complained about those errors: what was
the alternative? Just have invisible tuples reappear? Delete them? I
don't think you've ever answered that.

You've brought this up before as an example of me being over-eager
with error checks. But I don't see how that meshes with the history
visible in the thread referenced above.

The more general issue, about throwing errors, is not just about the
people that don't give a hoot about whether their data evolves on its
own (perhaps a good tradeoff for them). Not throwing errors affects
*everyone*. Some people do care about their data. Without errors we
never figure out that we screwed up. And long-term, even the people
that care much more about availability than about data loss benefit
from the whole system getting more robust.

We've since found numerous further data-corrupting bugs because of the
relfrozenxid checks - some of very long-standing vintage, some in newly
added code.

Yes, hypothetically, I'd argue for introducing the checks solely for
the sake of finding bugs, even if I had been prescient enough to
foresee the number of issues caused (although, knowing that, I'd have
added block numbers to the error messages from the get-go). But I'd
definitely not do so in the back branches.

> I believe it was irresponsible to decide on behalf of our entire user
> base that they were better off with such a behavior change in a
> supposedly-stable branch, and I believe the same thing here.

As I explained above, I don't think that's fair with regard to the
relfrozenxid errors.

Setting that aside: in these discussions you always seem to argue only
for the people that don't care about their data. But, uh, a lot of
people do - should we just silently eat their data? And the long-term
quality of the project gets a lot better by throwing errors, because
it actually allows us to fix them.

As far as I can tell we couldn't even have added the checks to master
back then, if we follow your logic: a lot of the reports about hitting
the errors were on 11+ (partially due to pg_upgrade, partially because
they detected other bugs).

The likelihood of hurting people by adding checks at a later point
would be a lot lower if we stopped adding code that silently ignores
errors and hopes for the best. But we keep adding such "lenient" code.
We just found another long-standing cause of data corruption, btw,
that should have been found earlier if we had thrown errors, or at
least warnings. The locking around vac_update_datfrozenxid() has been
broken for a long, long time, but the silent 'if (bogus) return' made
it very hard to find:
https://www.postgresql.org/message-id/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de

Also, I've recently seen a number of databases being eaten because we
just ignore our own WAL logging rules to avoid throwing hard enough
errors (RelationTruncate() WAL-logging the truncation outside of a
critical section - oops, if you hit it your primary and
replicas/backups diverge, among many other bad consequences; see the
PS below for the pattern those rules call for).

Greetings,

Andres Freund
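PS: For anybody not familiar with the WAL logging rule I'm alluding to
above - the canonical pattern (a simplified sketch; the modification
step is just a placeholder, and src/backend/access/transam/README has
the authoritative rules) is to apply the change and insert the WAL
record together inside a critical section, so that a failure in between
escalates to a PANIC instead of letting the primary and its
replicas/backups silently diverge:

START_CRIT_SECTION();

/* apply the change to the shared buffer (placeholder for the real work) */
MarkBufferDirty(buffer);

if (RelationNeedsWAL(rel))
{
	XLogRecPtr	recptr;

	XLogBeginInsert();
	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
	/* rmgr/info are only examples; real callers register their own data */
	recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INSERT);
	PageSetLSN(BufferGetPage(buffer), recptr);
}

END_CRIT_SECTION();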