Greetings,

* Michael Banck (michael.ba...@credativ.de) wrote:
> On Tuesday, 2019-02-05 at 11:30 +0100, Tomas Vondra wrote:
> > On 2/5/19 8:01 AM, Andres Freund wrote:
> > > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
> > > > > > > I'm wondering (possibly again) about the existing early exit
> > > > > > > if one block cannot be read on retry: the command should count
> > > > > > > this as a kind of bad block, proceed on checking other files,
> > > > > > > and obviously fail in the end, but having checked everything
> > > > > > > else and generated a report. I do not think that this
> > > > > > > condition warrants a full stop. ISTM that under rare race
> > > > > > > conditions (eg, an unlucky concurrent "drop database" or
> > > > > > > "drop table") this could happen when online, although I could
> > > > > > > not trigger one despite heavy testing, so I'm possibly
> > > > > > > mistaken.
> > > > > >
> > > > > > This seems like a defensible judgement call either way.
> > > > >
> > > > > Right now we have a few tests that explicitly check that
> > > > > pg_verify_checksums fails on broken data ("foo" in the file).
> > > > > Those would then just get skipped AFAICT, which I think is the
> > > > > worse behaviour, but if everybody thinks that should be the way
> > > > > to go, we can drop/adjust those tests and make
> > > > > pg_verify_checksums skip them.
> > > > >
> > > > > Thoughts?
> > > >
> > > > My point is that it should fail as it does, only not immediately
> > > > (early exit), but after having checked everything else. This means
> > > > avoiding calling "exit(1)" here and there (lseek, fopen...), but
> > > > taking note that something bad happened, and calling exit only at
> > > > the end.
> > >
> > > I can see both as being valuable (one gives you a more complete
> > > picture, the other a quicker answer in scripts). For me that's the
> > > point where it's the prerogative of the author to make that choice.

... unless people here object or prefer other options, and then it's up to
discussion and hopefully some consensus comes out of it.

Also, I have to say that I really don't think the 'quicker answer' argument
holds any weight, which makes me question if that's a valid use-case.  If
there *isn't* an issue, which we would likely all agree is the case the
vast majority of the time this is going to be run, then it's going to take
quite a while anyway, and anyone calling it should expect and be prepared
for that.  In the extremely rare failure cases, what does exiting early
actually do for us?

> Personally, I would prefer to keep it as simple as possible for now and
> get this patch committed; in my opinion the behaviour is already like
> this (early exit on corrupt files) so I don't think the online
> verification patch should change this.

I'm also in the camp of "would rather it not exit immediately, so the
extent of the issue is clear".

> If we see complaints about this, then I'd be happy to change it
> afterwards.

I really don't think this is something we should change later on in a
future release.  If the consensus is that there really are two different
but valid use-cases then we should make it configurable, but I'm not
convinced there are.

> > Why not make this configurable, using a command-line option?
>
> I like this even less - this tool is about verifying checksums, so
> adding options on what to do when it encounters broken pages looks
> out-of-scope to me.  Unless we want to say it should generally abort on
> the first issue (i.e. on wrong checksums as well).

I definitely disagree that it's somehow 'out of scope' for this tool to
skip broken pages when we can tell that they're broken.

There is a question here about how to handle a short read, since that can
happen under normal conditions if we're unlucky.  The same is also true
for files disappearing entirely.

So, let's talk/think through a few cases:

A file with just 'foo\n' in it - could that be a page starting with an LSN
around 666F6F0A that we somehow only read the first few bytes of?  If not,
why not?  I could possibly see an argument that we expect to always get at
least 512 bytes in a read, or 4K, but it seems like we could run into edge
cases on odd filesystems or such.

In the end, I'm leaning towards categorizing different things, well,
differently - a short read would be reported as a NOTICE or equivalent,
perhaps, meaning that the test case needs to do something more than just
have a file with 'foo' in it, but that is likely a good thing anyway - the
test cases would be better if they were closer to the real world.  Other
read failures would be reported in a more serious category, assuming they
are "this really shouldn't happen" cases.

A file disappearing isn't a "can't happen" case, and might be reported at
the same 'NOTICE' level (or maybe with a 'verbose' option).

A file that's 8k in size and has a checksum, but the checksum isn't right,
seems pretty clear to me.

Might as well include a count of pages which have a valid checksum, I
would think, though perhaps only in a 'verbose' mode would that get
reported.

A completely zero'd page could also be reported at a NOTICE level or with
a count, or perhaps only with verbose.  (A rough sketch of what this kind
of categorization might look like is at the end of this mail.)

Other thoughts about use-cases and what should happen..?

Thanks!

Stephen
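To make the categorization above a little more concrete, here is a very
rough, standalone sketch of the shape I have in mind.  To be clear, this is
purely illustrative and is not the actual pg_verify_checksums code: the
checksum comparison is stubbed out (the real tool recomputes the page
checksum and compares it against the one stored in the page header), and
all of the function names, counters and message wording below are made up
for the example.

/*
 * Illustrative sketch only -- not the real pg_verify_checksums code.
 * Scan each file in 8kB blocks, classify problems instead of exiting on
 * the first one, and decide the exit status only at the very end.
 */
#include <stdio.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

static long blocks_ok = 0;		/* pages whose checksum verified */
static long blocks_zero = 0;	/* completely zeroed pages (NOTICE-level) */
static long short_reads = 0;	/* partial reads (NOTICE-level) */
static long blocks_bad = 0;		/* checksum mismatches (hard failures) */

/*
 * Stand-in for the real check, which recomputes the page checksum
 * (pg_checksum_page()) and compares it with the page header's stored
 * value.  This sketch always "passes".
 */
static int
page_checksum_ok(const char *page, long blkno)
{
	(void) page;
	(void) blkno;
	return 1;
}

static int
page_is_zero(const char *page)
{
	for (int i = 0; i < BLCKSZ; i++)
		if (page[i] != 0)
			return 0;
	return 1;
}

static void
scan_file(const char *path)
{
	char		page[BLCKSZ];
	size_t		nread;
	long		blkno = 0;
	FILE	   *fp = fopen(path, "rb");

	if (fp == NULL)
	{
		/* e.g. a concurrent DROP TABLE: note it and keep going */
		fprintf(stderr, "NOTICE: could not open \"%s\", skipping\n", path);
		return;
	}

	while ((nread = fread(page, 1, BLCKSZ, fp)) > 0)
	{
		if (nread != BLCKSZ)
		{
			/* short read: can happen legitimately under concurrent activity */
			fprintf(stderr, "NOTICE: short read (%zu bytes) at block %ld of \"%s\"\n",
					nread, blkno, path);
			short_reads++;
			break;
		}

		if (page_is_zero(page))
			blocks_zero++;
		else if (!page_checksum_ok(page, blkno))
		{
			fprintf(stderr, "ERROR: bad checksum at block %ld of \"%s\"\n",
					blkno, path);
			blocks_bad++;		/* note it, but do NOT exit(1) here */
		}
		else
			blocks_ok++;

		blkno++;
	}
	fclose(fp);
}

int
main(int argc, char **argv)
{
	for (int i = 1; i < argc; i++)
		scan_file(argv[i]);

	printf("ok: %ld, zeroed: %ld, short reads: %ld, bad checksums: %ld\n",
		   blocks_ok, blocks_zero, short_reads, blocks_bad);

	/* Only fail once everything has been scanned and reported on. */
	return (blocks_bad > 0) ? 1 : 0;
}

The point is just the shape of it: short reads, missing files and zeroed
pages get counted and reported at a NOTICE sort of level, checksum
mismatches get counted as real failures, and the exit status is decided
once, at the end, after everything has been scanned.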