Hi, On 2022-01-19 13:34:21 -0500, Tom Lane wrote: > Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> writes: > > On Sat, Jan 15, 2022 at 2:59 PM Julien Rouhaud <rjuju...@gmail.com> wrote: > >> Maybe I'm missing something but wouldn't > >> https://commitfest.postgresql.org/36/3448/ better solve the problem? > > > The error can cause the new background process proposed there in that > > thread to restart, which is again costly. Since we have LOG-only and > > continue behavior in CheckPointSnapBuild already, having the same > > behavior for CheckPointLogicalRewriteHeap helps a lot. > > [ stares at CheckPointLogicalRewriteHeap for awhile ... ] > > This code has got more problems than that. It took me awhile to > absorb it, but we don't actually care about the contents of any of > those files; all of the information is encoded in the file *names*.
I'm not following - we *do* need the contents of the files? They're applied as-needed in ApplyLogicalMappingFile(). > (This strikes me as probably not a very efficient design, compared > to putting the same data into a single text file; but for now I'll > assume we're not up for a complete rewrite.) That being the case, > I wonder what it is we expect fsync'ing the surviving files to do > exactly. We should be fsync'ing the directory not the files > themselves, no? Fsyncing the directory doesn't guarantee anything about the contents of files. But, you're right, we need an fsync of the directory too. > Other things that seem poorly thought out: > > * Why is the check for "map-" prefix after, rather than before, > the lstat? It doesn't seem to matter much - there shouldn't be a meaningful amount of other files in there. > * Why is it okay to ignore lstat failure? Seems like we might > as well not even have the lstat. Yea, that seems odd, not sure why that ended up this way. I guess the aim might have been to tolerate random files we don't have permissions for or such? > * The sscanf on the file name would not notice trailing junk, > such as an editor backup marker. Is that okay? I don't really see a problem with it - there shouldn't be other files matching the pattern - but it couldn't hurt to check the pattern matches exhaustively. > As far as the patch itself goes, I agree that failure to unlink > is noncritical, because such a file would have no further effect > and we can just ignore it. I don't agree. We iterate through the directory regularly on systems with catalog changes + logical decoding. An ever increasing list of gunk will make that more and more expensive. And I haven't heard a meaningful reason why we would have map-* files that we can't remove. Ignoring failures like this just makes problems much harder to debug and they tend to bite harder for it. > I think I also agree that failure of the sscanf is noncritical, because the > implication of that is that the file name doesn't conform to our > expectations, which means it's basically just like the check that causes us > to ignore names not starting with "map-". (Actually, isn't the separate > check for "map-" useless, given that sscanf will make the equivalent check?) Well, this way only files starting with "map-" are expected to conform to a strict format, the rest is ignored? > Anyway, I think possibly we can drop the bottom half of the loop > (the part trying to fsync non-removed files) in favor of fsync'ing > the directory afterwards. Thoughts? I don't think that'd be correct. In short: We should add a directory fsync, I'm fine with improving the error checking, but the rest seems like a net-negative with no convincing reasoning. Greetings, Andres Freund