Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> writes: > On Sat, Jan 15, 2022 at 2:59 PM Julien Rouhaud <rjuju...@gmail.com> wrote: >> Maybe I'm missing something but wouldn't >> https://commitfest.postgresql.org/36/3448/ better solve the problem?
> The error can cause the new background process proposed there in that > thread to restart, which is again costly. Since we have LOG-only and > continue behavior in CheckPointSnapBuild already, having the same > behavior for CheckPointLogicalRewriteHeap helps a lot. [ stares at CheckPointLogicalRewriteHeap for awhile ... ] This code has got more problems than that. It took me awhile to absorb it, but we don't actually care about the contents of any of those files; all of the information is encoded in the file *names*. (This strikes me as probably not a very efficient design, compared to putting the same data into a single text file; but for now I'll assume we're not up for a complete rewrite.) That being the case, I wonder what it is we expect fsync'ing the surviving files to do exactly. We should be fsync'ing the directory not the files themselves, no? Other things that seem poorly thought out: * Why is the check for "map-" prefix after, rather than before, the lstat? * Why is it okay to ignore lstat failure? Seems like we might as well not even have the lstat. * The sscanf on the file name would not notice trailing junk, such as an editor backup marker. Is that okay? As far as the patch itself goes, I agree that failure to unlink is noncritical, because such a file would have no further effect and we can just ignore it. I think I also agree that failure of the sscanf is noncritical, because the implication of that is that the file name doesn't conform to our expectations, which means it's basically just like the check that causes us to ignore names not starting with "map-". (Actually, isn't the separate check for "map-" useless, given that sscanf will make the equivalent check?) I started out wondering why the patch didn't also change the loop's other ERROR conditions to LOG. But we do want to ERROR if we're unable to sync transient state down to disk, and that is what the other steps (think they) are doing. It might be worth a comment to point that out though, before someone decides that if these errors are just LOG level then the others can be too. Anyway, I think possibly we can drop the bottom half of the loop (the part trying to fsync non-removed files) in favor of fsync'ing the directory afterwards. Thoughts? regards, tom lane