On Thu, Mar 4, 2021 at 5:39 PM Mark Dilger <mark.dil...@enterprisedb.com> wrote:
> I think Robert mistook why I was doing that. I was thinking about a
> different usage pattern. If somebody thinks a subset of relations have been
> badly corrupted, but doesn't know which relations those might be, they might
> try to find them with pg_amcheck, but wanting to just check the first few
> blocks per relation in order to sample the relations. So,
>
>   pg_amcheck --startblock=0 --endblock=9 --no-dependent-indexes
>
> or something like that. I don't think it's very fun to have it error out for
> each relation that doesn't have at least ten blocks, nor is it fun to have
> those relations skipped by error'ing out before checking any blocks, as they
> might be the corrupt relations you are looking for. But using --startblock
> and --endblock for this is not a natural fit, as evidenced by how I was
> trying to "fix things up" for the user, so I'll punt on this usage until some
> future version, when I might add a sampling option.

I admit I hadn't thought of that use case. I guess somebody could want
to do that, but it doesn't seem all that useful. Checking the first
up-to-ten blocks of every relation is not a very representative
sample, and it's not clear to me that sampling is a good idea even if
it were representative. What good is it to know that 10% of my
database is probably not corrupted?

On the other hand, people want to do all kinds of things that seem
strange to me, and this might be another one. But, if that's so, then
I think the right place to implement it is in amcheck itself, not
pg_amcheck. I think pg_amcheck should be, now and in the future, a
thin wrapper around the functionality provided by amcheck, just
providing target selection and parallel execution. If you put
something into pg_amcheck that figures out how long the relation is
and runs it on some of the blocks, that functionality is only
accessible to people who are accessing amcheck via pg_amcheck. If you
put it in amcheck itself and just expose it through pg_amcheck, then
it's accessible either way. It's probably cleaner and more performant
to do it that way, too.

So if you did add a sampling option in the future, that's the way I
would recommend doing it, but I think it is probably best not to go
there right now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
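
For illustration only, the "check the first up-to-ten blocks" semantics discussed in this thread can be sketched as clamping the requested range to the relation's length instead of raising an error. The function name and its parameters are hypothetical. This is not how amcheck or pg_amcheck behaves; it is a sketch of the sampling idea, assuming the relation's block count is already known.

```python
from typing import Optional, Tuple


def clamp_block_range(
    nblocks: int, startblock: int, endblock: int
) -> Optional[Tuple[int, int]]:
    """Return the inclusive (first, last) block range to check, or None.

    Hypothetical helper, not part of amcheck or pg_amcheck: it clamps the
    requested range to the relation's actual length rather than erroring
    out when the relation has fewer blocks than requested.
    """
    if startblock < 0 or endblock < startblock:
        raise ValueError("invalid block range")
    if nblocks <= 0 or startblock >= nblocks:
        # Empty relation, or the range starts past its end: nothing to check.
        return None
    return startblock, min(endblock, nblocks - 1)


# A 3-block relation sampled with startblock=0, endblock=9 is checked in full.
print(clamp_block_range(3, 0, 9))
# A 100-block relation is checked only on its first ten blocks.
print(clamp_block_range(100, 0, 9))
```

If logic like this lived in a client-side wrapper, only users of that wrapper would get it, which is the layering concern raised above.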