Alvaro Herrera <alvhe...@alvh.no-ip.org> writes: > On 2021-Apr-08, Tom Lane wrote: >> Looks like this has issues under EXEC_BACKEND: >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2021-04-08%2005%3A50%3A08
> Hmm, I couldn't reproduce this under EXEC_BACKEND or otherwise, but I > think this is unrelated to that, but rather a race condition. Yeah. I hit this on another machine that isn't using EXEC_BACKEND, and I concur it looks more like a race condition. I think the problem is that autovacuum is calling find_all_inheritors() on a relation it has no lock on, contrary to that function's API spec. find_all_inheritors assumes the OID it's given is valid and locked, and adds it to the result list automatically. Then it looks for children, and won't find any in the race case where somebody else just dropped the table. So we come back to relation_needs_vacanalyze with a list of just the original rel OID, and since this loop believes that every syscache fetch it does will succeed, kaboom. I do not think it is sane to do find_all_inheritors() with no lock, so I'd counsel doing something about that rather than band-aiding the symptom. On the other hand, it's also not really okay not to have an if-test-and-elog after the SearchSysCache call. "A cache lookup cannot fail" is not an acceptable assumption in my book. BTW, another thing that looks like a race condition is the extract_autovac_opts() call that is done a little bit earlier, also without lock. I think this is actually safe, but it's ONLY safe because we resisted the calls by certain people to add a toast table to pg_class. Otherwise, fetching reloptions could have involved a toast pointer dereference, and it would then be racy whether the toasted data was still there. As-is, even if the pg_class row we're looking at has been deleted, we can safely disassemble its reloptions. I think this matter is deserving of a comment at least. regards, tom lane