On Mon, Sep 25, 2017 at 12:10 PM, Michael Paquier <michael.paqu...@gmail.com> wrote: > On Sat, Sep 23, 2017 at 4:13 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: >> Somebody inserted this into vacuum.c's get_rel_oids(): >> >> tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid)); >> if (!HeapTupleIsValid(tuple)) >> elog(ERROR, "cache lookup failed for relation %u", relid); >> >> apparently without having read the very verbose comment two lines above, >> which points out that we're not taking any lock on the target relation. >> So, if that relation is concurrently being dropped, you're likely to >> get "cache lookup failed for relation NNNN" rather than anything more >> user-friendly. > > This has been overlooked during the lookups of 3c3bb99, and by > multiple people including me. elog() should never be things users can > face, as well as cache lookups.
FWIW, the same thing can happen when specifying an invalid replication origin name to pg_replication_origin_advance() and pg_replication_origin_progress(). These probably should fixed as well. > >> A minimum-change fix would be to replace the elog() with an ereport >> that produces the same "relation does not exist" error you'd have >> gotten from RangeVarGetRelid, had the concurrent DROP TABLE committed >> a few microseconds earlier. But that feels like its's band-aiding >> around the problem. > > Yeah, that's not right. This is a cache lookup error at the end. > >> What I'm wondering about is changing the RangeVarGetRelid call to take >> ShareUpdateExclusiveLock rather than no lock. That would protect the >> syscache lookup, and it would also make the find_all_inheritors call >> a lot more meaningful. >> >> If we're doing a VACUUM, the ShareUpdateExclusiveLock would be dropped >> as soon as we close the caller's transaction, and then we'd acquire >> the same or stronger lock inside vacuum_rel(). So that seems fine. >> If we're doing an ANALYZE, then the lock would continue to be held >> and analyze_rel would merely be acquiring it an extra time, so we'd >> actually be removing a race-condition failure scenario for ANALYZE. >> This would mean a few more cycles in lock management, but since this >> only applies to a manual VACUUM or ANALYZE that specifies a table >> name, I'm not too concerned about that. > > I think that I am +1 on that. Testing such a thing I am not seeing > anything wrong either. The call to find_all_inheritors should also use > ShareUpdateExclusiveLock, and the lock on top of RangeVarGetRelid() > needs to be reworked. > > Attached is a proposal of patch. FWIW, I agreed the approach of this patch, and found no problems in the patch. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers