I wrote: >> I've been chasing the intermittent "cache lookup failed for access method >> 403" failure at session startup that's been seen lately in the buildfarm, >> for instance here: >> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-06-04%2019%3A22%3A46 >> (Axolotl has shown this 3 times in the last 90 days, not sure if any >> others have seen it.) I hypothesized that this was triggered by the >> "VACUUM FULL pg_am" in the concurrently running vacuum.sql regression >> test, so I started running the regression tests in parallel with a >> shell script doing >> while sleep 0.1; do psql -c 'vacuum full pg_am' regression; done >> and sure enough, I can reproduce it once in awhile.
BTW, while this recipe works in HEAD (it might take 40 or 50 cycles of the regression tests, but it works), I've been unable to reproduce the failure this way in any back branch. I'm not entirely sure why, but I'm suspicious that it's because HEAD has more tests running concurrently with vacuum.sql than there used to be. However, I can reproduce the failure with 100% reliability in all branches by injecting a conditional cache reset just before writing the init file. For example, apply this patch: diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index f60f3cb..846d880 100644 *** a/src/backend/utils/cache/relcache.c --- b/src/backend/utils/cache/relcache.c *************** RelationCacheInitializePhase3(void) *** 3489,3494 **** --- 3489,3498 ---- */ InitCatalogCachePhase2(); + /* If superuser, force cache flush */ + if (superuser()) + RelationCacheInvalidate(); + /* reset initFileRelationIds list; we'll fill it during write */ initFileRelationIds = NIL; then create at least one non-superuser user, then do this: 1. Connect as superuser and do "vacuum full pg_am". (This causes the current relcache init file to be removed.) 2. Disconnect and reconnect as superuser. (The new session goes through the above logic, so that it sees a relcache flush occur just at the critical time. It correctly doesn't write a bogus init file, but it's left with an initFileRelationIds list with only 11 entries.) 3. In another window, connect as a non-superuser user. (This session will successfully write a new init file, since the above hack doesn't trigger in it.) 4. In the session from step 2, again do "vacuum full pg_am". (Now the bug manifests: we fail to remove the init file although it is now stale.) 5. Now things are broken: all new sessions fail with psql: FATAL: cache lookup failed for access method 403 So that's mainly for the archives, to document a reliable way to test the problem. However, there's an interesting takeaway from this. Since this problem is triggered by a cache flush at just the wrong time, you might hope that the buildfarm's CLOBBER_CACHE_ALWAYS critters would have caught it. The reason they fail to do so is that, because they *always* flush at every opportunity, the relcache write code *always* fails and so such a build never creates an init file that could then become stale. This suggests that CLOBBER_CACHE_ALWAYS is actually missing a pretty large part of the cache behavioral space. Maybe we should devise some sort of "CLOBBER_CACHE_RANDOMLY" option that would inject cache flush events more selectively, perhaps only once every thousand opportunities or so. And perhaps not only full cache reset events, though I confess to not being sure what that ought to look like. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers