My client has been seeing regular instances of the following sort of problem:
...
03:06:09.453 exec_simple_query, postgres.c:900
03:06:12.042 XX000: could not find pg_class tuple for index 2662 at character 13
03:06:12.042 RelationReloadIndexInfo, relcache.c:1740
03:06:12.042 INSERT INTO zzz_k(k) SELECT ...
03:06:12.045 00000: statement: ABORT
03:06:12.045 exec_simple_query, postgres.c:900
03:06:12.045 00000: duration: 0.100 ms
03:06:12.045 exec_simple_query, postgres.c:1128
03:06:12.046 00000: statement: INSERT INTO temp_807 VALUES (...)
03:06:12.046 exec_simple_query, postgres.c:900
03:06:12.046 XX000: could not find pg_class tuple for index 2662 at character 13
03:06:12.046 RelationReloadIndexInfo, relcache.c:1740
03:06:12.046 INSERT INTO temp_807 VALUES (...)
03:06:12.096 08P01: unexpected EOF on client connection
03:06:12.096 SocketBackend, postgres.c:348
03:06:12.096 XX000: could not find pg_class tuple for index 2662
03:06:12.096 RelationReloadIndexInfo, relcache.c:1740
03:06:12.121 00000: disconnection: session time: 0:06:08.537 user=ZZZ database=ZZZ_01
03:06:12.121 log_disconnections, postgres.c:4339

The above happens regularly (but not completely predictably), corresponding with a daily cronjob that checks the catalogs for bloat and does vacuum full and/or reindex as needed. Since some of the applications make very heavy use of temp tables, this usually means pg_class and pg_index get both vacuum full and reindex.

Sometimes queries will fail because they are unable to open a table's underlying file. On investigation the file is absent both from the catalogs and from the filesystem, so I don't know what table it refers to:

20:41:19.063 ERROR:  could not open file "pg_tblspc/16401/PG_9.0_201008051/16413/1049145092": No such file or directory
20:41:19.063 STATEMENT:  insert into r_ar__30 select aid, mid, pid, sum(wdata) as wdata, ...
--
20:41:19.430 ERROR:  could not open file "pg_tblspc/16401/PG_9.0_201008051/16413/1049145092": No such file or directory
20:41:19.430 STATEMENT:  SELECT nextval('j_id_seq')

Finally, I have seen several instances of vacuum full itself failing to read data:

03:05:45.699 00000: statement: vacuum full pg_catalog.pg_index;
03:05:45.699 exec_simple_query, postgres.c:900
03:05:46.142 XX001: could not read block 65 in file "pg_tblspc/16401/PG_9.0_201008051/16416/1049146489": read only 0 of 8192 bytes
03:05:46.142 mdread, md.c:656
03:05:46.142 vacuum full pg_catalog.pg_index;

This occurs on PostgreSQL 9.0.4 on 32-core, 512GB Dell boxes. We have identical systems still running 8.4.8 that do not have this issue, so I'm assuming it is related to the vacuum full work done for 9.0. Oddly, we don't see this on the smaller hosts (8-core, 64GB, slower CPUs) running 9.0.4, so it may be timing related.

This seems possibly related to the issues in:

Bizarre buildfarm failure on baiji: can't find pg_class_oid_index
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02038.php

Broken HOT chains in system catalogs
http://archives.postgresql.org/pgsql-hackers/2011-04/msg00777.php

As far as I can tell from the logs I have, once a session sees one of these errors, any subsequent query will hit it again until the session exits. However, it does not seem to harm other sessions or leave any persistent damage (crossing fingers and hoping here).

I'm ready to do any testing/investigation/instrumented builds etc. that may be helpful in resolving this.

Regards

-dg

--
David Gould       da...@sonic.net      510 536 1443  510 282 0869
If simplicity worked, the world would be overrun with insects.
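P.S. For concreteness, the daily maintenance job mentioned above boils down to something like the sketch below. The bloat check shown here is an illustrative stand-in, not our actual script; the point is just that pg_class and pg_index regularly get both vacuum full and reindex.

```sql
-- Illustrative sketch only: the real cronjob estimates bloat from
-- relpages vs. expected size; this stand-in just shows the shape.
SELECT c.relname, pg_relation_size(c.oid) AS bytes
  FROM pg_class c
 WHERE c.relname IN ('pg_class', 'pg_index')
 ORDER BY bytes DESC;

-- When a catalog looks bloated past a threshold, rewrite and reindex it:
VACUUM FULL pg_catalog.pg_class;
REINDEX TABLE pg_catalog.pg_class;
VACUUM FULL pg_catalog.pg_index;
REINDEX TABLE pg_catalog.pg_index;
```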