My client has been seeing regular instances of the following sort of problem:

...
 03:06:09.453 exec_simple_query, postgres.c:900
 03:06:12.042 XX000: could not find pg_class tuple for index 2662 at character 13
 03:06:12.042 RelationReloadIndexInfo, relcache.c:1740
 03:06:12.042 INSERT INTO zzz_k(k) SELECT ...
 03:06:12.045 00000: statement: ABORT
 03:06:12.045 exec_simple_query, postgres.c:900
 03:06:12.045 00000: duration: 0.100 ms
 03:06:12.045 exec_simple_query, postgres.c:1128
 03:06:12.046 00000: statement: INSERT INTO temp_807
                      VALUES (...)
 03:06:12.046 exec_simple_query, postgres.c:900
 03:06:12.046 XX000: could not find pg_class tuple for index 2662 at character 13
 03:06:12.046 RelationReloadIndexInfo, relcache.c:1740
 03:06:12.046 INSERT INTO temp_807
                      VALUES (...)
 03:06:12.096 08P01: unexpected EOF on client connection
 03:06:12.096 SocketBackend, postgres.c:348
 03:06:12.096 XX000: could not find pg_class tuple for index 2662
 03:06:12.096 RelationReloadIndexInfo, relcache.c:1740
 03:06:12.121 00000: disconnection: session time: 0:06:08.537 user=ZZZ database=ZZZ_01
 03:06:12.121 log_disconnections, postgres.c:4339


The above happens regularly (but not completely predictably) and coincides
with a daily cronjob that checks the catalogs for bloat and runs vacuum full
and/or reindex as needed. Since some of the applications make very heavy
use of temp tables, this usually means pg_class and pg_index get a vacuum
full and a reindex.

Sometimes queries will fail because they cannot open a table's underlying
file. On investigation the file is absent from both the catalogs and the
filesystem, so I don't know what table it refers to:

 20:41:19.063  ERROR:  could not open file "pg_tblspc/16401/PG_9.0_201008051/16413/1049145092": No such file or directory
 20:41:19.063  STATEMENT:  insert into r_ar__30
                       select aid, mid, pid, sum(wdata) as wdata, ...
--
 20:41:19.430  ERROR:  could not open file "pg_tblspc/16401/PG_9.0_201008051/16413/1049145092": No such file or directory
 20:41:19.430  STATEMENT: SELECT nextval('j_id_seq')
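
For anyone wanting to trace such a path back to a relation, here is a
minimal sketch. It assumes the usual 9.0 layout for non-default tablespaces,
pg_tblspc/<tablespace OID>/PG_<version>_<catalog version>/<database OID>/<relfilenode>.
The helper names parse_relation_path and lookup_sql are hypothetical, made
up for this illustration. The generated query has to be run while connected
to the database that owns the database OID.

```python
import re

# Assumed layout: pg_tblspc/<tablespace oid>/PG_<version>_<catversion>/<db oid>/<relfilenode>[.<segment>]
PATH_RE = re.compile(
    r"pg_tblspc/(?P<tablespace>\d+)/PG_(?P<version>[\d.]+)_(?P<catversion>\d+)/"
    r"(?P<database>\d+)/(?P<relfilenode>\d+)(?:\.(?P<segment>\d+))?$"
)


def parse_relation_path(path):
    """Split a relation file path from a log message into its components."""
    m = PATH_RE.search(path)
    if m is None:
        raise ValueError("unrecognized relation path: %r" % path)
    d = m.groupdict()
    return {
        "tablespace_oid": int(d["tablespace"]),
        "server_version": d["version"],
        "catalog_version": int(d["catversion"]),
        "database_oid": int(d["database"]),
        "relfilenode": int(d["relfilenode"]),
        "segment": int(d["segment"] or 0),  # 0 when there is no .N suffix
    }


def lookup_sql(info):
    """Build a catalog query that maps the relfilenode back to a relation.

    pg_relation_filenode() is used instead of pg_class.relfilenode so that
    mapped catalogs (which store 0 in relfilenode) are covered as well.
    """
    return (
        "SELECT oid::regclass AS relation, relkind FROM pg_class "
        "WHERE pg_relation_filenode(oid::regclass) = %d;" % info["relfilenode"]
    )


if __name__ == "__main__":
    info = parse_relation_path("pg_tblspc/16401/PG_9.0_201008051/16413/1049145092")
    print(info)
    print(lookup_sql(info))
```

An empty result from that query, together with a missing file on disk,
matches what is described above: the relfilenode no longer belongs to any
relation the catalogs know about.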


Finally, I have seen several instances of vacuum full itself failing to
read data:

 03:05:45.699 00000: statement: vacuum full pg_catalog.pg_index;
 03:05:45.699 exec_simple_query, postgres.c:900
 03:05:46.142 XX001: could not read block 65 in file "pg_tblspc/16401/PG_9.0_201008051/16416/1049146489": read only 0 of 8192 bytes
 03:05:46.142 mdread, md.c:656
 03:05:46.142 vacuum full pg_catalog.pg_index;
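
To get a feel for how often each of the three failures shows up, a log scan
along the following lines may help. This is only a sketch: the signature
names and the tally_errors function are hypothetical, and the patterns are
taken from the message texts quoted in this post, not from any complete
list of related errors.

```python
import re
import sys
from collections import Counter

# Message fragments copied from the log excerpts in this post.
SIGNATURES = {
    "missing_pg_class_tuple": re.compile(r"could not find pg_class tuple for index \d+"),
    "missing_file": re.compile(r"could not open file"),
    "short_read": re.compile(r"could not read block \d+ in file"),
}


def tally_errors(lines):
    """Count how many log lines match each known error signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
                break  # a line is counted under at most one signature
    return counts


if __name__ == "__main__":
    # Usage: python tally.py < postgresql.log
    for name, count in sorted(tally_errors(sys.stdin).items()):
        print("%-24s %d" % (name, count))
```

Since the log lines shown here carry no backend PID, this counts messages
only; grouping by session would need a log_line_prefix that includes %p.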

This occurs on postgresql 9.0.4 on 32 core, 512GB Dell boxes. We have
identical systems still running 8.4.8 that do not have this issue, so I'm
assuming it is related to the vacuum full work done for 9.0. Oddly, we don't
see this on the smaller hosts (8 core, 64GB, slower cpus) running 9.0.4,
so it may be timing related.

This seems possibly related to the issues in:

  Bizarre buildfarm failure on baiji: can't find pg_class_oid_index
    http://archives.postgresql.org/pgsql-hackers/2010-02/msg02038.php
  Broken HOT chains in system catalogs
    http://archives.postgresql.org/pgsql-hackers/2011-04/msg00777.php

As far as I can tell from the logs I have, once a session sees one of these
errors, any subsequent query will hit it again until the session exits.
However, it does not seem to harm other sessions or leave any persistent
damage (crossing fingers and hoping here).

I'm ready to do any testing/investigation/instrumented builds etc that may be
helpful in resolving this.

Regards

-dg

-- 
David Gould       da...@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
