Michael Guerin wrote:
Hi All,

   I've been getting these errors ("ERROR:  cache lookup failed for
relation 17442") in my logs for a while now. It originally seemed
like a hardware problem, but now we are getting them pretty consistently
on a couple of servers. I've scaled down the schema to the one table and
the function involved, and included a code snippet that opens a bunch of
connections and loops around calling the same function. It usually
takes 100-2000 iterations before these messages start appearing in the
log. I've also included the original function, which takes about 10,000
iterations before the error starts showing. I should note that we've been
getting these errors since version 7; this is the first time they have been
reproducible.

With the original function, the log messages were slightly different and
usually caused the server to reset:
e.g.
ERROR:  type "t" already exists
ERROR:  duplicate key violates unique constraint "pg_type_typname_nsp_index"
ERROR:  duplicate key violates unique constraint "pg_type_typname_nsp_index"
ERROR:  duplicate key violates unique constraint "pg_type_typname_nsp_index"
CONTEXT:  SQL statement "create temp table tmp_children ( uniqid bigint, memberid bigint, membertype varchar(50), ownerid smallint, tag varchar(50), level int4 )"
    PL/pgSQL function "fngetcompositeids2" line 14 at SQL statement
ERROR:  duplicate key violates unique constraint "pg_type_typname_nsp_index"
ERROR:  cache lookup failed for type 2449707570
FATAL:  cache lookup failed for type 2449707570

Environment info: Postgres v8, SUSE Linux with the latest kernel patches,
filesystem: reiserfs.

Please let me know if you need any more information. No data is needed,
just the included schema.

Thanks
Michael



Michael,

The interesting thing about this bug is that we saw the same thing on a customer's machine some time ago. It occurred after a certain (small) script had been run for roughly the 100,001st time on an empty database. So this one does not seem to be related to the schema; it looks more or less random.
Even more interesting: we copied the data directory from the customer and were not able to reproduce the same behaviour on a different machine.
The strange thing is that after doing a checkpoint and restarting the database, the problem still occurred, yet running the same binaries on a different machine did not show the error.
We stepped through it with gdb but could not find anything strange.
Can you reliably reproduce the problem on a different machine after an arbitrary number of iterations? We couldn't.


Looking at the code, this is a null pointer being caught by the system.
Something seems to be corrupting memory.

        Hans

--
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/660/816 40 77
www.cybertec.at, www.postgresql.at

