Re: [BUGS] Race-condition with failed block-write?

Arjen van der Meijden Wed, 14 Sep 2005 17:44:51 -0700

On 13-9-2005 16:25, Tom Lane wrote:

Arjen van der Meijden <[EMAIL PROTECTED]> writes:


It's highly unlikely that that query has anything to do with it, since
it's not touching anything but system catalogs and not trying to write
them either.


Indeed, other things trigger it as well.

The first thing you ought to find out is which table
1663/2013826/9975789 is, and look to see if the corrupted LSN value is
already present on disk in that block.
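
A minimal sketch of how an identifier like 1663/2013826/9975789 could be turned into catalog lookups. It assumes the three numbers are, in order, the tablespace OID, the database OID and the relfilenode; the function name `catalog_queries` is made up for this illustration, and the queries are only built as strings here, not executed.

```python
def catalog_queries(identifier):
    """Split 'tablespace/database/relfilenode' and build catalog lookups.

    Assumption: the three numbers are, in order, the tablespace OID,
    the database OID and the relfilenode of the relation.
    """
    spc, db, rel = (int(part) for part in identifier.split("/"))
    return {
        "tablespace": f"SELECT spcname FROM pg_tablespace WHERE oid = {spc};",
        "database": f"SELECT datname FROM pg_database WHERE oid = {db};",
        # This one has to be run while connected to the database found above.
        "relation": f"SELECT relname, relkind FROM pg_class WHERE relfilenode = {rel};",
    }


queries = catalog_queries("1663/2013826/9975789")
for name, sql in queries.items():
    print(f"{name}: {sql}")
```

In the result of the last query, a relkind of 'i' marks an index and 'r' an ordinary table.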


Well, it's an index, not a table. It was the index:
"pg_class_relname_nsp_index" on pg_class(relname, relnamespace).

Using pg_filedump I extracted the LSN for block 21 and indeed, that was
already 67713428 instead of something below 2E73E53C. It wasn't that
block alone though, here are a few LSN-lines from it:


 LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)
 LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)
 LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
 LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)
 LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)
 LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
 LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)

I tried other files and each one I tried only had LSNs of 0.
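
A rough sketch of how LSN lines like the ones above could be checked against a flush position. It assumes the values quoted in this message are hexadecimal record offsets and, since only the offset 2E73E53C is given, that the flush position lies in logid 41; `parse_lsns` and `lsns_beyond` are names invented for this example.

```python
import re

# Matches lines such as: "LSN:  logid     41 recoff 0x676f5174"
LSN_LINE = re.compile(r"LSN:\s+logid\s+(\d+)\s+recoff\s+0x([0-9a-fA-F]+)")


def parse_lsns(dump_text):
    """Return a (logid, recoff) pair for every LSN line in the text."""
    return [(int(m.group(1)), int(m.group(2), 16))
            for m in LSN_LINE.finditer(dump_text)]


def lsns_beyond(lsns, flushed):
    """Return the LSNs that sort after the (logid, recoff) flush position."""
    # Tuples compare element by element: logid first, then recoff.
    return [lsn for lsn in lsns if lsn > flushed]


sample = """
 LSN:  logid     41 recoff 0x676f5174      Special  8176 (0x1ff0)
 LSN:  logid     25 recoff 0x3c6c5504      Special  8176 (0x1ff0)
 LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
 LSN:  logid     41 recoff 0x2ea88190      Special  8176 (0x1ff0)
 LSN:  logid      1 recoff 0x68e2f660      Special  8176 (0x1ff0)
 LSN:  logid     41 recoff 0x2ea8a270      Special  8176 (0x1ff0)
 LSN:  logid      1 recoff 0x68e2f6a4      Special  8176 (0x1ff0)
"""

lsns = parse_lsns(sample)
# ASSUMPTION: logid 41 is a guess; only the offset appears in the message.
flushed = (41, 0x2E73E53C)
suspect = lsns_beyond(lsns, flushed)

for logid, recoff in suspect:
    print(f"logid {logid} recoff 0x{recoff:08x} is past the flush position")
```

Comparing the pairs as tuples keeps the ordering right when the logid differs, which a comparison on the offset alone would get wrong.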

When trying (\d indexname in psql) to determine to which table that
index belonged I noticed it got the errors again, but for another file
(pg_index this time). And another try (oid2name ...) after that hit yet
another file (the pg_class table). All those files were last changed
somewhere around August 25, so no new changes.

On that day I did some active query tuning, and a few times, when the
selects took too long, I issued immediate shutdowns. There were no
warnings about broken records in the log afterwards though, so I don't
believe anything got damaged.

After that I loaded some fresh data from a production database using
either pg_restore or psql < some-file-from-pg_dump.sql (I don't know
which one anymore). A few days later I shut down that postgres,
installed 8.1-beta and used that (in another directory of course). This
8.0.3 only came back up because of a reboot and hasn't been used since
that reboot.


I guess those system tables got mixed up during that reload?

If it is, then we've probably
not got much chance of finding out how it got there.  If it is *not* on
disk, but you have a repeatable way of causing this to happen starting
from a clean postmaster start, then that's pretty interesting --- but
I don't know any way of figuring it out short of groveling through the
code with a debugger.  If you're not already pretty familiar with the PG
code, coaching you remotely isn't going to work very well :-(.  I'd be
glad to look into it if you can get me access to the machine though.

Well, I can very probably give you that access. But as you say, finding
out what went wrong is very hard to do.


Best regards,

Arjen van der Meijden
