On 13-9-2005 20:04, Tom Lane wrote:
> Arjen van der Meijden <[EMAIL PROTECTED]> writes:
>> On 13-9-2005 16:25, Tom Lane wrote:
>> Well, it's an index, not a table. It was the index:
>> "pg_class_relname_nsp_index" on pg_class(relname, relnamespace).
>
> Ah. So you've reindexed pg_class at some point. Reindexing it again
> would likely get you out of this.

Unless reindexing happens as part of some other command, I didn't do that.
The last time 'grep' was able to find a reference to something being
reindexed was in June; something (maybe me, but I doubt it, since I suppose
I'd also have reindexed the user tables) was reindexing all system tables
back then.

Besides, it's not just that index on pg_class; pg_class itself (and
pg_index) have wrong LSNs as well.
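
For the record, if reindexing does turn out to be the way out, I assume it
would go roughly like this (only a sketch, assuming 8.0-era REINDEX behaviour
and a placeholder database name; not something I have run here):

# reindex the damaged index; if the corruption gets in the way of normal
# operation, this may have to be done from a standalone backend started
# with system indexes disabled (the -P option)
psql -c "REINDEX INDEX pg_class_relname_nsp_index;" mydb
psql -c "REINDEX TABLE pg_class;" mydb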

>> Using pg_filedump I extracted the LSN for block 21 and indeed, that was
>> already 67713428 instead of something below 2E73E53C. It wasn't that
>> block alone though; here are a few LSN lines from it:
>>
>> LSN: logid 41 recoff 0x676f5174 Special 8176 (0x1ff0)
>> LSN: logid 25 recoff 0x3c6c5504 Special 8176 (0x1ff0)
>> LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
>> LSN: logid 41 recoff 0x2ea88190 Special 8176 (0x1ff0)
>> LSN: logid 1 recoff 0x68e2f660 Special 8176 (0x1ff0)
>> LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
>> LSN: logid 1 recoff 0x68e2f6a4 Special 8176 (0x1ff0)
>
> logid is the high-order half of the LSN, so there's nothing wrong with
> those other pages --- it's only the first one you show there that seems
> to be past the current end of WAL.

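For reference, logid/recoff map onto the usual hex LSN notation like this (a
quick sketch, assuming a bash-style printf that accepts 0x constants):

# 41 decimal = 0x29, so the suspicious page's LSN reads as 29/676F5174,
# which can be compared directly against the current end of WAL
printf '%X/%X\n' 41 0x676f5174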

There were 3 blocks out of 40 with an LSN like the first one above in that
index file, i.e. with high-order 41 and a recoff of 0x67[67]something.
In the pg_class file there were 6 blocks, of which 5 had LSNs like those in
that index. And pg_index had 3 blocks, with 1 wrong.
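
In case it is useful, this is roughly how those counts can be reproduced (a
sketch; the database OID and relfilenode are placeholders to be looked up,
not the real values here):

# placeholders: database OID from pg_database, relfilenode from pg_class
f=/var/lib/postgresql/data/base/<dboid>/<relfilenode>
pg_filedump "$f" | grep -c 'LSN:'                  # total blocks in the file
pg_filedump "$f" | grep -c 'logid 41 recoff 0x67'  # blocks with suspicious LSNs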

>> On that day I did some active query-tuning, but a few times it took too
>> long, so I issued immediate shutdowns when the selects took too long.
>> There were no warnings about broken records afterwards in the log,
>> though, so I don't believe anything got damaged afterwards.
>
> I have a feeling something may have gone wrong here, though it's hard to
> say what. If the bogus pages in the other tables all have LSNs close to
> this one then that makes it less likely that this is a random corruption
> event --- what would be more plausible is that end of WAL really was
> that high and somehow the WAL counter got reset back during one of those
> forced restarts.
>
> Can you show us ls -l output for the pg_xlog directory? I'm interested
> to see the file names and mod dates there.

Here you go:
l /var/lib/postgresql/data/pg_xlog/
total 145M
drwx------ 3 postgres postgres 4.0K Sep 1 12:37 .
drwx------ 8 postgres postgres 4.0K Sep 13 20:31 ..
-rw------- 1 postgres postgres 16M Sep 13 19:25 00000001000000290000002E
-rw------- 1 postgres postgres 16M Sep 1 12:36 000000010000002900000067
-rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000068
-rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000069
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006A
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006B
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006C
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006D
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006E
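
For comparison, pg_controldata should show where the server currently thinks
the latest checkpoint is (a sketch; output not included here):

# prints "Latest checkpoint location" and "Latest checkpoint's REDO location",
# which can be compared against the page LSNs and the segment names above
pg_controldata /var/lib/postgresql/data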

During the data load it was warning about checkpoints occurring too
frequently, but I do hope that's mostly a performance issue, not a
stability one?
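
If that warning is just about performance, the usual knob would presumably be
along these lines in postgresql.conf (illustrative values only, assuming an
8.0-era server; not what this machine actually runs):

# defaults in this era are checkpoint_segments = 3 and checkpoint_warning = 30s
checkpoint_segments = 16     # allow more WAL between forced checkpoints
checkpoint_warning = 30s     # warn when checkpoints come closer than this
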
Best regards,
Arjen van der Meijden