On 13-9-2005 20:04, Tom Lane wrote:
> Arjen van der Meijden <[EMAIL PROTECTED]> writes:
>> On 13-9-2005 16:25, Tom Lane wrote:
>> Well, it's an index, not a table. It was the index:
>> "pg_class_relname_nsp_index" on pg_class(relname, relnamespace).
>
> Ah. So you've reindexed pg_class at some point. Reindexing it again
> would likely get you out of this.

Unless reindexing happens as part of some other command, I didn't do that.
The last time 'grep' was able to find a reference to something being
reindexed was in June; something (maybe me, but I doubt it, since I suppose
I'd also have reindexed the user tables) was reindexing all system tables
back then.

Besides, it's not just that index on pg_class; pg_class itself (and
pg_index) have wrong LSNs as well.
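
For the record, if reindexing does turn out to be the way out, I assume it
would go roughly like this (only a sketch, assuming 8.0-era REINDEX behaviour
and a placeholder database name; not something I have run here):

# reindex the damaged index; if the corruption gets in the way of normal
# operation, this may have to be done from a standalone backend started
# with system indexes disabled (the -P option)
psql -c "REINDEX INDEX pg_class_relname_nsp_index;" mydb
psql -c "REINDEX TABLE pg_class;" mydb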

>> Using pg_filedump I extracted the LSN for block 21 and indeed, that was
>> already 67713428 instead of something below 2E73E53C. It wasn't that
>> block alone though; here are a few LSN lines from it:
>>
>> LSN: logid 41 recoff 0x676f5174 Special 8176 (0x1ff0)
>> LSN: logid 25 recoff 0x3c6c5504 Special 8176 (0x1ff0)
>> LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
>> LSN: logid 41 recoff 0x2ea88190 Special 8176 (0x1ff0)
>> LSN: logid 1 recoff 0x68e2f660 Special 8176 (0x1ff0)
>> LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
>> LSN: logid 1 recoff 0x68e2f6a4 Special 8176 (0x1ff0)
>
> logid is the high-order half of the LSN, so there's nothing wrong with
> those other pages --- it's only the first one you show there that seems
> to be past the current end of WAL.

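For reference, logid/recoff map onto the usual hex LSN notation like this (a
quick sketch, assuming a bash-style printf that accepts 0x constants):

# 41 decimal = 0x29, so the suspicious page's LSN reads as 29/676F5174,
# which can be compared directly against the current end of WAL
printf '%X/%X\n' 41 0x676f5174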

There were 3 blocks out of 40 with an LSN like the first one above in that
index file, i.e. with high-order 41 and a recoff of 0x67[67]something.
In the pg_class file there were 6 blocks, of which 5 had LSNs like those in
that index. And pg_index had 3 blocks, with 1 wrong.
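
In case it is useful, this is roughly how those counts can be reproduced (a
sketch; the database OID and relfilenode are placeholders to be looked up,
not the real values here):

# placeholders: database OID from pg_database, relfilenode from pg_class
f=/var/lib/postgresql/data/base/<dboid>/<relfilenode>
pg_filedump "$f" | grep -c 'LSN:'                  # total blocks in the file
pg_filedump "$f" | grep -c 'logid 41 recoff 0x67'  # blocks with suspicious LSNs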

>> On that day I did some active query-tuning, but a few times it took too
>> long, so I issued immediate shutdowns when the selects took too long.
>> There were no warnings about broken records afterwards in the log,
>> though, so I don't believe anything got damaged afterwards.
>
> I have a feeling something may have gone wrong here, though it's hard to
> say what. If the bogus pages in the other tables all have LSNs close to
> this one then that makes it less likely that this is a random corruption
> event --- what would be more plausible is that end of WAL really was
> that high and somehow the WAL counter got reset back during one of those
> forced restarts.
>
> Can you show us ls -l output for the pg_xlog directory? I'm interested
> to see the file names and mod dates there.

Here you go:
l /var/lib/postgresql/data/pg_xlog/
total 145M
drwx------ 3 postgres postgres 4.0K Sep 1 12:37 .
drwx------ 8 postgres postgres 4.0K Sep 13 20:31 ..
-rw------- 1 postgres postgres 16M Sep 13 19:25 00000001000000290000002E
-rw------- 1 postgres postgres 16M Sep 1 12:36 000000010000002900000067
-rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000068
-rw------- 1 postgres postgres 16M Aug 25 11:40 000000010000002900000069
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006A
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006B
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006C
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006D
-rw------- 1 postgres postgres 16M Aug 25 11:40 00000001000000290000006E
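
For comparison, pg_controldata should show where the server currently thinks
the latest checkpoint is (a sketch; output not included here):

# prints "Latest checkpoint location" and "Latest checkpoint's REDO location",
# which can be compared against the page LSNs and the segment names above
pg_controldata /var/lib/postgresql/data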

During the data load it was warning about checkpoints occurring too
frequently, but I do hope that's mostly a performance issue, not a
stability one?
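
If that warning is just about performance, the usual knob would presumably be
along these lines in postgresql.conf (illustrative values only, assuming an
8.0-era server; not what this machine actually runs):

# defaults in this era are checkpoint_segments = 3 and checkpoint_warning = 30s
checkpoint_segments = 16     # allow more WAL between forced checkpoints
checkpoint_warning = 30s     # warn when checkpoints come closer than this
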
Best regards,
Arjen van der Meijden