Greetings.

First and foremost, sorry for my bad English.
I have a PostgreSQL 9.0.4 installation using streaming replication: one master and 
ten replicas. I migrated to it about a month ago from 8.3 and Slony. 
The built-in replication is wonderful (slony sucks =)), but I have some performance 
issues. In trying to deal with them I have experimented with lots of different things. One 
of them was raising shared_buffers from 8GB to 12GB (the machines have 24GB in total). About 
three hours after the restart, postgres crashed with these messages:

[2301-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 34691 in file 
"base/17931/407169": read only 160 of 8192 bytes
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2390-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 2242 in file 
"base/17931/18984": read only 160 of 8192 bytes
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT:[some statement]
[2302-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 34691 in file 
"base/17931/407169": read only 160 of 8192 bytes
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2391-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 19926 in file 
"base/17931/686609": Bad address
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2392-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 19926 in file 
"base/17931/686609": Bad address
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2393-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 19926 in file 
"base/17931/686609": Bad address
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2394-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 25578 in file 
"base/17931/686571": Bad address
[2394-2] 2011-06-16 17:40:27 UTC STATEMENT: [some statement]
[4-1] 2011-06-16 17:40:27 UTC LOG:  startup process (PID 15292) was terminated 
by signal 7: Bus error
[5-1] 2011-06-16 17:40:27 UTC LOG:  terminating any other active server 
processes
[858-1] 2011-06-16 17:40:27 UTC WARNING:  terminating connection because of 
crash of another server process
[858-2] 2011-06-16 17:40:27 UTC DETAIL:  The postmaster has commanded this 
server process to roll back the current transaction and exit, because another 
server process exited abnormally and possibly corrupted shared memory.
[2429-1] 2011-06-16 17:40:27 UTC WARNING:  terminating connection because of 
crash of another server process
[858-3] 2011-06-16 17:40:27 UTC HINT:  In a moment you should be able to 
reconnect to the database and repeat your command.

Despite "In a moment you should be able to reconnect to the database and repeat 
your command", postgres did not restart by itself.
Signal 7 (SIGBUS) usually indicates a hardware problem. But all ten replicas crashed within 
ten minutes of each other, say from 13:35 to 13:45. And when shared_buffers was set to 8GB 
I never experienced such trouble.
I checked /proc/sys/kernel/shmmax - 16GB, and /proc/sys/kernel/shmall - 4194304.
I tried setting vm.overcommit_memory=2 and vm.overcommit_ratio=90 - it made no difference. 
Then I tried vm.overcommit_memory=1 - again no difference.
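As a sanity check, those kernel limits do look large enough for a 12GB segment. A small shell sketch, using the values above and assuming the usual 4 KiB page size (shmall is counted in pages, shmmax in bytes):

```shell
# Values taken from /proc/sys/kernel/ above; page size assumed to be
# 4 KiB (verify with: getconf PAGE_SIZE).
shmall_pages=4194304
page_size=4096
shmmax_bytes=$((16 * 1024 * 1024 * 1024))   # 16GB

shmall_bytes=$((shmall_pages * page_size))
echo "shmall limit: $((shmall_bytes / 1073741824)) GB"   # prints 16 GB

# The ~12GB shared_buffers segment (plus some overhead) must fit under both limits.
requested=$((12 * 1024 * 1024 * 1024))
[ "$requested" -le "$shmmax_bytes" ] && [ "$requested" -le "$shmall_bytes" ] \
  && echo "12GB fits within shmmax and shmall"
```

So both limits allow the segment, which is why I don't think the sysctl settings themselves are the cause.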
Here are vmstat logs (vmstat 10 > vmstat.log):
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 5308644      0      0    0    0   508   133    0  743  2  1 97  0
 0  0      0 5279436      0      0    0    0   620   307    0 1016  3  0 97  0
 1  0      0 5266384      0      0    0    0   242   223    0 1104  8  1 92  0
 1  0      0 5260544      0      0    0    0   359   287    0  959  2  0 98  0
 0  0      0 5245764      0      0    0    0   877   311    0  891  3  0 97  0
 0  0      0 24378380      0      0    0    0   454   314    0 1011  2  7 91  0
 0  0      0 24378380      0      0    0    0     0    69    0  118  0  0 100  0
 0  0      0 24378380      0      0    0    0     0     2    0   99  0  0 100  0
 0  0      0 24378380      0      0    0    0     0     1    0  180  0  0 100  0
 0  0      0 24378024      0      0    0    0     0     1    0  116  0  0 100  0
 0  0      0 24378024      0      0    0    0     0   342    0  126  0  0 100  0

So you can see that postgres dies while there are still about 5GB of free memory. In 
one of my experiments postgres ran for about 30 minutes with 0KB of free 
memory. So I don't think the problem is lack of memory.
I also thought about bad indexes and reindexing. In messages like "could not read block 
XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes", YYYYY is my 
database and ZZZZZZ somehow relates to a table or an index - it may be the 
relfilenode of an index or the OID of a table. The affected indexes and tables are of 
all kinds; there is no regularity. I haven't tried reindexing: after all ten 
replicas went down I ran my experiments on only one of them, while 
reindexing would affect the whole installation, which is in production and under high 
load.
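For what it's worth, my plan for identifying what a given ZZZZZZ refers to was a catalog lookup along these lines (a sketch only; 407169 is just one of the filenodes from the log above):

```sql
-- Run while connected to the database whose OID is YYYYY (17931 here).
-- The on-disk file name is the relation's relfilenode; it equals the
-- relation's OID unless the relation has been rewritten (e.g. by
-- CLUSTER or VACUUM FULL), so check both columns.
SELECT oid, relname, relkind
FROM pg_class
WHERE relfilenode = 407169 OR oid = 407169;
```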
One important thing - the master and all replicas are running on OpenVZ, Ubuntu 
lucid, kernel 2.6.32, and there is no way to avoid virtualization (it is a long story 
=))

Please, I do not want to discuss my decision to set shared_buffers to 12GB, or 
PostgreSQL optimization in general. I just want to understand why I am getting these 
errors.
I would appreciate any help - thank you in advance.

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
