Re: [BUGS] BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues
On 02/22/2013 11:55 PM, j...@pgexperts.com wrote:

> While doing acceptance testing on a new Ubuntu 12.04 PostgreSQL server
> running 9.2.3, we set checkpoint_segments = 128,
> checkpoint_completion_target = 0.9 and placed pg_xlog on a separate 20G
> partition. Also, archive_mode = off on this system.
>
> According to the docs, you would expect the system to attempt to keep the
> WAL files down close to 3 * checkpoint_segments + 1. Unfortunately, this
> does not appear to be the case, because a pgbench run would run the pg_xlog
> partition out of space.
> [...]

Hello Jeff,

We reported this back in 2011, but we never reached a conclusion:
http://www.postgresql.org/message-id/4de89072.7070...@usit.uio.no

In our case, we hit the problem while creating a GIN index.

I think the problem has something to do with checkpoints: the number of WAL
files grows beyond the defined limit because the system cannot finish its
checkpoints. A good first step toward explaining what is happening would be
to identify and describe the situations in which checkpoints can take very
long to complete, or fail altogether.

We are interested in this subject because we have hit this problem a few
times, but we have not been able to identify what triggers it.

regards,
--
Rafael Martinez Guerrero
Center for Information Technology
University of Oslo, Norway
PGP Public Key: http://folk.uio.no/rafael/

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
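As a back-of-the-envelope check on the numbers in the report, the documented ceiling of 3 * checkpoint_segments + 1 works out to roughly 6 GB with these settings, comfortably under the 20G partition. A quick sketch (the 16 MB segment size is the 9.2 default, assumed here):

```shell
#!/bin/sh
# Expected upper bound on pg_xlog size per the 9.2 docs:
# roughly 3 * checkpoint_segments + 1 segment files of 16 MB each.
checkpoint_segments=128
segment_mb=16   # default WAL segment size in 9.2
max_segments=$(( 3 * checkpoint_segments + 1 ))
max_mb=$(( max_segments * segment_mb ))
echo "expected max WAL segments: $max_segments"   # 385
echo "expected max pg_xlog size: ${max_mb} MB"    # 6160 MB, about 6 GB
```

So a 20G pg_xlog partition should have had better than 3x headroom if the documented bound held.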
Re: [BUGS] BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues
On Feb 24, 2013, at 7:16 AM, Rafael Martinez Guerrero wrote:

> We reported this back in 2011, but we did not get to any conclusion:
> http://www.postgresql.org/message-id/4de89072.7070...@usit.uio.no
>
> In our case, we had this problem when creating a GIN index.
>
> I think the problem has something to do with checkpoints, I think the
> number of WAL files will grow beyond the limit defined because the
> system can not be finish with checkpoints. A good start to try to
> explain what it is happening would be to try to identify or describe the
> situations where checkpoints can take very long to complete or fail
> altogether.
>
> We are interested in this subject because we have had this problem a few
> times. But we have not been able to identify the reason that triggers this.

Rafael, did you do a followup post on pgsql-performance, or did the thread
die at the end of the one you posted above?

---
Jeff Frost
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/
Re: [BUGS] BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues
On Feb 24, 2013, at 1:05 PM, Jeff Frost wrote:

> On Feb 24, 2013, at 7:16 AM, Rafael Martinez Guerrero wrote:
>
>> We reported this back in 2011, but we did not get to any conclusion:
>> http://www.postgresql.org/message-id/4de89072.7070...@usit.uio.no
>>
>> In our case, we had this problem when creating a GIN index.
>>
>> I think the problem has something to do with checkpoints, I think the
>> number of WAL files will grow beyond the limit defined because the
>> system can not be finish with checkpoints. A good start to try to
>> explain what it is happening would be to try to identify or describe the
>> situations where checkpoints can take very long to complete or fail
>> altogether.
>>
>> We are interested in this subject because we have had this problem a few
>> times. But we have not been able to identify the reason that triggers this.
>
> Rafael, did you do a followup post on pgsql-performance, or did the thread
> die at the end of the one you post above?

This is how the log_checkpoints output looks during the run:

2013-02-24 21:06:31.156 UTC,,,1624,,51282598.658,114,,2013-02-23 02:12:40 UTC,,0,LOG,0,"checkpoint starting: immediate force wait",""
2013-02-24 21:06:31.216 UTC,,,1624,,51282598.658,115,,2013-02-23 02:12:40 UTC,,0,LOG,0,"checkpoint complete: wrote 108 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 4 recycled; write=0.054 s, sync=0.002 s, total=0.059 s; sync files=10, longest=0.000 s, average=0.000 s",""

So, this first pair ^ is the manual checkpoint issued just before the pgbench run; everything that follows is part of the pgbench run.
2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,116,,2013-02-23 02:12:40 UTC,,0,LOG,0,"checkpoints are occurring too frequently (14 seconds apart)",,"Consider increasing the configuration parameter ""checkpoint_segments"".",,,""
2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,117,,2013-02-23 02:12:40 UTC,,0,LOG,0,"checkpoint starting: xlog",""
2013-02-24 21:08:35.239 UTC,,,1624,,51282598.658,118,,2013-02-23 02:12:40 UTC,,0,LOG,0,"checkpoint complete: wrote 266935 buffers (12.7%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=10.889 s, sync=99.079 s, total=110.011 s; sync files=29, longest=12.888 s, average=3.416 s",""
2013-02-24 21:08:35.241 UTC,,,1624,,51282598.658,119,,2013-02-23 02:12:40 UTC,,0,LOG,0,"checkpoint starting: xlog",""
2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,120,,2013-02-23 02:12:40 UTC,,0,LOG,0,"checkpoint complete: wrote 801517 buffers (38.2%); 0 transaction log file(s) added, 0 removed, 128 recycled; write=15.484 s, sync=148.739 s, total=164.266 s; sync files=32, longest=11.773 s, average=4.648 s",""
2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,121,,2013-02-23 02:12:40 UTC,,0,LOG,0,"checkpoint starting: xlog",""
2013-02-24 21:14:20.486 UTC,,,1624,,51282598.658,122,,2013-02-23 02:12:40 UTC,,0,LOG,0,"checkpoint complete: wrote 1009466 buffers (48.1%); 0 transaction log file(s) added, 140 removed, 257 recycled; write=13.690 s, sync=167.245 s, total=180.980 s; sync files=33, longest=14.442 s, average=5.067 s",""
2013-02-24 21:14:20.487 UTC,,,1624,,51282598.658,123,,2013-02-23 02:12:40 UTC,,0,LOG,0,"checkpoint starting: xlog",""
2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,124,,2013-02-23 02:12:40 UTC,,0,LOG,0,"checkpoint complete: wrote 1055059 buffers (50.3%); 0 transaction log file(s) added, 259 removed, 257 recycled; write=22.731 s, sync=168.239 s, total=191.020 s; sync files=32, longest=13.650 s, average=5.257 s",""
2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,125,,2013-02-23 02:12:40 UTC,,0,LOG,0,"checkpoint starting: xlog",""
2013-02-24 21:23:36.201 UTC,,,1624,,51282598.658,126,,2013-02-23 02:12:40 UTC,,0,LOG,0,"checkpoint complete: wrote 814767 buffers (38.9%); 0 transaction log file(s) added, 546 removed, 0 recycled; write=269.553 s, sync=95.042 s, total=364.693 s; sync files=31, longest=11.211 s, average=3.065 s",""

They're all xlog-triggered. That the first one doesn't remove any WAL files is not surprising, but I am surprised that the second one doesn't remove any either; it does recycle quite a few. I guess the issue is that pgbench can generate WAL files far faster than the server can checkpoint and clean them up. I think I need to write a script that outputs the number of WAL files each minute, so we can better correlate what's going on.
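A minimal sketch of such a script (the PGDATA path and output format here are assumptions, not from the thread): run it once a minute from cron and line the counts up against the checkpoint log timestamps.

```shell
#!/bin/sh
# Print a timestamped count of WAL segment files in pg_xlog.
# Intended to run once a minute (e.g. from cron) so the counts can be
# correlated with the checkpoint log entries above.
# The PGDATA default below is an assumption -- point it at the real
# data directory.
PGDATA="${PGDATA:-/var/lib/postgresql/9.2/main}"
# WAL segment file names are exactly 24 uppercase hex characters,
# which distinguishes them from archive_status/ and other entries.
count=$(ls "$PGDATA/pg_xlog" 2>/dev/null | grep -c '^[0-9A-F]\{24\}$' || true)
echo "$(date -u '+%Y-%m-%d %H:%M:%S') UTC wal_files=$count"
```

With per-minute samples like this, a growing wal_files count during the long sync phases above would confirm that WAL generation is outrunning checkpoint completion.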
Re: [BUGS] BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues
On 02/24/2013 10:05 PM, Jeff Frost wrote:

> On Feb 24, 2013, at 7:16 AM, Rafael Martinez Guerrero
> <r.m.guerr...@usit.uio.no> wrote:
>
>> We reported this back in 2011, but we did not get to any conclusion:
>> http://www.postgresql.org/message-id/4de89072.7070...@usit.uio.no
>>
>> In our case, we had this problem when creating a GIN index.
>>
>> I think the problem has something to do with checkpoints, I think the
>> number of WAL files will grow beyond the limit defined because the
>> system can not be finish with checkpoints. A good start to try to
>> explain what it is happening would be to try to identify or describe the
>> situations where checkpoints can take very long to complete or fail
>> altogether.
>>
>> We are interested in this subject because we have had this problem a few
>> times. But we have not been able to identify the reason that triggers
>> this.
>
> Rafael, did you do a followup post on pgsql-performance, or did the
> thread die at the end of the one you post above?

Jeff, we didn't have time back then to continue investigating this problem.
Our workaround was to increase the size of our pg_xlog partition so we
could finish the export/import job we had to do.

regards,
--
Rafael Martinez Guerrero
Center for Information Technology
University of Oslo, Norway
PGP Public Key: http://folk.uio.no/rafael/