Re: [BUGS] BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues

2013-02-24 Thread Rafael Martinez Guerrero
On 02/22/2013 11:55 PM, j...@pgexperts.com wrote:
> 
> While doing acceptance testing on a new Ubuntu 12.04 PostgreSQL server
> running 9.2.3, we set checkpoint_segments = 128,
> checkpoint_completion_target = 0.9 and placed pg_xlog on a separate 20G
> partition. Also, archive_mode = off on this system.
> 
> According to the docs, you would expect the system to attempt to keep the
> WAL files down close to 3 * checkpoint_segments + 1.  Unfortunately, this
> does not appear to be the case because a pgbench run would run the pg_xlog
> partition out of space.
> 
[]
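
For reference: with the default 16 MB WAL segment size, the expectation quoted
above works out to roughly 385 files, about 6 GB, well below a 20G partition.
A back-of-the-envelope sketch (assuming the default segment size, which the
report does not state):

    # Rough estimate of the expected steady-state pg_xlog footprint.
    checkpoint_segments = 128            # from the report above
    segment_size_mb = 16                 # assumption: default WAL segment size
    expected_files = 3 * checkpoint_segments + 1
    print(expected_files)                              # 385 files
    print(expected_files * segment_size_mb / 1024.0)   # ~6 GB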

Hello Jeff

We reported this back in 2011, but the discussion never reached a conclusion:
http://www.postgresql.org/message-id/4de89072.7070...@usit.uio.no

In our case, we had this problem when creating a GIN index.

I think the problem has something to do with checkpoints: the number of
WAL files grows beyond the defined limit because the system cannot
finish its checkpoints fast enough. A good start toward explaining what
is happening would be to identify and describe the situations where
checkpoints take very long to complete, or fail altogether.

We are interested in this subject because we have had this problem a few
times, but we have not been able to identify what triggers it.
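
For what it is worth, a minimal sketch of how one could start watching
checkpoint activity over time (this assumes Python with psycopg2, a 9.2
server where pg_stat_bgwriter has the checkpoint_write_time and
checkpoint_sync_time columns, and a local connection; adjust the DSN as
needed):

    # Sample checkpoint counters from pg_stat_bgwriter once a minute so
    # slow or back-to-back checkpoints can be correlated with WAL growth.
    import time
    import psycopg2

    conn = psycopg2.connect("dbname=postgres")   # assumption: local connection works
    conn.autocommit = True
    cur = conn.cursor()
    while True:
        cur.execute("""
            SELECT now(), checkpoints_timed, checkpoints_req,
                   checkpoint_write_time, checkpoint_sync_time
            FROM pg_stat_bgwriter
        """)
        print(cur.fetchone())
        time.sleep(60)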

regards,
-- 
Rafael Martinez Guerrero
Center for Information Technology
University of Oslo, Norway

PGP Public Key: http://folk.uio.no/rafael/




Re: [BUGS] BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues

2013-02-24 Thread Jeff Frost

On Feb 24, 2013, at 7:16 AM, Rafael Martinez Guerrero wrote:

> We reported this back in 2011, but the discussion never reached a conclusion:
> http://www.postgresql.org/message-id/4de89072.7070...@usit.uio.no
> 
> In our case, we had this problem when creating a GIN index.
> 
> I think the problem has something to do with checkpoints: the number of
> WAL files grows beyond the defined limit because the system cannot
> finish its checkpoints fast enough. A good start toward explaining what
> is happening would be to identify and describe the situations where
> checkpoints take very long to complete, or fail altogether.
> 
> We are interested in this subject because we have had this problem a few
> times, but we have not been able to identify what triggers it.

Rafael, did you do a follow-up post on pgsql-performance, or did the thread die 
at the end of the one you posted above?


---
Jeff Frost 
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/ 


Re: [BUGS] BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues

2013-02-24 Thread Jeff Frost

On Feb 24, 2013, at 1:05 PM, Jeff Frost wrote:

> 
> On Feb 24, 2013, at 7:16 AM, Rafael Martinez Guerrero wrote:
> 
>> We reported this back in 2011, but the discussion never reached a conclusion:
>> http://www.postgresql.org/message-id/4de89072.7070...@usit.uio.no
>> 
>> In our case, we had this problem when creating a GIN index.
>> 
>> I think the problem has something to do with checkpoints: the number of
>> WAL files grows beyond the defined limit because the system cannot
>> finish its checkpoints fast enough. A good start toward explaining what
>> is happening would be to identify and describe the situations where
>> checkpoints take very long to complete, or fail altogether.
>> 
>> We are interested in this subject because we have had this problem a few
>> times, but we have not been able to identify what triggers it.
> 
> Rafael, did you do a follow-up post on pgsql-performance, or did the thread 
> die at the end of the one you posted above?

This is how the log_checkpoints output looks during the run:

2013-02-24 21:06:31.156 UTC,,,1624,,51282598.658,114,,2013-02-23 02:12:40 
UTC,,0,LOG,0,"checkpoint starting: immediate force wait",""
2013-02-24 21:06:31.216 UTC,,,1624,,51282598.658,115,,2013-02-23 02:12:40 
UTC,,0,LOG,0,"checkpoint complete: wrote 108 buffers (0.0%); 0 transaction 
log file(s) added, 0 removed, 4 recycled; write=0.054 s, sync=0.002 s, 
total=0.059 s; sync files=10, longest=0.000 s, average=0.000 s",""

So, this ^ is the manual checkpoint issued prior to the pgbench run; everything 
below is part of the pgbench run.

2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,116,,2013-02-23 02:12:40 
UTC,,0,LOG,0,"checkpoints are occurring too frequently (14 seconds 
apart)",,"Consider increasing the configuration parameter 
""checkpoint_segments"".",,,""
2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,117,,2013-02-23 02:12:40 
UTC,,0,LOG,0,"checkpoint starting: xlog",""
2013-02-24 21:08:35.239 UTC,,,1624,,51282598.658,118,,2013-02-23 02:12:40 
UTC,,0,LOG,0,"checkpoint complete: wrote 266935 buffers (12.7%); 0 
transaction log file(s) added, 0 removed, 1 recycled; write=10.889 s, 
sync=99.079 s, total=110.011 s; sync files=29, longest=12.888 s, average=3.416 
s",""
2013-02-24 21:08:35.241 UTC,,,1624,,51282598.658,119,,2013-02-23 02:12:40 
UTC,,0,LOG,0,"checkpoint starting: xlog",""
2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,120,,2013-02-23 02:12:40 
UTC,,0,LOG,0,"checkpoint complete: wrote 801517 buffers (38.2%); 0 
transaction log file(s) added, 0 removed, 128 recycled; write=15.484 s, 
sync=148.739 s, total=164.266 s; sync files=32, longest=11.773 s, average=4.648 
s",""
2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,121,,2013-02-23 02:12:40 
UTC,,0,LOG,0,"checkpoint starting: xlog",""
2013-02-24 21:14:20.486 UTC,,,1624,,51282598.658,122,,2013-02-23 02:12:40 
UTC,,0,LOG,0,"checkpoint complete: wrote 1009466 buffers (48.1%); 0 
transaction log file(s) added, 140 removed, 257 recycled; write=13.690 s, 
sync=167.245 s, total=180.980 s; sync files=33, longest=14.442 s, average=5.067 
s",""
2013-02-24 21:14:20.487 UTC,,,1624,,51282598.658,123,,2013-02-23 02:12:40 
UTC,,0,LOG,0,"checkpoint starting: xlog",""
2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,124,,2013-02-23 02:12:40 
UTC,,0,LOG,0,"checkpoint complete: wrote 1055059 buffers (50.3%); 0 
transaction log file(s) added, 259 removed, 257 recycled; write=22.731 s, 
sync=168.239 s, total=191.020 s; sync files=32, longest=13.650 s, average=5.257 
s",""
2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,125,,2013-02-23 02:12:40 
UTC,,0,LOG,0,"checkpoint starting: xlog",""
2013-02-24 21:23:36.201 UTC,,,1624,,51282598.658,126,,2013-02-23 02:12:40 
UTC,,0,LOG,0,"checkpoint complete: wrote 814767 buffers (38.9%); 0 
transaction log file(s) added, 546 removed, 0 recycled; write=269.553 s, 
sync=95.042 s, total=364.693 s; sync files=31, longest=11.211 s, average=3.065 
s",""

They're all xlog-triggered. The first one doesn't remove any files, which isn't 
surprising, but I'm surprised that the second one doesn't remove any either, 
though it does recycle quite a few. I guess the issue is that pgbench can 
generate WAL files far faster than the system can checkpoint and clean them up. 
I think I need to write a script that outputs the number of WAL files each 
minute so we can better correlate what's going on.
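
A rough sketch of such a script (the pg_xlog path below is an assumption;
point it at the actual data directory, and adjust the interval as needed):

    # Log the number of WAL segment files in pg_xlog once a minute.
    import os
    import time
    from datetime import datetime

    PG_XLOG = "/var/lib/postgresql/9.2/main/pg_xlog"   # assumption: adjust to your cluster

    while True:
        # Count only the 24-character hex segment names, skipping archive_status etc.
        n = sum(1 for f in os.listdir(PG_XLOG)
                if len(f) == 24 and all(c in "0123456789ABCDEF" for c in f))
        print("%s  %d WAL files" % (datetime.utcnow().isoformat(), n))
        time.sleep(60)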

Re: [BUGS] BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues

2013-02-24 Thread Rafael Martinez Guerrero
On 02/24/2013 10:05 PM, Jeff Frost wrote:
> 
> On Feb 24, 2013, at 7:16 AM, Rafael Martinez Guerrero <r.m.guerr...@usit.uio.no> wrote:
> 
>> We reported this back in 2011, but the discussion never reached a conclusion:
>> http://www.postgresql.org/message-id/4de89072.7070...@usit.uio.no
>>
>> In our case, we had this problem when creating a GIN index.
>>
>> I think the problem has something to do with checkpoints: the number of
>> WAL files grows beyond the defined limit because the system cannot
>> finish its checkpoints fast enough. A good start toward explaining what
>> is happening would be to identify and describe the situations where
>> checkpoints take very long to complete, or fail altogether.
>>
>> We are interested in this subject because we have had this problem a few
>> times, but we have not been able to identify what triggers it.
> 
> Rafael, did you do a follow-up post on pgsql-performance, or did the
> thread die at the end of the one you posted above?
> 

Jeff, we didn't have time to continue investigating the problem at the
time. Our workaround was to increase the size of our pg_xlog partition
so we could finish the export/import job we had to do.

regards,
-- 
Rafael Martinez Guerrero
Center for Information Technology
University of Oslo, Norway

PGP Public Key: http://folk.uio.no/rafael/

