On Fri, Jan 25, 2019 at 03:26:38PM +0100, Nick B wrote:
> On server we see this error firing: "terminating walsender process due to
> replication timeout"
> The problem occurs during a network or file system acting very slow. One
> example of such case looks like this (strace output for fsync calls):
>
> 0.033383 fsync(8) = 0 <20.265275>
> 20.265399 fsync(8) = 0 <0.000011>
> 0.022892 fsync(7) = 0 <48.350566>
> 48.350654 fsync(7) = 0 <0.000005>
> 0.000674 fsync(8) = 0 <0.851536>
> 0.851619 fsync(8) = 0 <0.000007>
> 0.000067 fsync(7) = 0 <0.000006>
> 0.000045 fsync(7) = 0 <0.000005>
> 0.031733 fsync(8) = 0 <0.826957>
> 0.827869 fsync(8) = 0 <0.000016>
> 0.005344 fsync(7) = 0 <1.437103>
> 1.446450 fsync(6) = 0 <0.063148>
> 0.063246 fsync(6) = 0 <0.000006>
> 0.000381 +++ exited with 1 +++
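As a rough aid for reading a trace like the one above, here is a minimal sketch that groups the slow fsync calls by file descriptor. It assumes the output came from something like `strace -r -T` (relative timestamp first, time spent in the call in angle brackets); the function name `slow_fsyncs` and the 1-second threshold are made up for illustration. Mapping a descriptor back to a file still has to be done separately, for example with `ls -l /proc/<pid>/fd` or by running strace with `-y`.

```python
import re
from collections import defaultdict

# Matches lines such as "0.033383 fsync(8) = 0 <20.265275>", optionally
# prefixed with "> " as in quoted mail. Groups: fd, seconds spent in the call.
FSYNC_RE = re.compile(r"^[>\s]*[\d.]+\s+fsync\((\d+)\)\s+=\s+-?\d+\s+<([\d.]+)>")


def slow_fsyncs(lines, threshold=1.0):
    """Return {fd: [durations]} for fsync calls lasting >= threshold seconds."""
    slow = defaultdict(list)
    for line in lines:
        match = FSYNC_RE.match(line)
        if match is None:
            continue  # not an fsync line (e.g. "+++ exited with 1 +++")
        fd, duration = int(match.group(1)), float(match.group(2))
        if duration >= threshold:
            slow[fd].append(duration)
    return dict(slow)


# A few lines copied from the trace above.
sample = [
    "0.033383 fsync(8) = 0 <20.265275>",
    "20.265399 fsync(8) = 0 <0.000011>",
    "0.022892 fsync(7) = 0 <48.350566>",
    "0.005344 fsync(7) = 0 <1.437103>",
    "1.446450 fsync(6) = 0 <0.063148>",
    "0.000381 +++ exited with 1 +++",
]

for fd, durations in slow_fsyncs(sample).items():
    print(f"fd {fd}: {len(durations)} slow call(s), worst {max(durations):.3f}s")
```

On the sample lines this flags descriptors 8 and 7 and leaves 6 alone, which is the kind of split that would help tell which files are behind the stalls.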
These are a bit irregular.  Which files are taking that long to
complete while others are way faster?  It may be something that we
could improve on the base backup side, as there is no actual point in
syncing segments while the backup is running and we could delay that
to the end of the backup (if I recall that stuff correctly).

> This begs a question, why is the GUC handled the way it is? What would be
> the correct way to solve this? Shall we change the behaviour or do a better
> job explaining what are implications of wal_sender_timeout in the
> docs?

The following commit and thread are the ones you are looking for here:
https://www.postgresql.org/message-id/506972b9.6060...@vmware.com

commit: 6f60fdd7015b032bf49273c99f80913d57eac284
committer: Heikki Linnakangas <heikki.linnakan...@iki.fi>
date: Thu, 11 Oct 2012 17:48:08 +0300
Improve replication connection timeouts.

Rename replication_timeout to wal_sender_timeout, and add a new setting
called wal_receiver_timeout that does the same at the walreceiver side.
There was previously no timeout in walreceiver, so if the network went down,
for example, the walreceiver could take a long time to notice that the
connection was lost. Now with the two settings, both sides of a replication
connection will detect a broken connection similarly.

It is no longer necessary to manually set wal_receiver_status_interval to
a value smaller than the timeout. Both wal sender and receiver now
automatically send a "ping" message if more than 1/2 of the configured
timeout has elapsed, and it hasn't received any messages from the other end.

The docs could be improved to describe that better.
--
Michael