pg_basebackup, walreceiver and wal_sender_timeout
Hi, hackers. When running pg_basebackup with -X s with network file system as target we would consistently get "could not receive data from WAL stream: server closed the connection unexpectedly". On server we see this error firing: "terminating walsender process due to replication timeout" The problem occurs during a network or file system acting very slow. One example of such case looks like this (strace output for fsync calls): 0.033383 fsync(8) = 0 <20.265275> 20.265399 fsync(8) = 0 <0.11> 0.022892 fsync(7) = 0 <48.350566> 48.350654 fsync(7) = 0 <0.05> 0.000674 fsync(8) = 0 <0.851536> 0.851619 fsync(8) = 0 <0.07> 0.67 fsync(7) = 0 <0.06> 0.45 fsync(7) = 0 <0.05> 0.031733 fsync(8) = 0 <0.826957> 0.827869 fsync(8) = 0 <0.16> 0.005344 fsync(7) = 0 <1.437103> 1.446450 fsync(6) = 0 <0.063148> 0.063246 fsync(6) = 0 <0.06> 0.000381 +++ exited with 1 +++ So the longest fsync call took 48 seconds, but how would that result in a termination if wal_sender_timeout is (default) 60 seconds? The problem is in the way wal_sender handles this timeout: /* * If half of wal_sender_timeout has lapsed without receiving any reply * from the standby, send a keep-alive message to the standby requesting * an immediate reply. */ Obviously the receiver cannot respond immediately while in a syscall. This begs a question, why is the GUC handled the way it is? What would be the correct way to solve this? Shall we change the behaviour or do a better job explaining what are implications of wal_sender_timeout in the docs? Regards, Nick.
Re: pg_basebackup, walreceiver and wal_sender_timeout
On Sat, Jan 26, 2019 at 4:23 AM Michael Paquier wrote: > These are a bit unregular. Which files are taking that long to > complete while others are way faster? It may be something that we > could improve on the base backup side as there is no actual point in > syncing segments while the backup is running and we could delay that > at the end of the backup (if I recall that stuff correctly). I don't have a good sample for these. One instance of this happening is below: 0.000125 fsync(7) = 0 <0.016677> 0.39 fsync(7) = 0 <0.05> # 2048 writes for total of 16777216 bytes (16MB) 0.000618 write(7, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 <0.21> 0.78 fsync(8) = 0 <57.609720> 57.609830 fsync(8) = 0 <0.07> Again, it is a problem with our network file system that we are still investigating. I'm not sure this can be improved easily, since pg_basebackup shares this code with walreceiver. > The docs could be improved to describe that better.. I will look into that. Regards, Nick.
Re: pg_basebackup, walreceiver and wal_sender_timeout
Greetings, I also would like to thank everyone for looking into this. On Sat, Jan 26, 2019 at 01:45:46PM +0100, Magnus Hagander wrote: > One workaround you could perhaps look at here is to run pg_basebackup > with --no-sync. That way there will be no fsyncs issued while running. You > will then of course have to take care of syncing all the files to disk > after it's done, but a network filesystem might be happier in dealing with > a large "batch-sync" like that rather than piece-by-piece sync. Thanks for the pointer. I actually was not aware of the existence of this flag. I've ran two rounds of tests with --no-sync and backup failed at a much later point in time, which suggests that the bottleneck is in fact the metadata server of ceph. We're now looking into ways of improving this. (This is a 15TB cluster with a few hundred thousands tables which on average generates 4 WAL segments per second, so throttling transfer rate is not a good option either). On Sat, Jan 26, 2019 at 4:23 AM Michael Paquier wrote: > The docs could be improved to describe that better.. I had an off-list discussion of a possible documentation update with Stephen Frost and he voiced an opinion that the behaviour I was trying to describe sounds a lot like a bug and documenting that is not a good practice. Upon further examination of WalSndKeepaliveIfNecessary I found out that the implementation of "requesting an immediate reply" is done by setting the socket into non-blocking mode and issuing a flush. I find it hard to believe there is a scenario where client can react to that keep-alive on time (unless of course I misunderstood something). So the question is, will we ever wait the actual wal_sender_timeout before terminating the connection? Regards, Nick.