pg_basebackup, walreceiver and wal_sender_timeout

2019-01-25 Thread Nick B
Hi, hackers.

When running pg_basebackup with -X s with network file system as target we
would consistently get "could not receive data from WAL stream: server
closed the connection unexpectedly".

On server we see this error firing: "terminating walsender process due to
replication timeout"
The problem occurs during a network or file system acting very slow. One
example of such case looks like this (strace output for fsync calls):

0.033383 fsync(8)  = 0 <20.265275>
20.265399 fsync(8) = 0 <0.11>
0.022892 fsync(7)  = 0 <48.350566>
48.350654 fsync(7) = 0 <0.05>
0.000674 fsync(8)  = 0 <0.851536>
0.851619 fsync(8)  = 0 <0.07>
0.67 fsync(7)  = 0 <0.06>
0.45 fsync(7)  = 0 <0.05>
0.031733 fsync(8)  = 0 <0.826957>
0.827869 fsync(8)  = 0 <0.16>
0.005344 fsync(7)  = 0 <1.437103>
1.446450 fsync(6)  = 0 <0.063148>
0.063246 fsync(6)  = 0 <0.06>
0.000381 +++ exited with 1 +++

So the longest fsync call took 48 seconds, but how would that result in a
termination if wal_sender_timeout is (default) 60 seconds?

The problem is in the way wal_sender handles this timeout:

/*
* If half of wal_sender_timeout has lapsed without receiving any reply
* from the standby, send a keep-alive message to the standby requesting
* an immediate reply.
*/

Obviously the receiver cannot respond immediately while in a syscall.

This begs a question, why is the GUC handled the way it is? What would be
the correct way to solve this? Shall we change the behaviour or do a better
job explaining what are implications of wal_sender_timeout in the docs?

Regards,
Nick.


Re: pg_basebackup, walreceiver and wal_sender_timeout

2019-01-26 Thread Nick B
On Sat, Jan 26, 2019 at 4:23 AM Michael Paquier  wrote:
> These are a bit unregular.  Which files are taking that long to
> complete while others are way faster?  It may be something that we
> could improve on the base backup side as there is no actual point in
> syncing segments while the backup is running and we could delay that
> at the end of the backup (if I recall that stuff correctly).

I don't have a good sample for these. One instance of this happening is below:

0.000125 fsync(7)  = 0 <0.016677>
0.39 fsync(7)  = 0 <0.05>
# 2048 writes for total of 16777216 bytes (16MB)
0.000618 write(7,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192) = 8192 <0.21>
0.78 fsync(8)  = 0 <57.609720>
57.609830 fsync(8)  = 0 <0.07>

Again, it is a problem with our network file system that we are still
investigating. I'm not sure this can be improved easily, since
pg_basebackup shares this code with walreceiver.

> The docs could be improved to describe that better..

I will look into that.

Regards,
Nick.



Re: pg_basebackup, walreceiver and wal_sender_timeout

2019-01-29 Thread Nick B
Greetings,
I also would like to thank everyone for looking into this.

On Sat, Jan 26, 2019 at 01:45:46PM +0100, Magnus Hagander wrote:
> One workaround you could perhaps look at here is to run pg_basebackup
> with --no-sync. That way there will be no fsyncs issued while running. You
> will then of course have to take care of syncing all the files to disk
> after it's done, but a network filesystem might be happier in dealing with
> a large "batch-sync" like that rather than piece-by-piece sync.

Thanks for the pointer. I actually was not aware of the existence of this
flag. I've ran two rounds of tests with --no-sync and backup failed at a
much later point in time, which suggests that the bottleneck is in fact the
metadata server of ceph. We're now looking into ways of improving this.
(This is a 15TB cluster with a few hundred thousands tables which on
average generates 4 WAL segments per second, so throttling transfer rate is
not a good option either).

On Sat, Jan 26, 2019 at 4:23 AM Michael Paquier
 wrote:
> The docs could be improved to describe that better..

I had an off-list discussion of a possible documentation update with
Stephen Frost and he voiced an opinion that the behaviour I was trying to
describe sounds a lot like a bug and documenting that is not a good
practice.

Upon further examination of WalSndKeepaliveIfNecessary I found out that the
implementation of "requesting an immediate reply" is done by setting the
socket into non-blocking mode and issuing a flush. I find it hard to
believe there is a scenario where client can react to that keep-alive on
time (unless of course I misunderstood something). So the question is, will
we ever wait the actual wal_sender_timeout before terminating the
connection?

Regards,
Nick.