On Wed, Feb 12, 2020 at 4:09 PM Tom Lane <t...@sss.pgh.pa.us> wrote:

> Mladen Marinović <mladen.marino...@kset.org> writes:
> > Recently I have been having some strange problems with pg_basebackup.
> > About once a week the backup process ends with an error message like this:
> > 2020-02-11 23:25:40 UTC [25790]: [1-1] user=replicator,db=[unknown] LOG:  could not send data to client: Connection reset by peer
>
> Hmmm ....
>
> > The problem started occurring after a hardware (RAM + SSD) upgrade and an
> > OS upgrade to Ubuntu 18.04. Both the server and the backup process run in
> > separate Docker containers on the same machine. This happens randomly on
> > multiple servers with the same configuration, and it is probably not
> > hardware related. It also happens equally on 9.4 and 9.6, using the
> > same Docker images that worked flawlessly on the previous installation.
> > I have been investigating the issue for at least a month and have found no
> > problems in any log or metric before or after the event. I suspect that
> > some OS or Docker parameter is misconfigured.
>
> How long does the backup run before failing?  If the connection were going
> between different machines my suspicions would lean toward a network
> timeout.  That seems somewhat unlikely in this configuration, but you
> never know.
>

The backup started at 23:00 and had copied 363 GB by the time the connection
was closed. The entire database (approx. 1.1 TB) usually takes about 2 hours.
I also thought the problem could be network related, but the network is a
virtual Docker bridge network on a single machine, and the backup usually
succeeds. If it failed during other operations (this is a production
database) or during every backup, it would be easier to pin down the cause,
but it is annoyingly random.


>
> > Would increasing the database log level give me any more info about what
> > caused the connection to close?
>
> Nope, not directly.  It might be useful to figure out whether data
> transfer continues full throttle right up until the connection drop,
> or whether it stops sooner (and then there's some sort of timeout
> before the error occurs).
>
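One way to answer whether the transfer runs at full speed right up to the drop, or stalls first, is to sample the size of the pg_basebackup target directory at a fixed interval and look for flat stretches just before the failure. Below is a minimal Python sketch, assuming a plain- or tar-format pg_basebackup writing to a local directory; `dir_size_bytes`, `sample_sizes` and `find_stalls` are hypothetical helpers, not part of PostgreSQL or pg_basebackup.

```python
import os
import time
from typing import List, Optional, Tuple


def dir_size_bytes(path: str) -> int:
    """Total size of regular files under path; symlinks are skipped.

    A missing path simply yields 0, because os.walk produces nothing for it.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            try:
                if not os.path.islink(fp):
                    total += os.path.getsize(fp)
            except OSError:
                # File vanished between listing and stat; ignore it.
                pass
    return total


def sample_sizes(path: str, interval_s: float, count: int) -> List[Tuple[float, int]]:
    """Record (unix_time, bytes_on_disk) every interval_s seconds, count times."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), dir_size_bytes(path)))
        time.sleep(interval_s)
    return samples


def find_stalls(samples: List[Tuple[float, int]],
                min_stall_s: float) -> List[Tuple[float, float]]:
    """Return (start, end) intervals where the byte count did not grow
    for at least min_stall_s seconds."""
    stalls = []
    stall_start: Optional[float] = None
    stall_end = 0.0
    for (t_prev, b_prev), (t_cur, b_cur) in zip(samples, samples[1:]):
        if b_cur <= b_prev:
            # No growth between these two samples: open or extend a stall.
            if stall_start is None:
                stall_start = t_prev
            stall_end = t_cur
        else:
            # Growth resumed: keep the stall only if it was long enough.
            if stall_start is not None and stall_end - stall_start >= min_stall_s:
                stalls.append((stall_start, stall_end))
            stall_start = None
    # A stall still open at the last sample (e.g. right before the error).
    if stall_start is not None and stall_end - stall_start >= min_stall_s:
        stalls.append((stall_start, stall_end))
    return stalls
```

For example, sampling every 10 seconds during a backup run and calling `find_stalls(samples, 60)` afterwards would report any minute-plus pause. If a stall ends at roughly the timestamp of the "Connection reset by peer" log line, that points toward a timeout; no stall before the error suggests the connection dropped while data was still flowing.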

I can see that pg_basebackup has a verbose switch, but I am not sure it
will report the information you mention. On the database, the log levels
currently are:
client_min_messages = notice
log_min_messages = warning
log_min_error_statement = error

I assume I should raise the first two to at least debug1 to see
anything useful.


>                         regards, tom lane
>

Regards,
Mladen Marinović
