On Wed, Feb 12, 2020 at 4:09 PM Tom Lane <t...@sss.pgh.pa.us> wrote:

> Mladen Marinović <mladen.marino...@kset.org> writes:
> > Recently I am having some strange problems with pg_basebackup. About once a
> > week the backup process ends with an error message like this:
> > 2020-02-11 23:25:40 UTC [25790]: [1-1] user=replicator,db=[unknown] LOG:
> > could not send data to client: Connection reset by peer
>
> Hmmm ....
>
> > The problem started occurring after a hardware (RAM + SSD) upgrade and an
> > OS Upgrade to Ubuntu 18.04. Both the server and backup process run in
> > separate docker containers on the same machine. This happens randomly on
> > multiple servers with the same configuration and it is probably not
> > hardware related. Also, this happens evenly on 9.4 and 9.6, and using the
> > same docker images that worked flawlessly on the previous installation.
> > I have been investigating the issue for at least a month and found no
> > problems in any log or metric before or after the event. I suspect that
> > this is related to some OS/docker parameter that is not well configured.
>
> How long does the backup run before failing? If the connection were going
> between different machines my suspicions would lean toward a network
> timeout. That seems somewhat unlikely in this configuration, but you
> never know.
>
The backup started at 23:00, and it had copied 363 GB by the time the connection was closed. It usually takes about 2 hours for the entire database (approx. 1.1 TB). I was also thinking that the problem could be network related, but the network is a virtual docker bridge network on a single machine, and the backup usually succeeds. If it failed during other operations (as this is a production database) or during every backup, it would be easier to see what the problem could be, but this is really annoyingly random.

> > Would increasing the database log level give me any more info about what
> > caused the connection to close?
>
> Nope, not directly. It might be useful to figure out whether data
> transfer continues full throttle right up until the connection drop,
> or whether it stops sooner (and then there's some sort of timeout
> before the error occurs).

I can see that pg_basebackup has a verbose switch, but I am not sure it will report the stuff you mention. On the database, the log levels currently are:

client_min_messages = notice
log_min_messages = warning
log_min_error_statement = error

I assume that I should change the first two to at least debug1 to see something.

> regards, tom lane

Regards,
Mladen Marinović
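
P.S. One possible way to check whether the transfer runs at full speed right up until the drop would be to sample the size of the backup target directory at a fixed interval and log the rate with a timestamp. The following is only a rough sketch, not something pg_basebackup provides: the function names, the 10-second interval, and the idea of watching a local target directory are all assumptions for illustration.

```python
import os
import time
from datetime import datetime, timezone


def dir_size_bytes(path):
    """Sum the sizes of all files found under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file vanished or was renamed between listing and stat; skip it.
                pass
    return total


def throughput_mb_s(prev_bytes, cur_bytes, seconds):
    """Average transfer rate between two samples, in MB/s."""
    return (cur_bytes - prev_bytes) / seconds / 1e6


def monitor(path, interval=10.0):
    """Print one timestamped line per interval until interrupted.

    Both path and interval are hypothetical parameters for this sketch.
    """
    prev = dir_size_bytes(path)
    while True:
        time.sleep(interval)
        cur = dir_size_bytes(path)
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
        rate = throughput_mb_s(prev, cur, interval)
        print(f"{stamp} total={cur / 1e9:.2f} GB rate={rate:.1f} MB/s", flush=True)
        prev = cur
```

Calling `monitor()` with the backup target directory while the backup runs, and comparing the last lines of its output with the timestamp of the "could not send data to client" message, should show whether the rate dropped to zero some time before the error or stayed high until the end.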