On Mon, Sep 20, 2021 at 9:43 PM Fabrice Chapuis <fabrice636...@gmail.com> wrote:
>
> By passing the autovacuum parameter to off the problem did not occur right
> after loading the table as in our previous tests. However, the timeout
> occurred later. We have seen the accumulation of .snap files for several Gb.
>
> ...
> -rw-------. 1 postgres postgres 16791226 Sep 20 15:26 xid-1238444701-lsn-2D2B-F5000000.snap
> -rw-------. 1 postgres postgres 16973268 Sep 20 15:26 xid-1238444701-lsn-2D2B-F6000000.snap
> -rw-------. 1 postgres postgres 16790984 Sep 20 15:26 xid-1238444701-lsn-2D2B-F7000000.snap
> -rw-------. 1 postgres postgres 16988112 Sep 20 15:26 xid-1238444701-lsn-2D2B-F8000000.snap
> -rw-------. 1 postgres postgres 16864593 Sep 20 15:26 xid-1238444701-lsn-2D2B-F9000000.snap
> -rw-------. 1 postgres postgres 16902167 Sep 20 15:26 xid-1238444701-lsn-2D2B-FA000000.snap
> -rw-------. 1 postgres postgres 16914638 Sep 20 15:26 xid-1238444701-lsn-2D2B-FB000000.snap
> -rw-------. 1 postgres postgres 16782471 Sep 20 15:26 xid-1238444701-lsn-2D2B-FC000000.snap
> -rw-------. 1 postgres postgres 16963667 Sep 20 15:27 xid-1238444701-lsn-2D2B-FD000000.snap
> ...
>

Okay, but I am still not sure why the publisher is not sending keepalive
messages while it is spilling such a big transaction. In WalSndLoop(), each
time after sending data, we check whether we need to send a keepalive message
via the function WalSndKeepaliveIfNecessary(). To debug this problem further,
I think you need to add some logs in WalSndKeepaliveIfNecessary() to see why
it is not sending keepalive messages while all these files are being created.

Did you change the default value of wal_sender_timeout/wal_receiver_timeout?
What are the values of those variables in your environment? Did you see the
message "terminating walsender process due to replication timeout" in your
server logs?

--
With Regards,
Amit Kapila.
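
P.S. As a rough illustration of what such logging would need to capture, here
is a small standalone sketch of a keepalive check. It is not the code from
walsender.c: the struct KeepaliveState, its fields, and the functions
keepalive_due() and keepalive_selfcheck() are hypothetical names made up for
this example, and the "half of the timeout since the last reply" threshold is
an assumption that should be verified against the source of your server
version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the state a keepalive check looks at.
 * All times are milliseconds on one monotonic clock.
 */
typedef struct KeepaliveState
{
    int     timeout_ms;         /* stands in for wal_sender_timeout; <= 0 means disabled */
    int64_t last_reply_ms;      /* when the receiver last replied; <= 0 means no reply yet */
    int64_t last_processing_ms; /* when the send loop last sampled the clock */
    bool    waiting_for_ping;   /* a keepalive asking for a reply is already outstanding */
} KeepaliveState;

/*
 * Return true if this model would send a keepalive now.
 * ASSUMPTION: the threshold is half of the timeout after the last reply.
 */
bool
keepalive_due(const KeepaliveState *st)
{
    /* Timeouts disabled, or no reply seen yet: never send. */
    if (st->timeout_ms <= 0 || st->last_reply_ms <= 0)
        return false;

    /* A previous keepalive is still unanswered: do not send another. */
    if (st->waiting_for_ping)
        return false;

    return st->last_processing_ms >= st->last_reply_ms + st->timeout_ms / 2;
}

/* Small usage example: a 60 s timeout with the last reply seen at t = 1 s. */
void
keepalive_selfcheck(void)
{
    KeepaliveState st = {60000, 1000, 20000, false};

    assert(!keepalive_due(&st));    /* only 19 s since the last reply */
    st.last_processing_ms = 31000;
    assert(keepalive_due(&st));     /* 30 s (half the timeout) have passed */
}
```

In this model there are three ways the check can stay silent: the timeout is
disabled, an earlier keepalive is still unanswered, or the sampled time never
advances past the threshold. Logging the corresponding values at the point of
the real check should show which case, if any, applies.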