Thank you for your thorough explanation.

On Thu, May 19, 2022 at 5:47 PM Laurenz Albe <laurenz.a...@cybertec.at> wrote:
> On Thu, 2022-05-19 at 15:43 +0200, Koen De Groote wrote:
> > On Thu, May 19, 2022 at 9:10 AM Laurenz Albe <laurenz.a...@cybertec.at> wrote:
> > > On Wed, 2022-05-18 at 22:51 +0200, Koen De Groote wrote:
> > > > When connection is gone or blocked, archive_command fails after the timeout specified
> > > > by the NFS mount, as expected. (for a soft mount. hard mount hangs, as expected)
> > > >
> > > > However, on restoring connection, it's not clear to me how long it takes before the command is retried.
> > > >
> > > > Experience says "a few minutes", but I can't find documentation on an exact algorithm.
> > > >
> > > > To be clear, the question is: if archive_command fails, what are the specifics of retrying?
> > > > Is there a timeout? How is that timeout defined?
> > > >
> > > > Is this detailed somewhere? Perhaps in the source code? I couldn't find it in the documentation.
> > > >
> > > > For detail, I'm using postgres 11, running on Ubuntu 20.
> > >
> > > You can find the details in "src/backend/postmaster/pgarch.c".
> > >
> > > The archiver will try to archive three times (NUM_ARCHIVE_RETRIES) in an interval
> > > of one second, then back off until it receives a signal, PostgreSQL shuts down
> > > or a minute has passed.
> >
> > Thanks for the reply. That would mean the source code is here:
> > https://github.com/postgres/postgres/blob/REL_11_0/src/backend/postmaster/pgarch.c
>
> For release 11.0, yes.
>
> > Just to be sure, the "signal" you speak of, this is the result of the command executed by archive_command?
>
> No, that is an operating system signal.
> PostgreSQL processes communicate by sending signals to each other, and if anybody
> wakes up the archiver, it will try again.
>
> > If my understanding of the code is right, if no SIGTERM or other signal arrives, it won't ever happen
> > that a walarchive is skipped if the archive_command fails too many times or takes too long? It
> > will simply check again every 60 seconds (PGARCH_AUTOWAKE_INTERVAL)? Or is the 60 seconds the point
> > where it stops trying, waiting for the next time archive_command is invoked?
>
> Even if a signal arrives, PostgreSQL will keep trying to archive that same WAL segment
> that failed until it is done.
>
> This is a potential sequence of events:
>
>   try to archive -> fail
>   sleep 1 second
>   try to archive -> fail
>   sleep 1 second
>   try to archive -> fail
>   sleep 60 seconds
>   try to archive -> fail
>   sleep 1 second
>   try to archive -> fail
>   sleep 1 second
>   try to archive -> fail
>   sleep 60 seconds -> get woken up by a signal after 30 seconds
>   try to archive -> fail
>   sleep 1 second
>   try to archive -> fail
>   get shutdown request -> exit
>
> When PostgreSQL restarts, it will continue trying to archive the same segment.
>
> > I'm assuming that as long as the file is still in the pg_wal directory and as long as there is no
> > ".done" file for that walarchive under pg_wal/archive_status, it will keep trying forever (or until
> > someone forcefully switches the timeline with for instance a basebackup)?
>
> Yes, it will keep trying, and a timeline switch won't change that.
>
> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
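
For anyone finding this thread later, the retry schedule described above can be sketched as a small simulation. This is only a rough model in Python, not the actual C code from pgarch.c: the function name simulate_archiver and the constant RETRY_SLEEP are made up for illustration, the time spent inside archive_command itself is ignored, and early wake-ups by a signal are not modelled. Only NUM_ARCHIVE_RETRIES and PGARCH_AUTOWAKE_INTERVAL are names taken from the thread.

```python
# Rough model of the archiver retry schedule described in this thread.
# Assumptions: archive_command takes zero time, no signal wakes the
# archiver early, and only one WAL segment is being archived.

NUM_ARCHIVE_RETRIES = 3        # attempts per cycle (name from the thread)
PGARCH_AUTOWAKE_INTERVAL = 60  # seconds of back-off (name from the thread)
RETRY_SLEEP = 1                # hypothetical name: pause between attempts


def simulate_archiver(outcomes):
    """Replay a list of archive_command results (True = success).

    Returns (events, clock): events is a list of (second, description)
    tuples, clock is the simulated time when the replay stopped.
    """
    events = []
    clock = 0
    failures = 0
    for ok in outcomes:
        if ok:
            events.append((clock, "archived"))
            return events, clock
        events.append((clock, "failed"))
        failures += 1
        if failures >= NUM_ARCHIVE_RETRIES:
            # Give up on this cycle and back off for the long interval.
            clock += PGARCH_AUTOWAKE_INTERVAL
            failures = 0
        else:
            # Short pause before the next attempt within the same cycle.
            clock += RETRY_SLEEP
    # Still failing: the same segment would simply be retried again.
    return events, clock


if __name__ == "__main__":
    events, total = simulate_archiver([False] * 6 + [True])
    for second, what in events:
        print(f"t={second:>3}s  {what}")
```

Under these assumptions, a segment that fails six times and then succeeds is attempted at 0, 1, 2, 62, 63 and 64 seconds and archived at 124 seconds. In practice each failed attempt against a soft NFS mount also costs the mount timeout, so the real gaps are longer, which would fit the "a few minutes" observed.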