On Fri, Jan 15, 2010 at 12:23 AM, Heikki Linnakangas
<heikki.linnakan...@enterprisedb.com> wrote:
> If we don't fix that within the server, we will need to document that
> caveat and every installation will need to work around that one way or
> another. Maybe with some monitoring software and an automatic restart. Ugh.
>
> I wasn't really asking if it's possible to fix, I meant "Let's think
> about *how* to fix that".

OK. How about the following (though it's a rough design)?

(1) If walsender cannot read a WAL file because of ENOENT, it sends a
    special message indicating that error to walreceiver. This message
    is shipped over the COPY protocol.

(2-a) When the message arrives, walreceiver exits by calling proc_exit().

(3-a) If the startup process detects the exit of walreceiver in
      WaitNextXLogAvailable(), it switches back to normal archive
      recovery mode, closes the currently open WAL file, resets some
      variables (readId, readSeg, etc.), and calls FetchRecord() again.
      It then tries to restore the WAL file from the archive if
      restore_command is supplied, and switches to streaming recovery
      mode again if invalid WAL is found.

Or

(2-b) When the message arrives, walreceiver executes restore_command and
      then sets receivedUpto to the end location of the restored WAL
      file. The restored file is expected to be completely filled,
      because it doesn't exist in the primary's pg_xlog, so updating
      receivedUpto this way is OK.

(3-b) After one WAL file is restored, walreceiver tries to connect to
      the primary and starts replication again. If the ENOENT error
      occurs again, we go back to (1).

I like the latter approach since it's simpler. Thoughts?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
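
P.S. To make the control flow of the (b) variant easier to discuss, here is a toy simulation in Python. It is only a sketch and not server code: every name in it (WalFileMissing, stream_from_primary, walreceiver_loop, the sets of segment numbers) is made up for illustration, and the "give up" branch for a segment that is in neither place is my own assumption, since the design above does not say what should happen then. It also assumes the default 16MB WAL segment size.

```python
# Toy simulation of steps (1), (2-b) and (3-b). All names are hypothetical.
SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default WAL segment size in bytes


class WalFileMissing(Exception):
    """Stands in for the special message walsender would send on ENOENT."""


def stream_from_primary(primary_segments, seg):
    """Pretend to connect and request streaming starting at segment `seg`.

    Step (1): if the primary no longer has that segment, report it.
    """
    if seg not in primary_segments:
        raise WalFileMissing(seg)


def walreceiver_loop(primary_segments, archive_segments, start_seg):
    """Return (received_upto, log) once streaming works or we give up."""
    log = []
    seg = start_seg
    received_upto = seg * SEGMENT_SIZE
    while True:
        try:
            stream_from_primary(primary_segments, seg)
        except WalFileMissing:
            if seg not in archive_segments:
                # Assumption: the design does not cover this case.
                log.append(("give up", seg))
                return received_upto, log
            # Step (2-b): run restore_command (simulated) and treat the
            # restored file as completely filled.
            log.append(("restored", seg))
            received_upto = (seg + 1) * SEGMENT_SIZE
            seg += 1
            continue  # Step (3-b): reconnect and try the next segment.
        # Connected: from here received_upto would advance as data arrives.
        log.append(("streaming", seg))
        return received_upto, log


if __name__ == "__main__":
    upto, log = walreceiver_loop({5, 6}, {2, 3, 4}, start_seg=2)
    print(upto // SEGMENT_SIZE, log)
```

With the primary holding segments 5 and 6 and the archive holding 2 to 4, the loop restores three files one at a time, reconnecting after each, and only then settles into streaming. That reconnect-per-file cost is the price of keeping all the logic inside walreceiver.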