On Fri, Sep 14, 2018 at 04:30:37PM +0530, Amit Kapila wrote:
> On Fri, Sep 14, 2018 at 12:57 PM Michael Paquier <mich...@paquier.xyz> wrote:
>> So, I have been working on this problem again and I have reviewed the
>> thread, and there have been many things discussed in the last couple of
>> months:
>> 1) We do not want to initialize XLogInsert stuff unconditionally for all
>> processes at the moment recovery begins, but we just want to initialize
>> it once WAL write is open for business.
>> 2) Both the checkpointer and the startup process can call
>> UpdateFullPageWrites() which can cause Insert->fullPageWrites to get
>> incorrect values.
>
> Can you share the steps to reproduce this problem?

This refers to the first problem reported on this thread:
https://www.postgresql.org/message-id/CAFiTN-u4BA8KXcQUWDPNgaKAjDXC%3DC2whnzBM8TAcv%3DstckYUw%40mail.gmail.com

In order to reproduce the problem, you can for example stop the server
in immediate mode.  Then attach a debugger to it and add a breakpoint to
UpdateFullPageWrites.  You can check that XLOG insert has not been
initialized yet by looking at xloginsert_cxt or ThisTimeLineID.  On a
second session, switch full_page_writes to on or off, reload the
parameters and then trigger a checkpoint.  The important point is to
trigger an inconsistency between XLogCtl->Insert->fullPageWrites and the
value of fullPageWrites within the checkpointer context.

With the checkpoint triggered, the debugger will stop at
UpdateFullPageWrites immediately.  At this point, you can simply check
whether fullPageWrites and Insert->fullPageWrites have the same value or
different ones.  If the values match, simply switch full_page_writes and
reload again, with the checkpointer still waiting at the beginning of
UpdateFullPageWrites.  SIGHUP will make the checkpointer process hang a
bit, and then it will move on.  At this point you will be able to see
the failure:
TRAP: FailedAssertion("!(CritSectionCount == 0)", File: "mcxt.c", Line: 731)
2018-09-18 15:06:39 JST [7396]: [11-1] db=,user=,app=,client= LOG: checkpointer process (PID 7399) was terminated by signal 6: Aborted

> On a regular startup when there is no recovery, it won't allow us to
> log the WAL record (XLOG_FPW_CHANGE) which can happen without above
> change.  You can check that by setting full_page_writes=off and start
> the system.

Oh, good point, InRecovery is set to false in this case so that would be
skipped.  We can simply fix that by adding a flag, say "force", to
UpdateFullPageWrites to allow a process to enforce the update of FPW
even if RecoveryInProgress returns true, which would be the case for the
startup process.
--
Michael
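
To make the "force" idea a bit more concrete, here is a rough standalone
sketch of the intended behavior.  It is only a model, not xlog.c code:
every model_* name is hypothetical, the shared flag stands in for
Insert->fullPageWrites, and the record counter stands in for writing an
XLOG_FPW_CHANGE record.  It also assumes that the update is skipped
during recovery unless the caller forces it.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for server state; none of these exist in xlog.c. */
static bool model_recovery_in_progress = true; /* what RecoveryInProgress() would report */
static bool model_shared_fpw = true;           /* stands in for Insert->fullPageWrites */
static int  model_fpw_records = 0;             /* counts XLOG_FPW_CHANGE-like records */

/*
 * Model of an UpdateFullPageWrites() taking a "force" argument.
 * Returns true if the shared value was changed.
 */
static bool
model_update_fpw(bool local_fpw, bool force)
{
    /* Nothing to do if the local GUC value already matches the shared one. */
    if (local_fpw == model_shared_fpw)
        return false;

    /*
     * Assumption: while recovery is in progress, only a caller passing
     * force = true (the startup process) may update the shared value.
     */
    if (model_recovery_in_progress && !force)
        return false;

    model_shared_fpw = local_fpw;
    model_fpw_records++;        /* stands in for logging XLOG_FPW_CHANGE */
    return true;
}
```

In this model the checkpointer would call model_update_fpw(value, false)
and be a no-op during recovery, while the startup process would call it
with force = true.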