Hello Thomas,

While testing zheap over the undo APIs, we've found the following issues/scenarios that might need some fixes or discussion:
1. In UndoLogAllocateInRecovery, when we find the current log number from the list of registered blocks, we don't check whether the block->in_use flag is true. In XLogResetInsertion, we only reset the in_use flag without resetting the blocks[]->rnode information. So, without the in_use check, we may consult block information left over from the previous WAL record. IMHO, just adding an in_use check in UndoLogAllocateInRecovery will solve the problem.

2. A transaction inserts one undo record and generates a WAL record for it, say at WAL location 0/2000A000. Next, the undo record gets discarded and a WAL record updating the meta.discard pointer is generated at location 0/2000B000. At the same time, an ongoing checkpoint with checkpoint.redo at 0/20000000 flushes the latest meta.discard pointer. Now the system crashes, and recovery starts from location 0/20000000. When the record at 0/2000A000 is replayed, recovery sees that the undo record it is about to insert has already been discarded according to meta.discard (as flushed by the checkpoint). In this case, should we just skip inserting the undo record?

3. Currently, we create a backup image of the unlogged part of the undo log's metadata only when some backend allocates space from the undo log (in UndoLogAllocate). This lets us restore the unlogged part of the metadata after a checkpoint. When we perform an undo action, we also update the undo action progress and emit a WAL record. The same operation can be performed by the undo worker, which doesn't allocate any space from the undo log. So, if an undo worker emits a WAL record to update the undo action progress after a checkpoint, it won't be able to WAL-log the backup image of the unlogged part of the metadata. IMHO, this breaks the recovery logic for the unlogged part of the undo metadata.

Thoughts?
On Mon, Sep 2, 2019 at 9:47 AM Thomas Munro <thomas.mu...@gmail.com> wrote:
>
> On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh <kuntalghosh.2...@gmail.com>
> wrote:
> > I'm getting the following assert failure while performing the recovery
> > with the same.
> > "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
> > File: "undolog.c", Line: 997)"
> >
> > I found that we don't emit an WAL record when we update the
> > slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after
> > crash recovery, some new transaction may use that undo log which is
> > wrong, IMHO. Am I missing something?
>
> Thanks, right, that status logging is wrong, will fix in next version.
>
> --
> Thomas Munro
> https://enterprisedb.com

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com