On Wed, Aug 14, 2019 at 2:57 AM Andres Freund <and...@anarazel.de> wrote: > - My reading of the current xact.c integration is that it's not workable > as is. Undo is executed outside of a valid transaction state, > exceptions aren't properly undone, logic would need to be duplicated > to a significant degree, new kind of critical section.
Regarding this particular point: ReleaseResourcesAndProcessUndo() is only supposed to be called after AbortTransaction(), and the additional steps it performs -- AtCleanup_Portals() and AtEOXact_Snapshot() or alternatively AtSubCleanup_Portals -- are taken from Cleanup(Sub)Transaction. That's not crazy; the other steps in Cleanup(Sub)Transaction() look like stuff that's intended to be performed when we're totally done with this TransactionState stack entry, whereas these things are slightly higher-level cleanups that might even block undo (e.g. undropped portal prevents orphaned file cleanup). Granted, there are no comments explaining why those particular cleanup steps are performed here, and it's possible some other approach is better, but I think perhaps it's not quite as flagrantly broken as you think. I am also not convinced that semi-critical sections are a bad idea, although the if (SemiCritSectionCount > 0) test at the start of ReleaseResourcesAndProcessUndo() looks wrong. To roll back a subtransaction, we must perform undo in the foreground, and if that fails, the toplevel transaction can't be allowed to commit, full stop. Since we expect this to be a (very) rare scenario, I don't know why escalating to FATAL is a catastrophe. The only other option is to do something along the lines of SxactIsDoomed(), where we force all subsequent commits (and sub-commits?) within the toplevel xact to fail. You can argue that the latter is a better user experience, and for SSI I certainly agree, but this case isn't quite the same: there's a good chance we're dealing with a corrupted page or system administrator intervention to try to kill off a long-running undo task, and continuing in such cases seems a lot more dubious than after a serializability failure, where retrying is the expected recovery mode. The other case is where toplevel undo for a temporary table fails. It is unclear to me what, other than FATAL, could suffice there. I guess you could just let the session continue and leave the transaction undone, leaving whatever MVCC machinery the table AM may have look through it, but that sounds inferior to me. Rip the bandaid off. Some general complaints from my side about the xact.c changes: 1. The code structure doesn't seem quite right. For example: 1a. ProcessUndoRequestForEachLogCat has a try/catch block, but it seems to me that the job of a try/catch block is to provide structured error-handling for code for resources for which there's no direct handling in xact.c or resowner.c. Here, we're inside of xact.c, so why are we adding a try/catch block 1b. ReleaseResourcesAndProcessUndo does part of the work of cleaning up a failed transaction but not all of it, the rest being done by AbortTransaction, which is called before entering it, plus it also kicks off the actual undo work. I would expect a cleaner division of responsibility. 1c. Having an undo request per UndoLogCategory rather than one per transaction doesn't seem right to me; hopefully that will get cleaned up when the undorequest.c stuff I sent before is integrated. 1d. The code at the end of FinishPreparedTransaction() seems to expect that the called code will catch any error, but that clearing the error state might need to happen here, and also that we should fire up a new transaction; I suspect, but am not entirely sure, that that is not the way it should work. The code added earlier in that function also looks suspicious, because it's filling up what is basically a high-level control function with a bunch of low-level undo-specific details. In both places, the undo-specific concerns probably need to be better-isolated. 2. Signaling is done using some odd-looking mechanisms. For instance: 2a. The SemiCritSectionCount > 0 test at the top of ReleaseResourcesAndProcessUndo that I complained about earlier looks like a guard against reentrancy, but that must be the wrong way to get there; it makes it impossible to reuse what is ostensibly a general-purpose facility for any non-undo related purpose without maybe breaking something. 2b. ResetUndoActionsInfo() is called from a bunch of place, but only 2 of those places have a comment explaining why, and the function comment is pretty unilluminating. This looks like some kind of signaling machinery, but it's not very clear to me what it's actually trying to do. 2c. ResourceOwnerReleaseInternal() is directly calling NeedToPerformUndoActions(), which feels like a layering violation. 2d. I'm not really sure that TRANS_UNDO is serving any useful purpose; I think we need TBLOCK_UNDO and TBLOCK_SUBUNDO, but I'm not really sure TRANS_UNDO is doing anything useful; the change to SubTransactionIsActive() looks wrong to me, and the other changes I think would mostly go away if we just used TRANS_INPROGRESS. 2e. I'm skeptical that the err_out_to_client() stuff is the right way to suppress undo failure messages from being sent to the client. That needs to be done, but this doesn't seem like the right way. This is related to my complaint above about using a try/catch block inside xact.c. 3. I noticed a few other mistakes when reading through this again which I include here for the sake of completeness: 3a. memset(..., InvalidUndoRecPtr, ...) will only happen to work if every byte of InvalidUndoRecPtr happens to have the same value. That happens to be true, because it's defined 8 bytes of zeroes, but it's not OK to code it like this. 3b. "undoRequestResgistered" is a typo. 3c. GetEpochForXid definitely shouldn't exist any more... as has been reported in the past. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company