Hi,

On Saturday, we had a nice in-person conversation about the requirements that zedstore has for an undo facility vs. the requirements that zheap has vs. the current design of the undo patch set. The people present were: Heikki Linnakangas, Amit Kapila, Thomas Munro, Kuntal Ghosh, Andres Freund, and me. The remainder of this email consists of the notes that I took during that conversation, with some subsequent editing to try to make them more broadly understandable.
zedstore needs a very fast way to determine whether an undo pointer is old enough that it doesn't matter. Perhaps we should keep a backend-local cache of discard pointers, so that we can test an undo pointer against the discard horizon without needing to touch shared memory (a sketch of such a check appears below).

zedstore also needs a very fast way of obtaining the XID when it looks up an undo pointer. It seems inefficient to store the full XID (especially an 8-byte FullTransactionId) in every undo record, but storing it only in the transaction header makes getting the XID prohibitively expensive. Heikki had the idea of storing a flag in each undo record saying 'same X as the first tuple on the page', where X might be XID, CID, relation OID, block number, etc. (also sketched below). That seems like a good idea. We'd need to decide exactly how many such flags to have and which fields they cover; and it's probably best not to depend on an undo record belonging to another transaction, which might be independently discarded. Another option is to put a "default" for each of these values in the page header and then store a value in an individual undo record only if it differs from the value in the page header. We don't have a way to do a similar optimization for whatever individual clients of the undo machinery choose to store in the undo record's payload; that would be nice to have, if we could figure out how to do it.

zedstore intends to store an undo pointer per tuple, whereas the current zheap code stores an undo pointer per transaction slot. Therefore, zheap wants an undo record to point to the previous undo record for the same transaction slot, whereas zedstore wants an undo record to point to the previous undo record for the same tuple (which might not belong to the same transaction). The undo code that assumes per-block chaining (uur_blkprev) needs to be made more generic.

For either zedstore or zheap, newly-inserted tuples could potentially point to an XID/CID rather than an undo record pointer, because visibility checks don't need to look at the actual undo record. An undo record still needs to be written in case we abort, but we don't need to be able to refer to it for visibility purposes.

zedstore and zheap have different batching requirements. Both want to process undo records in TID order, but zedstore doesn't care about pages. The current undo patch set calls the RMGR-specific callback once per page; that needs to be generalized (a possible shape for that is sketched below). Is it possible that some other table AM might want to sort by something other than TID?

Right now, zedstore stores a separate btree for each column, plus an extra btree for the visibility information, but it could have column groups instead. If you put all of the columns and the visibility information into a single column group, you'd have a row store. How would the performance of that solution compare with zheap?

Maybe zheap should forget about having transaction slots and just store an undo pointer in each tuple. That would be worse in cases such as bulk loading, where all the tuples on the page are modified by the same transaction. We could optimize that case by using something like a transaction slot, but then there's a problem if, say, every tuple on the page is updated or deleted by a separate transaction: the tuples need to get bigger to accommodate separate undo pointers, and the page might overflow. Perhaps we should design a solution that allows for the temporary use of overflow space in such cases.
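To make the discard-test idea above concrete, here is a rough sketch of what the backend-local cache might look like. Everything in it is invented for illustration (MAX_UNDO_LOGS, UndoRecPtrGetLogNo, UndoLogGetSharedDiscardPtr, and the 44-bit split are assumptions, not the patch set's actual definitions). The property that makes this safe is that a discard pointer only ever advances: a stale cache entry can still give a definite "already discarded" answer, and we only touch shared memory when the pointer looks live.

/* Sketch only; assumes postgres.h for uint64, bool, etc. */

#define MAX_UNDO_LOGS 1024                /* invented limit */

typedef uint64 UndoRecPtr;                /* 64-bit undo record pointer */
typedef int    UndoLogNumber;

/* invented: assume the high bits of an UndoRecPtr are the log number */
#define UndoRecPtrGetLogNo(urp) ((UndoLogNumber) ((urp) >> 44))

/* invented: fetch the authoritative discard pointer from shared memory */
extern UndoRecPtr UndoLogGetSharedDiscardPtr(UndoLogNumber logno);

/* per-backend cache: one discard horizon per undo log */
static UndoRecPtr cached_discard_ptr[MAX_UNDO_LOGS];

static bool
UndoRecPtrIsDiscarded(UndoRecPtr urp)
{
    UndoLogNumber logno = UndoRecPtrGetLogNo(urp);

    /* Discard pointers only move forward, so this answer is final. */
    if (urp < cached_discard_ptr[logno])
        return true;

    /* Looks live, but our cache may be stale: refresh and retest. */
    cached_discard_ptr[logno] = UndoLogGetSharedDiscardPtr(logno);
    return urp < cached_discard_ptr[logno];
}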
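Heikki's same-as-first-tuple flags might similarly boil down to a few bits in the undo record header, along these lines. The names and layout are made up for illustration; the same bits could equally refer to per-page defaults stored in the undo page header, per the alternative idea above.

/* When a bit is set, the field is omitted from this record and taken
 * from the first record on the same undo page (or from a per-page
 * default in the page header). */
#define UREC_SAME_XID     0x01    /* FullTransactionId omitted */
#define UREC_SAME_CID     0x02    /* CommandId omitted */
#define UREC_SAME_RELOID  0x04    /* relation OID omitted */
#define UREC_SAME_BLOCK   0x08    /* block number omitted */

typedef struct UndoRecordHeader
{
    uint8    urec_flags;          /* UREC_SAME_* bits */
    uint8    urec_type;           /* AM-specific record type */
    uint16   urec_payload_len;    /* length of AM-specific payload */
    /* fields not covered by a SAME bit follow, then the payload */
} UndoRecordHeader;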
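As for the batching question, the generalization might be little more than a change in the shape of the rmgr callback, roughly as below. The typedef names are invented; UnpackedUndoRecord is the patch set's existing unpacked-record type, details elided here.

/* the patch set's unpacked undo record, details elided */
typedef struct UnpackedUndoRecord UnpackedUndoRecord;

/* roughly today's shape: one call per affected block */
typedef void (*rm_undo_per_block_cb) (BlockNumber blkno,
                                      UnpackedUndoRecord *recs,
                                      int nrecs);

/* generalized shape: one call per TID-sorted batch; the table AM
 * decides how, or whether, to group the records further */
typedef void (*rm_undo_batch_cb) (UnpackedUndoRecord *recs, int nrecs);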
Tuple locking is complicated for both zheap and zedstore. Storing the locker XIDs in the tuple or page is tempting, but you can run out of space; where do you put the extras? Writing an undo record per lock is also tempting, but that could generate a very large amount of undo if many transactions each lock a whole table; the heap doesn't have that problem, because it uses MultiXacts. On the other hand, in the current heap, it's easily possible for N transactions to burn through O(N^2) MultiXactIds, which is worse than an undo record per lock. A perfect system would avoid permanently bloating the table when many transactions each take many locks, but it's hard to design a system that has that property AND ALSO is maximally compact for a table that is bulk-loaded and then never updated again. (Perhaps it's OK for the table to expand a little bit if we find that we need to make space for a MultiXactId pointer or similar in each tuple? Thomas later dubbed this doubtless-uncontroversial design non-in-place select, and we're all looking forward to review comments on noninplace_select.c.)

Heikki proposed an out-of-line btree of tuple locks, with compression to represent locks on TID ranges via a single entry (sketched below), stored in memory if small and spilling to disk if large. It could be discarded on startup, except for locks held by prepared transactions. Updates might not need to enter their own tuple locks in this btree, but they would need to propagate key-share locks forward to new TIDs.

Robert and Thomas came up with the idea of having a special kind of undo record that lives outside of a transaction: a backend could attach to an additional undo log, insert one of these special records, and then detach. These special undo records would have a special rule for when they could be discarded; an rmgr callback would decide when it's OK to discard one (also sketched below). This could serve as a replacement for MultiXactIds: you could store a collection of XIDs in the payload of one of these special records. (Andres was not very impressed by this idea.) Later, after Heikki left, we talked about perhaps also allowing these special undo records to have a special rule for when undo actions are executed, so that you could use the undo framework for on-commit actions. It doesn't seem desirable to perform on-commit actions frequently; one of the motivating ideas behind zheap is that commits should be as cheap as possible, even if that costs something in the abort case. But it might be useful to have the option available for rare or exceptional cases.

After this discussion, Heikki is thinking that he might just allow each tuple in zedstore to store both an undo log pointer and a MultiXactId, elided when not needed. The MultiXact system would need to be updated to make MultiXactIds 64 bits wide, in order to avoid the need to freeze them. No one seemed keen to build a new storage engine that still requires freezing.

When subtransactions are used, the undo system intends to use the toplevel XID for everything, rather than requiring additional XIDs for subtransactions. After some discussion, this seems like it should work well for both zheap and zedstore.
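To make a couple of the locking ideas above concrete: an entry in Heikki's out-of-line lock btree, with TID-range compression, might look roughly like this (all names invented). A transaction locking a whole table would then insert one entry covering the full TID range rather than one entry per tuple.

/* Hypothetical entry in the out-of-line tuple-lock btree. */
typedef struct TupleLockEntry
{
    Oid             reloid;       /* which relation */
    ItemPointerData start_tid;    /* first TID in the locked range */
    ItemPointerData end_tid;      /* last TID in the locked range */
    TransactionId   locker_xid;   /* the locking transaction */
    uint8           lock_mode;    /* key share, share, etc. */
} TupleLockEntry;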
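And the special transaction-independent undo record, used as a MultiXactId replacement, might pair an XID-list payload with a discard callback of roughly this shape. The names are invented; TransactionIdPrecedes is the existing transam.h primitive.

/* Hypothetical payload: the set of lockers stored in one special record. */
typedef struct XidListUndoPayload
{
    int            nxids;
    TransactionId  xids[FLEXIBLE_ARRAY_MEMBER];
} XidListUndoPayload;

/* Hypothetical rmgr callback: the record becomes discardable once none
 * of the recorded XIDs can still be running. */
static bool
xid_list_undo_discardable(XidListUndoPayload *payload,
                          TransactionId oldest_active_xid)
{
    for (int i = 0; i < payload->nxids; i++)
    {
        if (!TransactionIdPrecedes(payload->xids[i], oldest_active_xid))
            return false;        /* this locker may still be running */
    }
    return true;
}

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company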