On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro <thomas.mu...@enterprisedb.com> wrote: > > On Fri, Aug 31, 2018 at 10:24 PM Dilip Kumar > <dilip.ku...@enterprisedb.com> wrote: > > On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <dilipbal...@gmail.com> wrote: > > > As Thomas has already mentioned upthread that we are working on an > > > undo-log based storage and he has posted the patch sets for the lowest > > > layer called undo-log-storage. > > > > > > This is the next layer which sits on top of the undo log storage, > > > which will provide an interface for prepare, insert, or fetch the undo > > > records. This layer will use undo-log-storage to reserve the space for > > > the undo records and buffer management routine to write and read the > > > undo records. > > I have also pushed a new WIP version of the lower level undo log > storage layer patch set to a public branch[1]. I'll leave the earlier > branch[2] there because the record-level patch posted by Dilip depends > on it for now. >
I have started reading the patch and have a few assorted comments which are mentioned below. I have been involved in the high-level design of this module and I have also shared given some suggestions during development, but this is mainly Thomas's work with some help from Dilip. It would be good if other members of the community also review the design or participate in the discussion. Comments ------------------ undo/README ----------------------- 1. +The undo log subsystem provides a way to store data that is needed for +a limited time. Undo data is generated whenever zheap relations are +modified, but it is only useful until (1) the generating transaction +is committed or rolled back and (2) there is no snapshot that might +need it for MVCC purposes. I think the snapshots need it for MVCC purpose and we need it till the transaction is committed and all-visible. 2. +* the tablespace that holds its segment files +* the persistence level (permanent, unlogged, temporary) +* the "discard" pointer; data before this point has been discarded +* the "insert" pointer: new data will be written here +* the "end" pointer: a new undo segment file will be needed at this point + +The three pointers discard, insert and end move strictly forwards +until the whole undo log has been exhausted. At all times discard <= +insert <= end. When discard == insert, the undo log is empty +(everything that has ever been inserted has since been discarded). +The insert pointer advances when regular backends allocate new space, +and the discard pointer usually advances when an undo worker process +determines that no session could need the data either for rollback or +for finding old versions of tuples to satisfy a snapshot. In some +special cases including single-user mode and temporary undo logs the +discard pointer might also be advanced synchronously by a foreground +session. Here, the use of insert and discard pointer are explained nicely. Can you elaborate the usage of end pointer as well? 3. +UndoLogControl objects corresponding to the current set of active undo +logs are held in a fixed-sized pool in shared memory. The size of +the array is a multiple of max_connections, and limits the total size of +transactions. Here, it is mentioned the array is multiple of max_connections, but the code uses MaxBackends. Can you sync them? 4. +The meta-data for all undo logs is written to disk at every +checkpoint. It is stored in files under PGDATA/pg_undo/, using the +checkpoint's redo point (a WAL LSN) as its filename. At startup time, +the redo point's file can be used to restore all undo logs' meta-data +as of the moment of the redo point into shared memory. Changes to the +discard pointer and end pointer are WAL-logged by undolog.c and will +bring the in-memory meta-data up to date in the event of recovery +after a crash. Changes to insert pointers are included in other WAL +records (see below). I see one inconvenience for using checkpoint's redo point for meta file name which is what if someone uses pg_resetxlog to truncate the redo? Is there any reason we can't keep a different name for the meta file? 5. +stabilize on one undo log per active writing backend (or more if +different tablespaces are persistence levels are used). /tablespaces are persistence levels/tablespaces and persistence levels I think due to the above design, we can now reach the maximum number of undo logs quickly as the patch now uses fixed shared memory to represent them. I am not sure if there is an easy way to avoid that. Can we try to expose guc for a maximum number of undo slots such that instead of MaxBackends * 4, it could be MaxBackends * <new_guc>? undo-log-manager patch ------------------------------------ 6. @@ -127,6 +128,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) size = add_size(size, ProcGlobalShmemSize()); size = add_size(size, XLOGShmemSize()); size = add_size(size, CLOGShmemSize()); + size = add_size(size, UndoLogShmemSize()); size = add_size(size, CommitTsShmemSize()); size = add_size(size, SUBTRANSShmemSize()); size = add_size(size, TwoPhaseShmemSize()); @@ -219,6 +221,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) */ XLOGShmemInit(); CLOGShmemInit(); + UndoLogShmemInit(); It seems to me that we will always allocate shared memory for undo logs irrespective of whether someone wants to use them or not. Am I right? If so, isn't it better if we find some way that this memory is allocated only when someone has a need for it? 7. +/* + * How many undo logs can be active at a time? This creates a theoretical + * maximum transaction size, but it we set it to a factor the maximum number + * of backends it will be a very high limit. Alternative designs involving + * demand paging or dynamic shared memory could remove this limit but + * introduce other problems. + */ +static inline size_t +UndoLogNumSlots(void) +{+ return MaxBackends * 4; +} Seems like typos in above comment /but it we/but if we /factor the maximum number -- the sentence is not completely clear. 8. + * Extra shared memory will be managed using DSM segments. + */ +Size +UndoLogShmemSize(void) You told in the email that the patch doesn't use DSM segments anymore, but the comment seems to indicate otherwise. 9. /* + * TODO: Should this function be usable in a critical section? + * Woudl it make sense to detect that we are in a critical Typo /Woudl/Would 10. +static void +undolog_xlog_attach(XLogReaderState *record) +{ + xl_undolog_attach *xlrec = (xl_undolog_attach *) XLogRecGetData(record); + UndoLogControl *log; + + undolog_xid_map_add(xlrec->xid, xlrec->logno); + + /* + * Whatever follows is the first record for this transaction. Zheap will + * use this to add UREC_INFO_TRANSACTION. + */ + log = get_undo_log(xlrec->logno, false); + /* TODO */ There are a lot of TODO's in the code, among them, above is not at all clear. 11. + UndoRecPtr oldest_data; + +} UndoLogControl; Extra space after the last member looks odd. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com