Hi,

On 2022-01-24 16:38:54 +0400, Pavel Borisov wrote:
> +64-bit Transaction ID's (XID)
> +=============================
> +
> +A limited number (N = 2^32) of XID's required to do vacuum freeze to prevent
> +wraparound every N/2 transactions. This causes performance degradation due
> +to the need to exclusively lock tables while being vacuumed. In each
> +wraparound cycle, SLRU buffers are also being cut.

What exclusive lock?

> +"Double XMAX" page format
> +---------------------------------
> +
> +At first read of a heap page after pg_upgrade from 32-bit XID PostgreSQL
> +version pd_special area with a size of 16 bytes should be added to a page.
> +Though a page may not have space for this. Then it can be converted to a
> +temporary format called "double XMAX".
>
> +All tuples after pg-upgrade would necessarily have xmin = FrozenTransactionId.

Why would a tuple after pg-upgrade necessarily have xmin = FrozenTransactionId?
A pg_upgrade doesn't scan the tables, so the pg_upgrade itself doesn't do
anything to xmins.

I guess you mean that the xmin cannot be needed anymore, because no older
transaction can be running?

> +In-memory tuple format
> +----------------------
> +
> +In-memory tuple representation consists of two parts:
> +- HeapTupleHeader from disk page (contains all heap tuple contents, not only
> +header)
> +- HeapTuple with additional in-memory fields
> +
> +HeapTuple for each tuple in memory stores t_xid_base/t_multi_base - a copies of
> +page's pd_xid_base/pd_multi_base. With tuple's 32-bit t_xmin and t_xmax from
> +HeapTupleHeader they are used to calculate actual 64-bit XMIN and XMAX:
> +
> +XMIN = t_xmin + t_xid_base. (3)
> +XMAX = t_xmax + t_xid_base/t_multi_base. (4)

What identifies a HeapTuple as having this additional data?

> +The downside of this is that we can not use tuple's XMIN and XMAX right away.
> +We often need to re-read t_xmin and t_xmax - which could actually be pointers
> +into a page in shared buffers and therefore they could be updated by any other
> +backend.

Ugh, that's not great.

> +Upgrade from 32-bit XID versions
> +--------------------------------
> +
> +pg_upgrade doesn't change pages format itself. It is done lazily after.
> +
> +1. At first heap page read, tuples on a page are repacked to free 16 bytes
> +at the end of a page, possibly freeing space from dead tuples.

That will cause a *massive* torrent of writes after an upgrade. Isn't this
practically making pg_upgrade useless? Imagine a huge cluster where most of
the pages are all-frozen, upgraded using link mode.

What happens if the first access happens on a replica?

What is the approach for dealing with multixact files? They have xids
embedded? And currently the SLRUs will break if you just let the offsets SLRU
grow without bounds.

> +void
> +convert_page(Relation rel, Page page, Buffer buf, BlockNumber blkno)
> +{
> +	PageHeader	hdr = (PageHeader) page;
> +	GenericXLogState *state = NULL;
> +	Page		tmp_page = page;
> +	uint16		checksum;
> +
> +	if (!rel)
> +		return;
> +
> +	/* Verify checksum */
> +	if (hdr->pd_checksum)
> +	{
> +		checksum = pg_checksum_page((char *) page, blkno);
> +		if (checksum != hdr->pd_checksum)
> +			ereport(ERROR,
> +					(errcode(ERRCODE_INDEX_CORRUPTED),
> +					 errmsg("page verification failed, calculated checksum %u but expected %u",
> +							checksum, hdr->pd_checksum)));
> +	}
> +
> +	/* Start xlog record */
> +	if (!XactReadOnly && XLogIsNeeded() && RelationNeedsWAL(rel))
> +	{
> +		state = GenericXLogStart(rel);
> +		tmp_page = GenericXLogRegisterBuffer(state, buf, GENERIC_XLOG_FULL_IMAGE);
> +	}
> +
> +	PageSetPageSizeAndVersion((hdr), PageGetPageSize(hdr),
> +							  PG_PAGE_LAYOUT_VERSION);
> +
> +	if (was_32bit_xid(hdr))
> +	{
> +		switch (rel->rd_rel->relkind)
> +		{
> +			case 'r':
> +			case 'p':
> +			case 't':
> +			case 'm':
> +				convert_heap(rel, tmp_page, buf, blkno);
> +				break;
> +			case 'i':
> +				/* no need to convert index */
> +			case 'S':
> +				/* no real need to convert sequences */
> +				break;
> +			default:
> +				elog(ERROR,
> +					 "Conversion for relkind '%c' is not implemented",
> +					 rel->rd_rel->relkind);
> +		}
> +	}
> +
> +	/*
> +	 * Mark buffer dirty unless this is a read-only transaction (e.g.
> +	 * query is running on hot standby instance)
> +	 */
> +	if (!XactReadOnly)
> +	{
> +		/* Finish xlog record */
> +		if (XLogIsNeeded() && RelationNeedsWAL(rel))
> +		{
> +			Assert(state != NULL);
> +			GenericXLogFinish(state);
> +		}
> +
> +		MarkBufferDirty(buf);
> +	}
> +
> +	hdr = (PageHeader) page;
> +	hdr->pd_checksum = pg_checksum_page((char *) page, blkno);
> +}

Wait. So you just modify the page without WAL logging or marking it dirty on a
standby? I fail to see how that can be correct.

Imagine the cluster is promoted, the page is dirtied, and we write it
out. You'll have written out a completely changed page, without any WAL
logging. There's plenty other scenarios.

Greetings,

Andres Freund