Hi,

On 2022-01-24 16:38:54 +0400, Pavel Borisov wrote:
> +64-bit Transaction IDs (XID)
> +============================
> +
> +A limited number (N = 2^32) of XIDs requires a vacuum freeze to prevent
> +wraparound every N/2 transactions. This causes performance degradation due
> +to the need to exclusively lock tables while being vacuumed. In each
> +wraparound cycle, SLRU buffers are also truncated.

What exclusive lock?


> +"Double XMAX" page format
> +-------------------------
> +
> +On first read of a heap page after pg_upgrade from a 32-bit XID PostgreSQL
> +version, a pd_special area of 16 bytes should be added to the page. However,
> +a page may not have space for this; in that case it can be converted to a
> +temporary format called "double XMAX".
>
> +All tuples after pg_upgrade would necessarily have xmin = FrozenTransactionId.

Why would a tuple after pg_upgrade necessarily have xmin =
FrozenTransactionId? pg_upgrade doesn't scan the tables, so pg_upgrade
itself doesn't do anything to xmins.

I guess you mean that the xmin cannot be needed anymore, because no older
transaction can be running?


> +In-memory tuple format
> +----------------------
> +
> +In-memory tuple representation consists of two parts:
> +- HeapTupleHeader from disk page (contains all heap tuple contents, not only
> +header)
> +- HeapTuple with additional in-memory fields
> +
> +For each tuple in memory, HeapTuple stores t_xid_base/t_multi_base - copies
> +of the page's pd_xid_base/pd_multi_base. Together with the tuple's 32-bit
> +t_xmin and t_xmax from HeapTupleHeader, these are used to calculate the
> +actual 64-bit XMIN and XMAX:
> +
> +XMIN = t_xmin + t_xid_base                    (3)
> +XMAX = t_xmax + t_xid_base/t_multi_base       (4)

What identifies a HeapTuple as having this additional data?


> +The downside of this is that we cannot use the tuple's XMIN and XMAX right
> +away. We often need to re-read t_xmin and t_xmax - which may actually be
> +pointers into a page in shared buffers, and can therefore be updated by
> +any other backend.

Ugh, that's not great.


> +Upgrade from 32-bit XID versions
> +--------------------------------
> +
> +pg_upgrade doesn't change the page format itself; that is done lazily
> +afterwards.
> +
> +1. On the first read of a heap page, tuples on the page are repacked to free
> +16 bytes at the end of the page, possibly reclaiming space from dead tuples.

That will cause a *massive* torrent of writes after an upgrade. Isn't this
practically making pg_upgrade useless?  Imagine a huge cluster where most of
the pages are all-frozen, upgraded using link mode.


What happens if the first access happens on a replica?


What is the approach for dealing with multixact files? They have xids
embedded?  And currently the SLRUs will break if you just let the offsets SLRU
grow without bounds.



> +void
> +convert_page(Relation rel, Page page, Buffer buf, BlockNumber blkno)
> +{
> +     PageHeader      hdr = (PageHeader) page;
> +     GenericXLogState *state = NULL;
> +     Page    tmp_page = page;
> +     uint16  checksum;
> +
> +     if (!rel)
> +             return;
> +
> +     /* Verify checksum */
> +     if (hdr->pd_checksum)
> +     {
> +             checksum = pg_checksum_page((char *) page, blkno);
> +             if (checksum != hdr->pd_checksum)
> +                     ereport(ERROR,
> +                                     (errcode(ERRCODE_INDEX_CORRUPTED),
> +                                      errmsg("page verification failed, calculated checksum %u but expected %u",
> +                                                     checksum, hdr->pd_checksum)));
> +     }
> +
> +     /* Start xlog record */
> +     if (!XactReadOnly && XLogIsNeeded() && RelationNeedsWAL(rel))
> +     {
> +             state = GenericXLogStart(rel);
> +             tmp_page = GenericXLogRegisterBuffer(state, buf, GENERIC_XLOG_FULL_IMAGE);
> +     }
> +
> +     PageSetPageSizeAndVersion((hdr), PageGetPageSize(hdr),
> +                                                       PG_PAGE_LAYOUT_VERSION);
> +
> +     if (was_32bit_xid(hdr))
> +     {
> +             switch (rel->rd_rel->relkind)
> +             {
> +                     case 'r':
> +                     case 'p':
> +                     case 't':
> +                     case 'm':
> +                             convert_heap(rel, tmp_page, buf, blkno);
> +                             break;
> +                     case 'i':
> +                             /* no need to convert index */
> +                     case 'S':
> +                             /* no real need to convert sequences */
> +                             break;
> +                     default:
> +                             elog(ERROR,
> +                                      "Conversion for relkind '%c' is not implemented",
> +                                      rel->rd_rel->relkind);
> +             }
> +     }
> +
> +     /*
> +      * Mark buffer dirty unless this is a read-only transaction (e.g. query
> +      * is running on hot standby instance)
> +      */
> +     if (!XactReadOnly)
> +     {
> +             /* Finish xlog record */
> +             if (XLogIsNeeded() && RelationNeedsWAL(rel))
> +             {
> +                     Assert(state != NULL);
> +                     GenericXLogFinish(state);
> +             }
> +
> +             MarkBufferDirty(buf);
> +     }
> +
> +     hdr = (PageHeader) page;
> +     hdr->pd_checksum = pg_checksum_page((char *) page, blkno);
> +}

Wait. So you just modify the page without WAL logging or marking it dirty on a
standby? I fail to see how that can be correct.

Imagine the cluster is promoted, the page is dirtied, and we write it
out. You'll have written out a completely changed page, without any WAL
logging. There are plenty of other scenarios.


Greetings,

Andres Freund

