On Fri, Nov 22, 2024 at 12:30 AM Matthias van de Meent < boekewurm+postg...@gmail.com> wrote:
> On Thu, 21 Nov 2024, 17:18 Erik Nordström, <e...@timescale.com> wrote: > >> Hello, >> >> I've noticed a change in behavior of the heap rewrite functionality in >> PostgreSQL 17, used by, e.g., CLUSTER. I've been experimenting with the >> functionality to implement a way to merge partitions in TimescaleDB. I am >> using table_relation_copy_for_cluster() to write the data of several tables >> to a single merged table, and then I do a heap swap on one of the original >> tables while dropping the others. So, if you have 3 partitions and want to >> merge them to one, then I write all three partitions to a temporary heap, >> swap the new heap on partition 1 and then drop partitions 2 and 3. >> >> Now, this approach worked fine for PostgreSQL 15 and 16, but 17 >> introduced some changes that altered the behavior so that I only see data >> from one of the partitions after merge (the last one written). >> >> The commit that I think is responsible is the following: >> https://github.com/postgres/postgres/commit/8af256524893987a3e534c6578dd60edfb782a77 >> >> Now, I realize that this is not a problem for PostgreSQL itself since the >> rewrite functionality isn't used for the purpose I am using it. To my >> defense, the rewrite code seems to imply that it should be possible to >> write more data to an existing heap according to this comment in >> begin_heap_rewrite: /* new_heap needn't be empty, just locked */. >> >> I've also tried recompiling PG17 with the rewriteheap.c file from PG16 >> and then it works again. I haven't yet been able to figure out exactly what >> is different but I will continue to try to narrow it down. In the meantime, >> maybe someone on the mailing list has some insight on what could be the >> issue and whether my approach is viable? >> > > I think the origin of the issue is at bulk_write.c, in smgr_bulk_flush > (specifically after line 282), which assumes the bulk write system is used > exclusively on empty relations. > > If you use a separate pair of begin/end_heap_rewrite for every relation > you merge into that heap, it'll re-initialize the bulk writer, which will > thus overwrite the previous rewrites' pages. The pre-PG17 rewriteheap.c > doesn't use that API, and thus isn't affected. > > I've CC-ed Heikki as author of that patch; maybe a new API to indicate > bulk writes into an existing fork would make sense, to solve Timescale's > issue and fix rewriteheap's behavior? > > > Kind regards, > > Matthias van de Meent > Neon (https://neon.tech) > Yes, looks like it is that code that zeros out the initial pages if the write doesn't start at block 0. And it looks like I don't have access to the bulk write state to set pages_written to trick it to believe those pages have already been written. Maybe it is possible to add an option to the bulkwriter to skip zero-initialization? The comment for that code seems to indicate it isn't strictly necessary anyway. Regards, Erik