> On Mar 5, 2026, at 07:33, [email protected] wrote:
>
> Attached is a new patch to demonstrate the bug in make_new_segment.
> It’s not meant as a permanent new test (it’s using some hard-coded
> assumptions about the x64 platform) -- just useful in validating the bug and
> fix.
> From: [email protected] <[email protected]>
> Sent: Wednesday, March 4, 2026 12:01 AM
> To: [email protected]
> Subject: [BUG + PATCH] DSA pagemap out-of-bounds in make_new_segment
> odd-sized path
> Hi hackers,
> Sorry for the previously poorly-formatted/threaded email.
> We've identified a bug in the DSA (Dynamic Shared Memory Area) allocator
> that causes memory corruption and crashes during parallel hash joins with
> large data sets. The bug has been present since the DSA implementation
> was introduced and affects all supported branches.
> Attached is a minimal fix (5 lines added, 0 changed).
> == Bug Summary ==
> In make_new_segment() (src/backend/utils/mmgr/dsa.c), there are two paths
> for computing segment layout:
> Path 1 (geometric): knows total_size upfront, computes
> total_pages = total_size / FPM_PAGE_SIZE
> pagemap entries = total_pages <-- correct
> Path 2 (odd-sized, when requested > geometric): works forward from
> usable_pages = requested_pages
> pagemap entries = usable_pages <-- BUG
> The pagemap is indexed by absolute page number. The FreePageManager hands
> out pages with indices from metadata_pages to (metadata_pages +
> usable_pages - 1). Since metadata_pages >= 1, page indices at the high
> end of the range exceed usable_pages, making the pagemap accesses
> out-of-bounds.
> == How It Was Found on Postgres 15 ==
> Multiple parallel worker backends crashed simultaneously with SIGSEGV in
> dsa_get_address(), called from dsa_free() in ExecHashTableDetachBatch()
> during parallel hash join batch cleanup.
> Stack trace:
> #0 dsa_get_address (area, dp) at dsa.c:955
> #1 dsa_free (area, dp) at dsa.c:839
> #2 ExecHashTableDetachBatch (hashtable) at nodeHash.c:3189
> #3 ExecParallelHashJoinNewBatch (hjstate) at nodeHashjoin.c:1157
> All crashing workers computed the same pageno (196993), which was the last
> valid FPM page but beyond the pagemap array (usable_pages = 196609).
> == Root Cause (from core dump analysis) ==
> The crashing segment had:
> usable_pages = 196,609 (pagemap has this many entries)
> metadata_pages = 385
> total_pages = 196,994 (= metadata_pages + usable_pages)
> last FPM page = 196,993 (= metadata_pages + usable_pages - 1)
> pagemap valid indices: 0 .. 196,608
> FPM page indices: 385 .. 196,993
> Pages 196,609 through 196,993 (385 pages) have no valid pagemap entry.
> The pagemap array ends 3,072 bytes before the data area starts (padding
> zone). Most out-of-bounds entries fall in this padding and cause silent
> corruption. The last 17 entries fall in the data area itself, causing
> bidirectional corruption: pagemap writes destroy allocated object data,
> and subsequent pagemap reads return garbage, crashing dsa_free().
While reviewing this patch, I took the opportunity to read through the DSA code
again.
I think the reason the bug is hard to trigger is that, even when a pagemap
access goes out of bounds, in most cases it lands in the padding area, which
happens to be harmless. Only when the padding area is too small to hold
metadata_bytes / FPM_PAGE_SIZE extra dsa_pointer entries does it cause a
segmentation fault.
> == The Fix ==
> After computing metadata_bytes with usable_pages pagemap entries, add
> entries for the metadata pages themselves:
> metadata_bytes +=
> ((metadata_bytes / (FPM_PAGE_SIZE - sizeof(dsa_pointer))) + 1) *
> sizeof(dsa_pointer);
> The divisor (FPM_PAGE_SIZE - sizeof(dsa_pointer)) = 4088 accounts for
> the circular dependency: each metadata page costs one pagemap entry
> (8 bytes), so only 4088 of each 4096-byte metadata page is net-free for
> other pagemap entries. The +1 absorbs the ceiling.
I think this fix is very clever. The one problem I see is that it adds extra
bytes to metadata_bytes before the padding size is computed, yet the padding
area itself is safe to hold pagemap entries. So if the padding area is already
big enough, we don't need to add any extra bytes to metadata_bytes at all.
Based on that, I made a fix:
```
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index ce9ede4c196..b7a089bc288 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -2131,6 +2131,7 @@ static dsa_segment_map *
make_new_segment(dsa_area *area, size_t requested_pages)
{
dsa_segment_index new_index;
+ size_t metadata_fixed_bytes;
size_t metadata_bytes;
size_t total_size;
size_t total_pages;
@@ -2180,9 +2181,11 @@ make_new_segment(dsa_area *area, size_t requested_pages)
area->control->total_segment_size);
total_pages = total_size / FPM_PAGE_SIZE;
- metadata_bytes =
+ metadata_fixed_bytes =
MAXALIGN(sizeof(dsa_segment_header)) +
- MAXALIGN(sizeof(FreePageManager)) +
+ MAXALIGN(sizeof(FreePageManager));
+ metadata_bytes =
+ metadata_fixed_bytes +
sizeof(dsa_pointer) * total_pages;
/* Add padding up to next page boundary. */
@@ -2196,19 +2199,36 @@ make_new_segment(dsa_area *area, size_t requested_pages)
/* See if that is enough... */
if (requested_pages > usable_pages)
{
+ size_t pagemap_entries;
+
/*
* We'll make an odd-sized segment, working forward from the requested
* number of pages.
*/
usable_pages = requested_pages;
- metadata_bytes =
- MAXALIGN(sizeof(dsa_segment_header)) +
- MAXALIGN(sizeof(FreePageManager)) +
- usable_pages * sizeof(dsa_pointer);
-
- /* Add padding up to next page boundary. */
- if (metadata_bytes % FPM_PAGE_SIZE != 0)
- metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ pagemap_entries = usable_pages;
+
+ for (;;)
+ {
+ size_t actual_pages;
+ size_t metadata_pages;
+
+ metadata_bytes =
+ metadata_fixed_bytes +
+ pagemap_entries * sizeof(dsa_pointer);
+
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+
+ metadata_pages = metadata_bytes / FPM_PAGE_SIZE;
+ actual_pages = metadata_pages + usable_pages;
+ if (pagemap_entries >= actual_pages)
+ break;
+
+ pagemap_entries = actual_pages;
+ }
+
total_size = metadata_bytes + usable_pages * FPM_PAGE_SIZE;
/* Is that too large for dsa_pointer's addressing scheme? */
```
I tested my version of the fix with the test script you attached in your next
email, and the test passed. I also feel my version is easier to understand.
That said, I haven't thought hard about whether your version really wastes the
padding area; if it doesn't, then I agree yours is shorter and neater. My only
concern is that it's a bit hard to understand, but that wouldn't be a problem
IMO if a proper comment were added.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/