Hey Samuel,

Thanks for the feedback. Here is a revised patch. I incorporated what you said: I removed the yield, only write strictly consecutive blocks, and was able to increase the maximum run length from 2 to 128 blocks per write in ext2. I also experimented with values larger than 128, but they did not improve the results; 128 seems like a good starting point.
Now the numbers look like this on my test (just boot, immediately create a 128 MiB file, and print the stats):

Before:
  ext2fs store_write calls: ~32,874
  avg write size: 4,096 B

After:
  ext2fs store_write calls: ~1,300 (I only printed stats every 100th call)
  avg write size: ≈103,474 B

That is ~25× larger writes, ~25× fewer calls, ~25× fewer lock acquisitions and releases, ~25× fewer disk seeks, etc. A clear win.

I would like to keep this patch small and do the DISK block support separately if possible (that has some extra wrinkles). I would also add tmpfs support and others in following patches. Does that make sense?

Kind regards,
Milos

On Thu, Aug 28, 2025 at 2:35 PM Samuel Thibault <samuel.thiba...@gnu.org> wrote:
> Hello,
>
> Thanks for working on this!
>
> Contiguous writes is indeed one of the things we want to coalesce.
> (We'd also want to prefetch coalesced reads; there's a whole GSoC
> project idea about it on the wiki.)
>
> Milos Nikic, le lun. 25 août 2025 15:00:27 -0700, a ecrit:
> > No stalls with 2-block chunks; larger chunks can starve writers, so the
> > ext2 default is 2.
>
> 2 looks like a very small aggregation...
>
> I'm thinking that you'd probably want to avoid pending_blocks_add calling
> pending_blocks_write, i.e. check for consecutive block numbers yourself,
> and break the for(b) loop when it is not, i.e. use non-consecutive block
> allocations as unlock hints, leading to calling store_write exactly once
> per lock acquisition. You'd probably still want to set an upper limit on
> the number of consecutive blocks, but it'd probably be fine with larger
> values, avoiding making a lot of store_write calls when block numbers
> are not consecutive at all.
>
> > Only FILE_DATA uses pager_write_pages(). DISK (swap, etc.) remains
> > per-page for now.
> >
> > If this approach looks good, I can follow up with:
> >
> > • an adaptive chunk size (back off when the rdlock is contended),
>
> I don't know how you'd test that the rdlock is contended?
> Just using a
> tryrdlock to determine whether you can acquire it immediately or not?
>
> About the yield() call, does it actually help? Thread wake-ups will
> already try to switch to the awakened thread if dynamic priorities tell
> so; yield() won't do more than this.
>
> > • optional support for other filesystems,
>
> Why optional?
>
> > • a DISK bulk path if we decide it’s worthwhile.
>
> I don't think it can be not worthwhile?
>
> > @@ -378,6 +379,121 @@ pending_blocks_add (struct pending_blocks *pb, block_t block)
> >    pb->num++;
> >    return 0;
> >  }
> > +
> > +/* Keep per-chunk work small to avoid starving writers on alloc_lock. */
> > +#define EXT2_BULK_CHUNK_BLOCKS 2 /* 2 * block_size per flush; */
>
> Rather put it at the top of the file for better visibility.
>
> > +  if (offset >= node->allocsize)
> > +    left = 0;
> > +  else if (offset + left > node->allocsize)
> > +    left = node->allocsize - offset;
>
> Better put these at the top of the while loop so it's factorized.
>
> Samuel
From af5cbb943d03c79e1713edcc045ecfa680007fc9 Mon Sep 17 00:00:00 2001
From: Milos Nikic <nikic.mi...@gmail.com>
Date: Mon, 25 Aug 2025 21:38:08 +0100
Subject: [PATCH] libpager, ext2fs: add bulk page write support

Introduce a new pager_write_pages() entry point, allowing filesystems
to implement multi-page writes in one call.  libpager will attempt to
coalesce contiguous dirty pages and call this bulk interface; if it is
unsupported or only partially completes, the remaining pages are
handled by the existing per-page path.

ext2fs now provides file_pager_write_pages(), which coalesces strictly
consecutive on-disk blocks into a single store_write() call (up to 128
blocks per run) under alloc_lock.  This reduces lock/unlock cycles and
batches store_write() calls.  In testing on a 128 MiB file, the average
store_write size improved from ~4 KiB to ~100 KiB, reducing device
write calls roughly 25-fold.  System stress tests all pass without
stalls or data corruption.

Other filesystems may continue using pager_write_page() unchanged.
---
 ext2fs/pager.c         | 128 ++++++++++++++++++++++++++++++++++++++++-
 libpager/Makefile      |   2 +-
 libpager/data-return.c |  57 +++++++++++++++---
 libpager/pager-bulk.c  |  37 ++++++++++++
 libpager/pager.h       |  10 ++++
 5 files changed, 224 insertions(+), 10 deletions(-)
 create mode 100644 libpager/pager-bulk.c

diff --git a/ext2fs/pager.c b/ext2fs/pager.c
index c55107a9..a47fb381 100644
--- a/ext2fs/pager.c
+++ b/ext2fs/pager.c
@@ -83,11 +83,16 @@ do { pthread_spin_lock (&ext2s_pager_stats.lock); \
 #define STAT_INC(field) /* nop */0
 #endif /* STATS */
 
+#ifndef EXT2_BULK_MAX_BLOCKS
+#define EXT2_BULK_MAX_BLOCKS 128
+#endif /* EXT2_BULK_MAX_BLOCKS */
+
 static void disk_cache_info_free_push (struct disk_cache_info *p);
 
 #define FREE_PAGE_BUFS 24
 
+
 /* Returns a single page page-aligned buffer.  */
 static void *
 get_page_buf (void)
@@ -336,7 +341,6 @@ pending_blocks_write (struct pending_blocks *pb)
 	return err;
       else if (amount != length)
 	return EIO;
-
       pb->offs += length;
       pb->num = 0;
     }
@@ -378,6 +382,128 @@ pending_blocks_add (struct pending_blocks *pb, block_t block)
   pb->num++;
   return 0;
 }
+
+/* Bulk write across [offset, offset + length).  We coalesce strictly
+   consecutive blocks into a single device write.  The run ends on the
+   first non-consecutive block or when we hit the configured cap.
+   We report partial progress via *written (page-aligned).  */
+static error_t
+file_pager_write_pages (struct node *node,
+			vm_offset_t offset,
+			vm_address_t buf,
+			vm_size_t length,
+			vm_size_t *written)
+{
+  error_t err = 0;
+  vm_size_t done = 0;	/* bytes successfully issued to the store */
+  pthread_rwlock_t *lock = &diskfs_node_disknode (node)->alloc_lock;
+  const unsigned max_blocks = EXT2_BULK_MAX_BLOCKS;
+
+  if (written)
+    *written = 0;
+
+  while (done < length)
+    {
+      /* Acquire the same lock the per-page path uses.  We keep it
+	 for the short time we enumerate a *single* run and flush it.  */
+      pthread_rwlock_rdlock (lock);
+
+      /* Recompute clipping against allocsize each iteration.  */
+      vm_size_t left = length - done;
+      if (offset + done >= node->allocsize)
+	{
+	  pthread_rwlock_unlock (lock);
+	  break;
+	}
+      if (offset + done + left > node->allocsize)
+	left = node->allocsize - (offset + done);
+      if (left == 0)
+	{
+	  pthread_rwlock_unlock (lock);
+	  break;
+	}
+
+      /* Build one run of strictly consecutive on-disk blocks.  */
+      struct pending_blocks pb;
+      vm_size_t built = 0;
+      vm_size_t blocks_built = 0;
+      block_t prev = 0, blk = 0;
+      int have_prev = 0;
+
+      pending_blocks_init (&pb, (void *) (buf + done));
+
+      while (blocks_built < max_blocks && built < left)
+	{
+	  error_t ferr = find_block (node, offset + done + built, &blk, &lock);
+	  if (ferr)
+	    {
+	      err = ferr;
+	      break;
+	    }
+
+	  assert_backtrace (blk);
+
+	  if (have_prev && blk != prev + 1)
+	    break;	/* non-consecutive block: end of this coalesced run */
+
+	  /* Extend the run by one block.  */
+	  ferr = pending_blocks_add (&pb, blk);
+	  if (ferr)
+	    {
+	      err = ferr;
+	      break;
+	    }
+
+	  prev = blk;
+	  have_prev = 1;
+	  built += block_size;
+	  blocks_built++;
+	}
+
+      /* Flush exactly one coalesced run; even if the loop broke early,
+	 we may have a valid prefix to push.  */
+      error_t werr = pending_blocks_write (&pb);
+      if (!err)
+	err = werr;
+
+      pthread_rwlock_unlock (lock);
+
+      /* Advance only by what we actually enumerated and flushed.  */
+      done += built;
+
+      /* Stop on error or if we could not make any progress.  */
+      if (err || built == 0)
+	break;
+    }
+
+  if (written)
+    {
+      vm_size_t w = done;
+      if (w > length)
+	w = length;
+      /* libpager expects progress aligned to whole pages.  */
+      w -= (w % vm_page_size);
+      *written = w;
+    }
+
+  return err;
+}
+
+/* Strong override: only FILE_DATA uses bulk; others keep the per-page path.  */
+error_t
+pager_write_pages (struct user_pager_info *pager,
+		   vm_offset_t offset,
+		   vm_address_t data,
+		   vm_size_t length,
+		   vm_size_t *written)
+{
+  /* On EOPNOTSUPP, libpager hands the pages off to pager_write_page.  */
+  if (pager->type != FILE_DATA)
+    return EOPNOTSUPP;
+
+  return file_pager_write_pages (pager->node, offset, data, length, written);
+}
+
 /* Write one page for the pager backing NODE, at OFFSET, into BUF.
    This may need to write several filesystem blocks to satisfy one page, and tries
diff --git a/libpager/Makefile b/libpager/Makefile
index 06fcb96b..169d5ab1 100644
--- a/libpager/Makefile
+++ b/libpager/Makefile
@@ -24,7 +24,7 @@ SRCS = data-request.c data-return.c data-unlock.c pager-port.c \
 	pager-create.c pager-flush.c pager-shutdown.c pager-sync.c \
 	stubs.c demuxer.c chg-compl.c pager-attr.c clean.c \
 	dropweak.c get-upi.c pager-memcpy.c pager-return.c \
-	offer-page.c pager-ro-port.c
+	offer-page.c pager-ro-port.c pager-bulk.c
 
 installhdrs = pager.h
 
 HURDLIBS= ports
diff --git a/libpager/data-return.c b/libpager/data-return.c
index a69a2c5c..7d089328 100644
--- a/libpager/data-return.c
+++ b/libpager/data-return.c
@@ -21,6 +21,7 @@
 #include <string.h>
 #include <assert-backtrace.h>
 
+
 /* Worker function used by _pager_S_memory_object_data_return
    and _pager_S_memory_object_data_initialize.  All args are
    as for _pager_S_memory_object_data_return; the additional
@@ -158,17 +159,57 @@ _pager_do_write_request (struct pager *p,
   /* Let someone else in.  */
   pthread_mutex_unlock (&p->interlock);
 
-  /* This is inefficient; we should send all the pages to the device at once
-     but until the pager library interface is changed, this will have to do. */
+  int i_page = 0;
+  while (i_page < npages)
+    {
+      if (omitdata & (1U << i_page))
+	{
+	  i_page++;
+	  continue;
+	}
+
+      /* Find the maximal contiguous run [i_page, j_page) with no omitdata.  */
+      int j_page = i_page + 1;
+      while (j_page < npages && ! (omitdata & (1U << j_page)))
+	j_page++;
+
+      vm_offset_t run_off = offset + (vm_page_size * i_page);
+      vm_address_t run_ptr = data + (vm_page_size * i_page);
+      vm_size_t run_len = vm_page_size * (j_page - i_page);
+
+      vm_size_t wrote = 0;
+
+      /* Attempt a bulk write.  */
+      error_t berr = pager_write_pages (p->upi, run_off, run_ptr,
+					run_len, &wrote);
+
+      /* How many pages did bulk actually complete?  (Only if not EOPNOTSUPP.)  */
+      int pages_done = 0;
+      if (berr != EOPNOTSUPP)
+	{
+	  if (wrote > run_len)
+	    wrote = run_len;
+	  wrote -= (wrote % vm_page_size);
+	  pages_done = wrote / vm_page_size;
+	}
+
+      /* Mark the successful prefix (if any).  */
+      for (int k = 0; k < pages_done; k++)
+	pagerrs[i_page + k] = 0;
+
+      /* Write the remaining suffix of the run per-page, or the whole run
+	 if bulk is unsupported.  */
+      for (int k = i_page + pages_done; k < j_page; k++)
+	pagerrs[k] = pager_write_page (p->upi,
+				       offset + (vm_page_size * k),
+				       data + (vm_page_size * k));
+
+      i_page = j_page;
+    }
 
-  for (i = 0; i < npages; i++)
-    if (!(omitdata & (1U << i)))
-      pagerrs[i] = pager_write_page (p->upi,
-				     offset + (vm_page_size * i),
-				     data + (vm_page_size * i));
   /* Acquire the right to meddle with the pagemap */
   pthread_mutex_lock (&p->interlock);
+
   _pager_pagemap_resize (p, offset + length);
 
   pm_entries = &p->pagemap[offset / __vm_page_size];
@@ -244,7 +285,6 @@ _pager_do_write_request (struct pager *p,
 	  pthread_mutex_unlock (&p->interlock);
 	}
     }
-
   return 0;
 
 release_out:
@@ -265,3 +305,4 @@ _pager_S_memory_object_data_return (struct pager *p,
   return _pager_do_write_request (p, control, offset, data, length,
 				  dirty, kcopy, 0);
 }
+
diff --git a/libpager/pager-bulk.c b/libpager/pager-bulk.c
new file mode 100644
index 00000000..fcceeb43
--- /dev/null
+++ b/libpager/pager-bulk.c
@@ -0,0 +1,37 @@
+/* pager-bulk.c - Default (dummy) implementation of bulk page write.
+
+   Copyright (C) 2025 Free Software Foundation, Inc.
+   Written by Milos Nikic.
+
+   This file is part of the GNU Hurd.
+
+   The GNU Hurd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   The GNU Hurd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with the GNU Hurd; if not, see <https://www.gnu.org/licenses/>.  */
+
+#include <libpager/pager.h>
+#include "priv.h"
+
+/* Default dummy implementation of pager_write_pages.  */
+__attribute__((weak)) error_t
+pager_write_pages (struct user_pager_info *upi,
+		   vm_offset_t offset,
+		   vm_address_t data, vm_size_t length, vm_size_t *written)
+{
+  (void) upi;
+  (void) offset;
+  (void) data;
+  (void) length;
+  if (written)
+    *written = 0;
+  return EOPNOTSUPP;
+}
diff --git a/libpager/pager.h b/libpager/pager.h
index 3b1c7251..8c43ad0e 100644
--- a/libpager/pager.h
+++ b/libpager/pager.h
@@ -203,6 +203,16 @@
 pager_write_page (struct user_pager_info *pager,
 		  vm_offset_t page,
 		  vm_address_t buf);
 
+/* The user may define this function.  For pager PAGER, synchronously
+   write potentially multiple pages from DATA at OFFSET.
+   Do not deallocate DATA, and do not keep any references to DATA.
+   The only permissible error returns are EIO, EDQUOT, EOPNOTSUPP, and ENOSPC.  */
+error_t pager_write_pages (struct user_pager_info *upi,
+			   vm_offset_t offset,
+			   vm_address_t data,
+			   vm_size_t length,
+			   vm_size_t *written);
+
 /* The user must define this function.  A page should be made writable.  */
 error_t pager_unlock_page (struct user_pager_info *pager,
-- 
2.50.1