Hi all,

While working on journaling (looking for a good “sync” point to advance the
journal tail), I hit an obvious inefficiency: libpager always feeds the
filesystem pagers one page at a time, even for very large files.

This patch introduces an optional bulk write path in libpager and
implements it in ext2fs. (This patch is standalone and is not part of the
journaling work in any way.)

What changed

libpager: add a pager_write_pages() hook and have data-return.c coalesce
contiguous dirty pages into runs. The bulk hook is tried first; if it is
unsupported or completes only partially, the remaining pages fall back to
the existing per-page path (no behavior change for filesystems that don't
opt in).
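
For reference, the new hook's declaration (from the pager.h hunk below) is:

    error_t pager_write_pages (struct user_pager_info *upi,
                               vm_offset_t offset,
                               vm_address_t data,
                               vm_size_t length,
                               vm_size_t *written);

LENGTH is always a whole number of pages, and *WRITTEN reports how many
bytes (rounded down to a page boundary) the filesystem actually flushed.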

ext2fs: implement pager_write_pages() via a small, chunked loop (2
filesystem blocks per chunk) under alloc_lock, flushing each chunk before
yielding and continuing. This reduces lock/unlock churn and batches device
writes while avoiding starvation.

Backwards compatibility is preserved: filesystems that don’t implement the
new hook continue to receive per-page calls exactly as before.
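
The mechanism behind this is simply a weak default in libpager that declines
the bulk path (condensed from pager-bulk.c below); a filesystem opts in by
providing a strong definition, as ext2fs does:

    __attribute__((weak)) error_t
    pager_write_pages (struct user_pager_info *upi,
                       vm_offset_t offset,
                       vm_address_t data, vm_size_t length, vm_size_t *written)
    {
      /* Nothing to do: report no progress and let the caller go per-page.  */
      if (written)
        *written = 0;
      return EOPNOTSUPP;
    }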

Why (numbers)

On a clean boot, writing a 128 MiB file:

Before

libpager: write_page calls: 32 984
libpager: write_pages calls: 0
ext2fs: store_write calls: 32 877, bytes: 134 664 192, avg: 4096.0 B

After

libpager: write_page calls: 79
libpager: write_pages calls: 1150
ext2fs: store_write calls: 16 510, bytes: 134 258 688, avg: 8132.0 B

So device writes are roughly halved and average write size nearly doubled,
with identical correctness.

Safety / locking

   - Each chunk enumerates blocks and calls store_write() while holding
     alloc_lock (read), just like the per-page path.
   - After each chunk we briefly unlock and yield before re-acquiring the
     lock and re-checking allocsize; we do not pre-enumerate across an
     unlock, which avoids starvation under writer-preferential rwlocks.
   - On partial errors we report partial progress (*written, page-aligned)
     and libpager handles the tail per-page.
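
Condensed, the locking pattern in file_pager_write_pages() looks like this
(flush_one_chunk () is just a stand-in for the inline find_block () /
pending_blocks_write () loop in the actual patch):

    pthread_rwlock_rdlock (lock);
    while (left > 0)
      {
        /* Enumerate and flush at most EXT2_BULK_CHUNK_BLOCKS blocks and
           advance done/left, all while still holding the rdlock.  */
        err = flush_one_chunk ();   /* stand-in for the inline loop */
        if (err || left == 0)
          break;

        /* Let queued writers in, then re-validate the file size.  */
        pthread_rwlock_unlock (lock);
        sched_yield ();
        pthread_rwlock_rdlock (lock);
        /* Re-check allocsize in case of truncate/grow.  */
      }
    pthread_rwlock_unlock (lock);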

Testing

   - Booted with the newly built ext2fs.static; I now use it daily.
   - Ran various stress tests (git rebase/checkout, chaos scripts, Hurd
     builds, power-cut simulations).
   - No stalls with 2-block chunks; larger chunks can starve writers, so
     the ext2fs default is 2.

Scope

Only FILE_DATA uses pager_write_pages(). DISK (swap, etc.) remains per-page
for now.
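
Concretely, the strong override in ext2fs/pager.c just declines for
non-file pagers, so they keep the weak default's behavior (quoted, lightly
condensed, from the patch):

    error_t
    pager_write_pages (struct user_pager_info *pager,
                       vm_offset_t offset, vm_address_t data,
                       vm_size_t length, vm_size_t *written)
    {
      /* Non-file pagers decline; libpager falls back to pager_write_page.  */
      if (pager->type != FILE_DATA)
        return EOPNOTSUPP;

      return file_pager_write_pages (pager->node, offset, data, length, written);
    }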

If this approach looks good, I can follow up with:

   - an adaptive chunk size (back off when the rdlock is contended),
   - optional support for other filesystems, and
   - a DISK bulk path if we decide it's worthwhile.


Thanks for taking a look!

Milos
From 886ed2a3bfdcd16f8491fa58472972a7f48de943 Mon Sep 17 00:00:00 2001
From: Milos Nikic <nikic.mi...@gmail.com>
Date: Mon, 25 Aug 2025 21:38:08 +0100
Subject: [PATCH] Add bulk page write to libpager and ext2fs

---
 ext2fs/pager.c         | 116 +++++++++++++++++++++++++++++++++++++++++
 libpager/Makefile      |   2 +-
 libpager/data-return.c |  57 +++++++++++++++++---
 libpager/pager-bulk.c  |  37 +++++++++++++
 libpager/pager.h       |  10 ++++
 5 files changed, 213 insertions(+), 9 deletions(-)
 create mode 100644 libpager/pager-bulk.c

diff --git a/ext2fs/pager.c b/ext2fs/pager.c
index c55107a9..3ac38d09 100644
--- a/ext2fs/pager.c
+++ b/ext2fs/pager.c
@@ -88,6 +88,7 @@ disk_cache_info_free_push (struct disk_cache_info *p);
 
 #define FREE_PAGE_BUFS 24
 
+
 /* Returns a single page page-aligned buffer.  */
 static void *
 get_page_buf (void)
@@ -378,6 +379,121 @@ pending_blocks_add (struct pending_blocks *pb, block_t block)
   pb->num++;
   return 0;
 }
+
+/* Keep per-chunk work small to avoid starving writers on alloc_lock. */
+#define EXT2_BULK_CHUNK_BLOCKS  2  /* 2 * block_size per flush.  */
+
+static error_t
+file_pager_write_pages (struct node *node,
+                        vm_offset_t offset,
+                        vm_address_t buf,
+                        vm_size_t length,
+                        vm_size_t *written)
+{
+  error_t err = 0;
+  vm_size_t done = 0;   /* bytes successfully enumerated & flushed */
+  vm_size_t left;
+  pthread_rwlock_t *lock = &diskfs_node_disknode (node)->alloc_lock;
+
+  if (written)
+    *written = 0;
+
+  pthread_rwlock_rdlock (lock);
+
+  /* Clip to allocsize (same as per-page path). */
+  left = length;
+  if (offset >= node->allocsize)
+    left = 0;
+  else if (offset + left > node->allocsize)
+    left = node->allocsize - offset;
+
+  while (left > 0)
+    {
+      struct pending_blocks pb;
+      vm_size_t blocks_this = EXT2_BULK_CHUNK_BLOCKS;
+      vm_size_t built = 0;
+
+      /* Bound by remaining bytes; keep to whole blocks. */
+      if (blocks_this * (vm_size_t) block_size > left)
+        blocks_this = left / block_size;
+      if (blocks_this == 0)
+        break;
+
+      pending_blocks_init (&pb, (void *) (buf + done));
+
+      /* Enumerate up to blocks_this blocks. */
+      for (vm_size_t b = 0; b < blocks_this; b++)
+        {
+          block_t block;
+
+          err = find_block (node, offset + done + built, &block, &lock);
+          if (err)
+            break;
+
+          /* Per-page code asserts this too. */
+          assert_backtrace (block);
+
+          err = pending_blocks_add (&pb, block);
+          if (err)
+            break;
+
+          built += block_size;
+        }
+
+      /* Flush any accumulated data for this chunk. */
+      {
+        error_t werr = pending_blocks_write (&pb);
+        if (!err)
+          err = werr;
+      }
+
+      done += built;
+      left -= built;
+
+      if (err || left == 0)
+        break;
+
+      /* Briefly yield to let writers acquire alloc_lock if they are queued. */
+      pthread_rwlock_unlock (lock);
+      sched_yield();
+      pthread_rwlock_rdlock (lock);
+
+      /* Re-check allocsize in case of truncate/grow */
+      if (offset + done > node->allocsize)
+        break;
+      if (offset + done + left > node->allocsize)
+        left = node->allocsize - (offset + done);
+    }
+
+  pthread_rwlock_unlock (lock);
+
+  if (written)
+    {
+      vm_size_t w = done;
+      if (w > length)
+        w = length;
+      w -= (w % vm_page_size);
+      *written = w;
+    }
+
+  return err;
+}
+
+/* Strong override: only FILE_DATA uses bulk; others keep per-page path.  */
+error_t
+pager_write_pages (struct user_pager_info *pager,
+                   vm_offset_t offset,
+                   vm_address_t data,
+                   vm_size_t length,
+                   vm_size_t *written)
+{
+  /* Non-file pagers decline; libpager then falls back to pager_write_page.  */
+  if (pager->type != FILE_DATA)
+    return EOPNOTSUPP;
+
+  return file_pager_write_pages (pager->node, offset, data, length, written);
+}
+
 
 /* Write one page for the pager backing NODE, at OFFSET, into BUF.  This
    may need to write several filesystem blocks to satisfy one page, and tries
diff --git a/libpager/Makefile b/libpager/Makefile
index 06fcb96b..169d5ab1 100644
--- a/libpager/Makefile
+++ b/libpager/Makefile
@@ -24,7 +24,7 @@ SRCS = data-request.c data-return.c data-unlock.c pager-port.c \
 	pager-create.c pager-flush.c pager-shutdown.c pager-sync.c \
 	stubs.c demuxer.c chg-compl.c pager-attr.c clean.c \
 	dropweak.c get-upi.c pager-memcpy.c pager-return.c \
-	offer-page.c pager-ro-port.c
+	offer-page.c pager-ro-port.c pager-bulk.c
 installhdrs = pager.h
 
 HURDLIBS= ports
diff --git a/libpager/data-return.c b/libpager/data-return.c
index a69a2c5c..7d089328 100644
--- a/libpager/data-return.c
+++ b/libpager/data-return.c
@@ -21,6 +21,7 @@
 #include <string.h>
 #include <assert-backtrace.h>
 
+
 /* Worker function used by _pager_S_memory_object_data_return
    and _pager_S_memory_object_data_initialize.  All args are
    as for _pager_S_memory_object_data_return; the additional
@@ -158,17 +159,57 @@ _pager_do_write_request (struct pager *p,
   /* Let someone else in. */
   pthread_mutex_unlock (&p->interlock);
 
-  /* This is inefficient; we should send all the pages to the device at once
-     but until the pager library interface is changed, this will have to do. */
+  int i_page = 0;
+  while (i_page < npages)
+    {
+      if (omitdata & (1U << i_page))
+	{
+	  i_page++;
+	  continue;
+	}
+
+      /* Find maximal contiguous run [i_page, j_page) with no omitdata.  */
+      int j_page = i_page + 1;
+      while (j_page < npages && ! (omitdata & (1U << j_page)))
+	j_page++;
+
+      vm_offset_t run_off = offset + (vm_page_size * i_page);
+      vm_address_t run_ptr = data + (vm_page_size * i_page);
+      vm_size_t run_len = vm_page_size * (j_page - i_page);
+
+      vm_size_t wrote = 0;
+
+      /* Attempt bulk write.  */
+      error_t berr = pager_write_pages (p->upi, run_off, run_ptr,
+					run_len, &wrote);
+
+      /* How many pages did bulk actually complete? (only if not EOPNOTSUPP) */
+      int pages_done = 0;
+      if (berr != EOPNOTSUPP)
+	{
+	  if (wrote > run_len)
+	    wrote = run_len;
+	  wrote -= (wrote % vm_page_size);
+	  pages_done = wrote / vm_page_size;
+	}
+
+      /* Mark successful prefix (if any).  */
+      for (int k = 0; k < pages_done; k++)
+	pagerrs[i_page + k] = 0;
+
+      /* Per-page the remaining suffix of the run, or the whole run if unsupported.  */
+      for (int k = i_page + pages_done; k < j_page; k++)
+	  pagerrs[k] = pager_write_page (p->upi,
+					 offset + (vm_page_size * k),
+					 data + (vm_page_size * k));
+
+      i_page = j_page;
+    }
 
-  for (i = 0; i < npages; i++)
-    if (!(omitdata & (1U << i)))
-      pagerrs[i] = pager_write_page (p->upi,
-				     offset + (vm_page_size * i),
-				     data + (vm_page_size * i));
 
   /* Acquire the right to meddle with the pagemap */
   pthread_mutex_lock (&p->interlock);
+
   _pager_pagemap_resize (p, offset + length);
   pm_entries = &p->pagemap[offset / __vm_page_size];
 
@@ -244,7 +285,6 @@ _pager_do_write_request (struct pager *p,
 	  pthread_mutex_unlock (&p->interlock);
 	}
     }
-
   return 0;
 
  release_out:
@@ -265,3 +305,4 @@ _pager_S_memory_object_data_return (struct pager *p,
   return _pager_do_write_request (p, control, offset, data,
 				  length, dirty, kcopy, 0);
 }
+
diff --git a/libpager/pager-bulk.c b/libpager/pager-bulk.c
new file mode 100644
index 00000000..f1bec8d8
--- /dev/null
+++ b/libpager/pager-bulk.c
@@ -0,0 +1,37 @@
+/* pager-bulk.c - Default weak implementation of pager_write_pages
+  
+Copyright (C) 2025 Free Software Foundation, Inc.
+Written by Milos Nikic.
+
+This file is part of the GNU Hurd.
+  
+The GNU Hurd is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2, or (at your option)
+any later version.
+
+The GNU Hurd is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with the GNU Hurd; if not, see <https://www.gnu.org/licenses/>.  */
+
+#include <libpager/pager.h>
+#include "priv.h"
+
+/* Default dummy implementation of pager_write_pages. */
+__attribute__((weak)) error_t
+pager_write_pages (struct user_pager_info *upi,
+		   vm_offset_t offset,
+		   vm_address_t data, vm_size_t length, vm_size_t *written)
+{
+  (void) upi;
+  (void) offset;
+  (void) data;
+  (void) length;
+  if (written)
+    *written = 0;
+  return EOPNOTSUPP;
+}
diff --git a/libpager/pager.h b/libpager/pager.h
index 3b1c7251..8c43ad0e 100644
--- a/libpager/pager.h
+++ b/libpager/pager.h
@@ -203,6 +203,16 @@ pager_write_page (struct user_pager_info *pager,
 		  vm_offset_t page,
 		  vm_address_t buf);
 
+/* The user may define this function.  For pager PAGER, synchronously write
+   LENGTH bytes (a whole number of pages) from DATA starting at OFFSET and
+   report the page-aligned bytes completed in *WRITTEN.  Do not deallocate or
+   keep references to DATA.  Permissible errors: EIO, EDQUOT, EOPNOTSUPP, ENOSPC. */
+error_t pager_write_pages (struct user_pager_info *upi,
+                          vm_offset_t offset,
+                          vm_address_t data,
+                          vm_size_t length,
+                          vm_size_t *written);
+
 /* The user must define this function.  A page should be made writable. */
 error_t
 pager_unlock_page (struct user_pager_info *pager,
-- 
2.50.1
