Hey Samuel,

Thanks for the feedback.
Here is a revised patch.
I incorporated your suggestions: I removed the yield, now coalesce only
strictly consecutive blocks, and raised the maximum run length from 2 to
128 blocks per write in ext2.
I also experimented with values larger than 128, but it didn't improve
the results; 128 seems like a reasonable starting point.
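
In case it helps review, the run rule the patch implements boils down to
this predicate (an illustrative sketch only, not code from the patch;
run_can_extend is a made-up name, the other identifiers are the ones the
patch uses):

  /* Extend the current run only while we are under the cap and the
     next on-disk block is exactly prev + 1; anything else ends the
     run, which is then flushed with a single store_write while the
     rdlock is held.  */
  static int
  run_can_extend (int have_prev, block_t prev, block_t blk,
                  unsigned blocks_built)
  {
    return blocks_built < EXT2_BULK_MAX_BLOCKS
           && (!have_prev || blk == prev + 1);
  }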

Now the numbers look like this in my test (boot, then immediately create
a 128 MiB file and print the stats):

Before:
ext2fs store_write calls: ~32,874
avg write size: 4,096 B

After:
ext2fs store_write calls: ~1,300 (I only printed stats when the call
count was a multiple of 100)
avg write size: ≈103,474 B; that's ~25× larger writes, hence ~25× fewer
calls, ~25× fewer lock acquisitions and releases, ~25× fewer disk seeks,
etc.
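(Sanity check on those figures: 128 MiB = 134,217,728 B;
134,217,728 / 103,474 ≈ 1,297 writes, and 103,474 / 4,096 ≈ 25.3, so the
~1,300 calls and the ~25× numbers are consistent.)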

A clear win.
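
In case anyone wants to reproduce: the test amounts to roughly the
following (a minimal sketch; the file name is arbitrary, and my stats
printing is an ad-hoc counter in ext2fs/pager.c that is not part of this
patch):

  /* Write a 128 MiB file in 4 KiB chunks, then fsync so the pager
     flushes the dirty pages back to the store.  */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int
  main (void)
  {
    static char buf[4096];
    int fd = open ("/tmp/bigfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
      {
        perror ("open");
        return 1;
      }
    memset (buf, 0xab, sizeof buf);
    for (unsigned i = 0; i < (128u << 20) / sizeof buf; i++)
      if (write (fd, buf, sizeof buf) != (ssize_t) sizeof buf)
        {
          perror ("write");
          return 1;
        }
    fsync (fd);
    close (fd);
    return 0;
  }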

I would like to keep this patch small and, if possible, do the DISK
pager support separately (that has some extra wrinkles).
I would then add tmpfs support and the other filesystems in follow-up
patches.
Does that make sense?

Kind regards,
Milos

On Thu, Aug 28, 2025 at 2:35 PM Samuel Thibault <samuel.thiba...@gnu.org>
wrote:

> Hello,
>
> Thanks for working on this!
>
> Contiguous writes is indeed one of the things we want to coalesce.
> (we'd also want to prefetch coalesced reads, there's a whole gsoc
> project idea about it on the wiki)
>
> Milos Nikic, on Mon, 25 Aug 2025 15:00:27 -0700, wrote:
> > No stalls with 2-block chunks; larger chunks can starve writers, so
> > the ext2 default is 2.
>
> 2 looks like a very small aggregation...
>
> I'm thinking that you'd probably want to avoid pending_blocks_add call
> pending_blocks_write, i.e. check for consecutive block numbers yourself,
> and break the for(b) loop when it is not, i.e. use non-consecutive block
> allocations as unlock hints, leading to calling store_write exactly once
> per lock acquisition. You'd probably still want to set an upper limit on
> the number of consecutive blocks, but it'd probably be fine with larger
> values, avoiding to make a lot of store_write calls when block numbers
> are not consecutive at all.
>
> > Only FILE_DATA uses pager_write_pages(). DISK (swap, etc.) remains
> > per-page for now.
> >
> > If this approach looks good, I can follow up with:
> >
> >   • an adaptive chunk size (back off when the rdlock is contended),
>
> I don't know how you'd test that the rdlock is contended? Just using a
> tryrdlock to determine whether you can acquire it immediately or not?
>
> About the yield() call, does it actually help? thread wake ups will
> already try to switch to the awakened thread if dynamic priorities tell
> so ; yield() won't do more that this.
>
> >   • optional support for other filesystems,
>
> Why optional?
>
> >   • a DISK bulk path if we decide it’s worthwhile.
>
> I don't think it can be not worthwhile?
>
>
> > @@ -378,6 +379,121 @@ pending_blocks_add (struct pending_blocks *pb, block_t block)
> >    pb->num++;
> >    return 0;
> >  }
> > +
> > +/* Keep per-chunk work small to avoid starving writers on alloc_lock. */
> > +#define EXT2_BULK_CHUNK_BLOCKS  2  /* 2 * block_size per flush; */
>
> Rather put it at the top of the file for better visibility.
>
>
> > +  if (offset >= node->allocsize)
> > +    left = 0;
> > +  else if (offset + left > node->allocsize)
> > +    left = node->allocsize - offset;
>
> Better put these at the top of the while loop so it's factorized.
>
> Samuel
>
From af5cbb943d03c79e1713edcc045ecfa680007fc9 Mon Sep 17 00:00:00 2001
From: Milos Nikic <nikic.mi...@gmail.com>
Date: Mon, 25 Aug 2025 21:38:08 +0100
Subject: [PATCH] libpager, ext2fs: add bulk page write support

Introduce a new pager_write_pages() entry point, allowing filesystems
to implement multi-page writes in one call.  libpager will attempt to
coalesce contiguous dirty pages and call this bulk interface; if it is
unsupported or only partially completes, the remaining pages are
handled by the existing per-page path.

ext2fs now provides file_pager_write_pages(), which, under alloc_lock,
coalesces runs of strictly consecutive on-disk blocks (up to 128 per
run) into a single store_write() call.  Each run is flushed exactly
once per lock acquisition, reducing lock/unlock cycles and batching
store_write() calls.

In testing on a 128 MiB file, the average store_write size improved
from ~4 KiB to ~103 KiB, cutting device write calls by roughly 25x.
System stress tests all pass without stalls or data corruption.

Other filesystems may continue using pager_write_page() unchanged.
---
 ext2fs/pager.c         | 126 +++++++++++++++++++++++++++++++++++++++++
 libpager/Makefile      |   2 +-
 libpager/data-return.c |  53 ++++++++++++++++--
 libpager/pager-bulk.c  |  37 ++++++++++++
 libpager/pager.h       |  10 ++++
 5 files changed, 220 insertions(+), 8 deletions(-)
 create mode 100644 libpager/pager-bulk.c

diff --git a/ext2fs/pager.c b/ext2fs/pager.c
index c55107a9..a47fb381 100644
--- a/ext2fs/pager.c
+++ b/ext2fs/pager.c
@@ -83,11 +83,15 @@ do { pthread_spin_lock (&ext2s_pager_stats.lock);			      \
 #define STAT_INC(field) /* nop */0
 #endif /* STATS */
 
+#ifndef EXT2_BULK_MAX_BLOCKS
+#define EXT2_BULK_MAX_BLOCKS 128
+#endif  /* EXT2_BULK_MAX_BLOCKS */
+
 static void
 disk_cache_info_free_push (struct disk_cache_info *p);
 
 #define FREE_PAGE_BUFS 24
 
 /* Returns a single page page-aligned buffer.  */
 static void *
 get_page_buf (void)
@@ -378,6 +382,128 @@ pending_blocks_add (struct pending_blocks *pb, block_t block)
   pb->num++;
   return 0;
 }
+
+/* Bulk write across [offset, offset + length).  We coalesce strictly
+   consecutive blocks into a single device write.  The run ends on the
+   first non-consecutive block or when we hit the configured cap.
+   We report partial progress via *written (page-aligned).  */
+static error_t
+file_pager_write_pages (struct node *node,
+                        vm_offset_t offset,
+                        vm_address_t buf,
+                        vm_size_t length,
+                        vm_size_t *written)
+{
+  error_t err = 0;
+  vm_size_t done = 0;   /* bytes successfully issued to the store */
+  pthread_rwlock_t *lock = &diskfs_node_disknode (node)->alloc_lock;
+  const unsigned max_blocks = EXT2_BULK_MAX_BLOCKS;
+
+  if (written)
+    *written = 0;
+
+  while (done < length)
+    {
+      /* Acquire the same lock the per-page path uses.  We keep it
+         for the short time we enumerate a *single* run and flush it.  */
+      pthread_rwlock_rdlock (lock);
+
+      /* Recompute clipping against allocsize each iteration.  */
+      vm_size_t left = length - done;
+      if (offset + done >= node->allocsize)
+        {
+          pthread_rwlock_unlock (lock);
+          break;
+        }
+      if (offset + done + left > node->allocsize)
+        left = node->allocsize - (offset + done);
+      if (left == 0)
+        {
+          pthread_rwlock_unlock (lock);
+          break;
+        }
+
+      /* Build one run of strictly consecutive on-disk blocks.  */
+      struct pending_blocks pb;
+      vm_size_t built = 0;
+      vm_size_t blocks_built = 0;
+      block_t prev = 0, blk = 0;
+      int have_prev = 0;
+
+      pending_blocks_init (&pb, (void *) (buf + done));
+
+      while (blocks_built < max_blocks && built < left)
+        {
+          error_t ferr = find_block (node, offset + done + built, &blk, &lock);
+          if (ferr)
+            {
+              err = ferr;
+              break;
+            }
+
+          assert_backtrace (blk);
+
+          if (have_prev && blk != prev + 1)
+            break; /* non-consecutive block: end of this coalesced run */
+
+          /* Extend the run by one block.  */
+          ferr = pending_blocks_add (&pb, blk);
+          if (ferr)
+            {
+              err = ferr;
+              break;
+            }
+
+          prev = blk;
+          have_prev = 1;
+          built += block_size;
+          blocks_built++;
+        }
+
+      /* Flush exactly one coalesced run; even if the loop broke early,
+         we may have a valid prefix to push.  */
+      error_t werr = pending_blocks_write (&pb);
+      if (!err)
+        err = werr;
+
+      pthread_rwlock_unlock (lock);
+
+      /* Advance only past data we actually flushed successfully.  */
+      if (!werr)
+        done += built;
+      /* Stop on error or if we could not make any progress.  */
+      if (err || built == 0)
+        break;
+    }
+
+  if (written)
+    {
+      vm_size_t w = done;
+      if (w > length)
+        w = length;
+      /* libpager expects progress aligned to whole pages.  */
+      w -= (w % vm_page_size);
+      *written = w;
+    }
+
+  return err;
+}
+
+/* Strong override of libpager's weak stub; only FILE_DATA uses the bulk path.  */
+error_t
+pager_write_pages (struct user_pager_info *pager,
+                   vm_offset_t offset,
+                   vm_address_t data,
+                   vm_size_t length,
+                   vm_size_t *written)
+{
+  /* Returning EOPNOTSUPP makes libpager fall back to pager_write_page.  */
+  if (pager->type != FILE_DATA)
+    return EOPNOTSUPP;
+
+  return file_pager_write_pages (pager->node, offset, data, length, written);
+}
+
 
 /* Write one page for the pager backing NODE, at OFFSET, into BUF.  This
    may need to write several filesystem blocks to satisfy one page, and tries
diff --git a/libpager/Makefile b/libpager/Makefile
index 06fcb96b..169d5ab1 100644
--- a/libpager/Makefile
+++ b/libpager/Makefile
@@ -24,7 +24,7 @@ SRCS = data-request.c data-return.c data-unlock.c pager-port.c \
 	pager-create.c pager-flush.c pager-shutdown.c pager-sync.c \
 	stubs.c demuxer.c chg-compl.c pager-attr.c clean.c \
 	dropweak.c get-upi.c pager-memcpy.c pager-return.c \
-	offer-page.c pager-ro-port.c
+	offer-page.c pager-ro-port.c pager-bulk.c
 installhdrs = pager.h
 
 HURDLIBS= ports
diff --git a/libpager/data-return.c b/libpager/data-return.c
index a69a2c5c..7d089328 100644
--- a/libpager/data-return.c
+++ b/libpager/data-return.c
@@ -158,17 +158,56 @@ _pager_do_write_request (struct pager *p,
   /* Let someone else in. */
   pthread_mutex_unlock (&p->interlock);
 
-  /* This is inefficient; we should send all the pages to the device at once
-     but until the pager library interface is changed, this will have to do. */
+  int i_page = 0;
+  while (i_page < npages)
+    {
+      if (omitdata & (1U << i_page))
+	{
+	  i_page++;
+	  continue;
+	}
+
+      /* Find maximal contiguous run [i_page, j_page) with no omitdata.  */
+      int j_page = i_page + 1;
+      while (j_page < npages && ! (omitdata & (1U << j_page)))
+	j_page++;
+
+      vm_offset_t run_off = offset + (vm_page_size * i_page);
+      vm_address_t run_ptr = data + (vm_page_size * i_page);
+      vm_size_t run_len = vm_page_size * (j_page - i_page);
+
+      vm_size_t wrote = 0;
+
+      /* Attempt bulk write.  */
+      error_t berr = pager_write_pages (p->upi, run_off, run_ptr,
+					run_len, &wrote);
+
+      /* How many pages did the bulk path complete?  (None if EOPNOTSUPP.)  */
+      int pages_done = 0;
+      if (berr != EOPNOTSUPP)
+	{
+	  if (wrote > run_len)
+	    wrote = run_len;
+	  wrote -= (wrote % vm_page_size);
+	  pages_done = wrote / vm_page_size;
+	}
+
+      /* Mark successful prefix (if any).  */
+      for (int k = 0; k < pages_done; k++)
+	pagerrs[i_page + k] = 0;
+
+      /* Per-page fallback for the rest of the run (all of it if unsupported).  */
+      for (int k = i_page + pages_done; k < j_page; k++)
+	pagerrs[k] = pager_write_page (p->upi,
+				       offset + (vm_page_size * k),
+				       data + (vm_page_size * k));
+
+      i_page = j_page;
+    }
 
-  for (i = 0; i < npages; i++)
-    if (!(omitdata & (1U << i)))
-      pagerrs[i] = pager_write_page (p->upi,
-				     offset + (vm_page_size * i),
-				     data + (vm_page_size * i));
 
   /* Acquire the right to meddle with the pagemap */
   pthread_mutex_lock (&p->interlock);
   _pager_pagemap_resize (p, offset + length);
   pm_entries = &p->pagemap[offset / __vm_page_size];
 
diff --git a/libpager/pager-bulk.c b/libpager/pager-bulk.c
new file mode 100644
index 00000000..fcceeb43
--- /dev/null
+++ b/libpager/pager-bulk.c
@@ -0,0 +1,37 @@
+/* pager-bulk.c - Default (dummy) implementation of bulk page write.
+
+Copyright (C) 2025 Free Software Foundation, Inc.
+Written by Milos Nikic.
+
+This file is part of the GNU Hurd.
+
+The GNU Hurd is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2, or (at your option)
+any later version.
+
+The GNU Hurd is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with the GNU Hurd; if not, see <https://www.gnu.org/licenses/>.  */
+
+#include "pager.h"
+#include "priv.h"
+
+/* Weak default: report EOPNOTSUPP so libpager falls back to pager_write_page.  */
+__attribute__((weak)) error_t
+pager_write_pages (struct user_pager_info *upi,
+		   vm_offset_t offset,
+		   vm_address_t data, vm_size_t length, vm_size_t *written)
+{
+  (void) upi;
+  (void) offset;
+  (void) data;
+  (void) length;
+  if (written)
+    *written = 0;
+  return EOPNOTSUPP;
+}
diff --git a/libpager/pager.h b/libpager/pager.h
index 3b1c7251..8c43ad0e 100644
--- a/libpager/pager.h
+++ b/libpager/pager.h
@@ -203,6 +203,16 @@ pager_write_page (struct user_pager_info *pager,
 		  vm_offset_t page,
 		  vm_address_t buf);
 
+/* The user may define this function.  For pager PAGER, synchronously
+   write LENGTH bytes (whole pages) from DATA at offset OFFSET, setting
+   *WRITTEN to the bytes written, rounded down to a page multiple.  Do
+   not deallocate DATA, and do not keep any references to it.  Return
+   EOPNOTSUPP to fall back to the per-page path; the only other
+   permissible errors are EIO, EDQUOT, and ENOSPC.  */
+error_t
+pager_write_pages (struct user_pager_info *pager,
+		   vm_offset_t offset, vm_address_t data,
+		   vm_size_t length, vm_size_t *written);
+
 /* The user must define this function.  A page should be made writable. */
 error_t
 pager_unlock_page (struct user_pager_info *pager,
-- 
2.50.1
