[RFC PATCH] document the experimental libdiskfs journal

jbra...@dismail.de Tue, 12 Aug 2025 06:57:25 -0700

* hurd/libdiskfs.mdwn: add a short summary paragraph.
* hurd/libdiskfs/journal.mdwn: new file.
---
 hurd/libdiskfs.mdwn         |  10 +-
 hurd/libdiskfs/journal.mdwn | 238 ++++++++++++++++++++++++++++++++++++
 2 files changed, 247 insertions(+), 1 deletion(-)
 create mode 100644 hurd/libdiskfs/journal.mdwn
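
A minimal sketch of the offset arithmetic used in the "How to use this
metadata journal" section of journal.mdwn, assuming the example values from
that section (partition start 1000341504, block count 1035776) and a 4 KiB
block size. The variable names are illustrative and are not taken from the
attached `journal-hint.sh`; the script only prints numbers and does not
touch any image.

```shell
#!/bin/sh
# Example values from journal.mdwn; substitute your own.
part_start=1000341504                # byte offset of the ext2 partition (from parted)
block_count=1035776                  # "Block count" reported by tune2fs -l
block_size=4096                      # ext2 block size in bytes (assumed 4 KiB)
journal_bytes=$((8 * 1024 * 1024))   # 8 MiB reserved for the journal

# Number of ext2 blocks to give up, and the block count to pass to resize2fs.
journal_blocks=$((journal_bytes / block_size))
new_block_count=$((block_count - journal_blocks))

# The superblock starts 1024 bytes into the partition; the magic sits at
# offset 0x38 within it and the journaling hint at offset 264.
superblock=$((part_start + 1024))
magic_offset=$((superblock + 0x38))
hint_offset=$((superblock + 264))

echo "journal blocks:  $journal_blocks"
echo "new block count: $new_block_count"
echo "magic offset:    $magic_offset"
echo "hint offset:     $hint_offset"
```

With the example values this gives 2048 journal blocks and a new block
count of 1033728, the same figures the document derives by hand.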


diff --git a/hurd/libdiskfs.mdwn b/hurd/libdiskfs.mdwn
index dd499785..c939905b 100644
--- a/hurd/libdiskfs.mdwn
+++ b/hurd/libdiskfs.mdwn
@@ -1,4 +1,4 @@
-[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
+[[!meta copyright="Copyright © 2011, 2025 Free Software Foundation, Inc."]]
 
 [[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
 id="license" text="Permission is granted to copy, distribute and/or modify this
@@ -8,6 +8,14 @@ Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
 is included in the section entitled [[GNU Free Documentation
 License|/fdl]]."]]"""]]
 
+Hurd developers use `libdiskfs` to write filesystems like
+[[translator/ext2fs]] and [[translator/fatfs]].  `libdiskfs` does
+suffer from [[locking
+issues|community/gsoc/project_ideas/libdiskfs_locking]].  In the
+summer of 2025, Milos Nikic began adding a metadata
+[[libdiskfs/journal]]. So far one can only use the journal for ext2fs.
+It is not compatible with ext3 or ext4's journal.
+
 
 # Paging
 
diff --git a/hurd/libdiskfs/journal.mdwn b/hurd/libdiskfs/journal.mdwn
new file mode 100644
index 00000000..f2bf70f5
--- /dev/null
+++ b/hurd/libdiskfs/journal.mdwn
@@ -0,0 +1,238 @@
+[[!meta copyright="Copyright © 2025 Free Software Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+In the summer of 2025, Milos Nikic began working on a metadata
+journaling subsystem for libdiskfs, which he started using with
+ext2fs. His prototype journal stores metadata changes to raw disk
+space outside of the ext2 filesystem but within the same partition.
+On boot, before fsck runs, the journal is replayed to fix
+inconsistencies. This journal should fix most issues that hard
+shutdowns cause. The ASCII art below shows the on-disk layout.
+
+      |-------------+-------------+-------------|
+      | partition 1 | partition 2 | partition 3 |
+      |-------------+-------------+-------------|
+      | begin ext2  | begin ext2  |             |
+      | journal     | journal     |             |
+      | config data | config data |             |
+      |             |             |             |
+      | /           | /home       |    swap     |
+      |             |             |             |
+      | end ext2    | end ext2    |             |
+      |-------------+-------------|             |
+      | journal in  | journal in  |             |
+      | raw disk    | raw disk    |             |
+      | space. 8MiB | space. 8MiB |             |
+      |-------------+-------------+-------------|
+
+The journal is *not* a replacement for fsck, checksumming, ext4-style
+transactions, or a strong consistency guarantee. It’s a *best-effort*,
+*do-no-harm* crash-recovery helper that complements fsck by restoring
+metadata and paths opportunistically.  This journal is not compatible
+with ext3 or ext4's journal.
+
+The journaling subsystem writes metadata changes to a reserved raw
+disk area outside the ext2-managed region.  The location and size are
+discovered from `journal_hint` inside the ext2 superblock at boot.
+Entries are written in a compact binary format with CRC32 protection,
+stored in a circular buffer.  Early-boot replay reads the journal,
+validates entries, and applies the most recent consistent metadata
+state to the filesystem, including restoration of deleted or modified
+files and directories.  The subsystem has been stress-tested (git
+checkout, bulk deletions, crash/reboot loops) and successfully
+preserves and replays metadata.
+
+Currently, the journal restores `mtime`, `ctime`, `st_mode`, and
+flags, i.e. the metadata fields that can be restored without needing
+full path knowledge.
+
+The journaling system is structured around a single public entry
+point, `libdiskfs/journal.c` and `libdiskfs/journal.h`. All other
+components are internal to libdiskfs. Configuration data (offset,
+size, etc.) is stored in four reserved fields in the ext2 superblock.
+
+The journal captures all the major file system operations, but replay
+does not yet use all of them.
+
+## Design details
+
+* Two write modes:
+
+ * Sync (default): blocking write; caller waits for journal flush.
+
+ * Async fallback: used only if writing fails (e.g., file temporarily
+   unavailable); entries are queued and flushed later.
+
+* Journal file format:
+
+ * Ring buffer
+
+ * Magic/version checked
+
+ * CRC32-protected header and entries
+
+* Boot-time replay:
+
+ * During early boot, pread/write are unavailable. Instead, the replay
+   code uses `_diskfs_rdwr_internal` to safely read the journal.
+
+ * Memory use during replay is controlled via fixed-size arenas.
+
+* Replay logic:
+
+ * Parsed entries are sorted and deduplicated via a graph.
+
+ * Metadata is only restored if the journaled update is newer than the
+   current inode `mtime`, and the values differ. It uses strong
+   fingerprinting to prevent misapplying updates after inode reuse.
+
+ * Replay is dual-path: inode-based first, falling back to path-based
+   when needed.
+
+ * “Best effort” file recreation under `/restore/[timestamp]` with
+   correct metadata when files vanish after a crash.
+
+* Noise filtering:
+
+ * A hardcoded inode range excludes `/dev/random`, `/dev/null`, and other
+   noisy devices that would otherwise spam the journal.
+
+ * A dedicated policy module filters out noisy events (`/tmp`, build
+   outputs, etc.).
+
+*Two tricky problems took significant work:*
+
+   1. *Path recovery:* `cred->po->path` often gives useful file paths, but
+   sometimes needs sanitizing or is imprecise. Combined with the current
+   name, it’s often enough to reconstruct missing files. Replay now uses
+   path-based recovery when inode-based recovery fails.
+
+   2. *Aggressive inode reuse in ext2:* After deletion (say at fsck time, or
+   any time really) the same inode number may be reassigned to a completely
+   different file after reboot. Fingerprinting ensures we never apply stale
+   updates to the wrong file.
+
+## Testing & results
+
+- Survived repeated hard reboots under concurrent create/delete stress.
+
+- In chaos tests where fsck over-deleted files, journaling replay brought
+them back as expected.
+
+## Future work ideas
+
+- Better path preservation to improve replay accuracy.
+
+- Per-node timelines for smarter change grouping.
+
+- Integration with ext tooling to support formatting with journaling fields
+and an 8 MiB carve-out.
+
+- Exporting replay stats via a /proc-like interface.
+
+- Skipping metadata updates for files/directories matching patterns:
+
+  * Paths like `/.git/`, `/build/`, etc.
+
+  * Extensions like `.o`, `.a`, `~`, `.swp`
+
+  * Eventually user-configurable via a static list or user-supplied config.
+
+## How to use this metadata journal
+
+To use the journal, one must reserve an 8 MiB space outside the ext2
+filesystem but within its partition, and write the journaling hints
+into the ext2 superblock.
+
+This means the journal will live immediately after the point where
+ext2 ends on disk.
+
+1. Shrink the ext2 filesystem by 8 MiB
+
+We’ll work directly on the image, so make a backup first.
+First, find the ext2 partition start offset.
+
+               $ parted -sm debian-hurd.img unit B print
+
+Example output:
+
+       2:1000341504B:4194303999B:3193962496B:ext2::;
+
+The first number after `2:` is the byte offset where the ext2 partition
+starts (1000341504 here).
+
+Attach the ext2 partition as a loop device:
+
+               # losetup -o 1000341504 --show -f debian-hurd.img
+
+This prints something like `/dev/loop0` (use whatever it returns).
+
+Check the current block count (these are 4 KiB ext2 blocks):
+
+       # tune2fs -l /dev/loop0 | grep 'Block count'
+
+Example output:
+
+       Block count:              1035776
+
+Shrink by 8 MiB:
+
+    8 MiB = 8192 KiB → 8192 / 4 = 2048 ext2 blocks
+
+    New block count = 1035776 − 2048 = 1033728
+
+       # e2fsck -f /dev/loop0
+       # resize2fs /dev/loop0 1033728
+
+Answer yes to everything `e2fsck` asks.
+Replace `1033728` with your calculated value.
+
+Verify:
+
+    # tune2fs -l /dev/loop0 | grep 'Block count'
+
+The number should be exactly 2048 less than the original.
+
+Detach the loop device:
+
+       # losetup -d /dev/loop0
+
+2. Write the journaling hint to the superblock
+
+The ext2 superblock is 1024 bytes from the start of the partition.
+The journaling hint is at offset 264 bytes from the start of the superblock.
+
+You can first verify the ext2 magic (0xEF53, stored little-endian on disk) like so:
+
+       $ xxd -g1 -s $((1000342528 + 0x38)) -l 2 debian-hurd.img
+
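+The output must show the bytes `53 ef`. Here 1000342528 is the
+partition offset plus 1024, and 0x38 is the offset of the magic
+within the superblock.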
+
+Instead of doing all the byte math manually, use the attached script.
+
+Show the current hint:
+
+       $ ./journal-hint.sh debian-hurd.img show
+
+Enable the journaling hint:
+
+       $ ./journal-hint.sh debian-hurd.img on
+
+This assumes the journal lives in the last 8 MiB of partition 2, which
+is safe after the shrink.
+
+Disable the journaling hint:
+
+       $ ./journal-hint.sh debian-hurd.img off
+
+The script verifies ext2 magic before touching anything.
+If the magic doesn’t match, it bails to prevent corruption.
+
+Safety first: Always work on a copy of your disk image. If the script
+writes incorrect offsets, the low-level writer will overwrite whatever
+is there, potentially corrupting your system! Make sure the journal
+location is outside the filesystem by following the shrink procedure
+above.
+
+Status:
+
+* `debian-hurd-20230608.img` — tested and works great.
+* `debian-hurd-20250622.img` — tested and works great.
-- 
2.50.1
