Milos Nikic <nikic.mi...@gmail.com> writes:

> Thanks for the write up. It is great.
>
> Two minor details:
> 1) "Currently, the journal restores: `mtime`, `ctime`, `st_mode`, and
> +flags"
>
> There is also uid, gid, and author that should be on that list. That is also
> journaled and restored.
>
> And
> 2) +* Two write modes:
> +
> +  * Sync (default): blocking write; caller waits for journal flush.
> +
> +  * Async fallback: used only if writing fails (e.g., file temporarily
> +    unavailable); entries are queued and flushed later.
>
> Only sync is on now, async has been removed, so maybe we don't need to
> mention it.

Thanks I'll resend the patch!

> Thanks once more!
> MIlos
>
> On Tue, Aug 12, 2025 at 6:56 AM jbra...@dismail.de <jbra...@dismail.de> wrote:
>
> * hurd/libdiskfs.mdwn: add a short summary paragraph.
> * hurd/libdiskfs/journal.mdwn: new file.
> ---
>  hurd/libdiskfs.mdwn         |  10 +-
>  hurd/libdiskfs/journal.mdwn | 238 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 247 insertions(+), 1 deletion(-)
>  create mode 100644 hurd/libdiskfs/journal.mdwn
>
> diff --git a/hurd/libdiskfs.mdwn b/hurd/libdiskfs.mdwn
> index dd499785..c939905b 100644
> --- a/hurd/libdiskfs.mdwn
> +++ b/hurd/libdiskfs.mdwn
> @@ -1,4 +1,4 @@
> -[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
> +[[!meta copyright="Copyright © 2011, 2025 Free Software Foundation, Inc."]]
>
> [[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
> id="license" text="Permission is granted to copy, distribute and/or modify this
> @@ -8,6 +8,14 @@ Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
> is included in the section entitled [[GNU Free Documentation
> License|/fdl]]."]]"""]]
>
> +Hurd developers use `libdiskfs` to write filesystems like
> +[[translator/ext2fs]] and [[translator/fatfs]]. `libdiskfs` does
> +suffer from [[locking
> +issues|community/gsoc/project_ideas/libdiskfs_locking]]. In the
> +summer of 2025, Milos Nikic began adding a metadata
> +[[libdiskfs/journal]]. So far one can only use the journal for ext2fs.
> +It is not compatible with ext3 or ext4's journal.
> +
>
> # Paging
>
> diff --git a/hurd/libdiskfs/journal.mdwn b/hurd/libdiskfs/journal.mdwn
> new file mode 100644
> index 00000000..f2bf70f5
> --- /dev/null
> +++ b/hurd/libdiskfs/journal.mdwn
> @@ -0,0 +1,238 @@
> +[[!meta copyright="Copyright © 2025 Free Software Foundation, Inc."]]
> +
> +[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
> +id="license" text="Permission is granted to copy, distribute and/or modify this
> +document under the terms of the GNU Free Documentation License, Version 1.2 or
> +any later version published by the Free Software Foundation; with no Invariant
> +Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
> +is included in the section entitled [[GNU Free Documentation
> +License|/fdl]]."]]"""]]
> +
> +In the summer of 2025, Milos Nikic began working on a metadata
> +journaling subsystem for libdiskfs, which he started using with
> +ext2fs. His prototype journal stores metadata changes to raw disk
> +space outside of the ext2 filesystem but within the same partition.
> +On boot, before fsck runs, the journal is replayed to fix
> +inconsistencies. This journal should fix most issues that hard
> +shutdowns cause. Hopefully the ASCII art below is helpful.
> +
> +    |-------------+-------------+-------------|
> +    | partition 1 | partition 2 | partition 3 |
> +    |-------------+-------------+-------------|
> +    | begin ext2  | begin ext2  |             |
> +    | journal     | journal     |             |
> +    | config data | config data |             |
> +    |             |             |             |
> +    | /           | /home       | swap        |
> +    |             |             |             |
> +    | end ext2    | end ext2    |             |
> +    |-------------+-------------|             |
> +    | journal in  | journal in  |             |
> +    | raw disk    | raw disk    |             |
> +    | space. 8MiB | space. 8MiB |             |
> +    |-------------+-------------+-------------|
> +
> +The journal is *not* a replacement for fsck, checksumming, ext4-style
> +transactions, or a strong consistency guarantee. It’s a *best-effort*,
> +*do-no-harm* crash-recovery helper that complements fsck by restoring
> +metadata and paths opportunistically. This journal is not compatible
> +with ext3 or ext4's journal.
> +
> +The journaling subsystem writes metadata changes to a reserved raw
> +disk area outside the ext2-managed region. The location and size are
> +discovered from `journal_hint` inside ext2 superblock at boot.
> +Entries are written in a compact binary format with CRC32 protection,
> +stored in a circular buffer. Early-boot replay reads the journal,
> +validates entries, and applies the most recent consistent metadata
> +state to the filesystem, including restoration of deleted or modified
> +files and directories. The subsystem has been stress-tested (git
> +checkout, bulk deletions, crash/reboot loops) and successfully
> +preserves and replays metadata.
> +
> +Currently, the journal restores: `mtime`, `ctime`, `st_mode`, and
> +flags—i.e. metadata fields that can be restored without needing full
> +path knowledge.
> +
> +The journaling system is structured around a single public entry point
> +`libdiskfs/journal.c` and `libdiskfs/journal.h`. All other components
> +are internal to libdiskfs. Configuration data (offset, size, etc.) is
> +written in four reserved fields in the ext2 superblock.
> +
> +The journal captures all the major file system operations, yet not all
> +of them are used for replay for now.
> +
> +## Design details
> +
> +* Two write modes:
> +
> +  * Sync (default): blocking write; caller waits for journal flush.
> +
> +  * Async fallback: used only if writing fails (e.g., file temporarily
> +    unavailable); entries are queued and flushed later.
> +
> +* Journal file format:
> +
> +  * Ring buffer
> +
> +  * Magic/version checked
> +
> +  * CRC32-protected header and entries
> +
> +* Boot-time replay:
> +
> +  * During early boot, pread/write are unavailable. Instead, the replay
> +    code uses `_diskfs_rdwr_internal` to safely read the journal.
> +
> +  * Memory use during replay is controlled via fixed-size arenas.
> +
> +* Replay logic:
> +
> +  * Parsed entries are sorted and deduplicated via a graph.
> +
> +  * Metadata is only restored if the journaled update is newer than the
> +    current inode `mtime`, and the values differ. It uses strong
> +    fingerprinting to prevent misapplying updates after inode reuse.
> +
> +  * Replay is dual-path: inode-based first, falling back to path-based
> +    when needed.
> +
> +  * “Best effort” file recreation under `/restore/[timestamp]` with
> +    correct metadata when files vanish after a crash.
> +
> +* Noise filtering:
> +
> +  * A hardcoded inode range excludes `/dev/random`, `/dev/null`, and other
> +    noisy devices that would otherwise spam the journal.
> +
> +  * The filter contains a dedicated policy module to filter out noisy
> +events (`/tmp`, build outputs, etc.).
> +
> +*Two tricky problems took significant work:*
> +
> + 1. *Path recovery:* `cred->po->path` often gives useful file paths, but
> +    sometimes needs sanitizing or is imprecise. Combined with the current
> +    name, it’s often enough to reconstruct missing files. Replay now uses
> +    path-based recovery when inode-based recovery fails.
> +
> + 2. *Aggressive inode reuse in ext2:* After deletion (say at fsck time, or
> +    any time really) the same inode number may be reassigned to a completely
> +    different file after reboot. Fingerprinting ensures we never apply stale
> +    updates to the wrong file.
> +
> +## Testing & results
> +
> +- Survived repeated hard reboots under concurrent create/delete stress.
> +
> +- In chaos tests where fsck over-deleted files, journaling replay brought
> +them back as expected.
> +
> +## *Future work ideas*
> +
> +- Better path preservation to improve replay accuracy.
> +
> +- Per-node timelines for smarter change grouping.
> +
> +- Integration with ext tooling to support formatting with journaling fields
> +and an 8 MiB carve-out.
> +
> +- Exporting replay stats via /proc-like interface.
> +
> +  * Skip metadata updates for files/directories matching patterns:
> +
> +    * Paths like `/.git/`, `/build/`, etc.
> +
> +    * Extensions like .o, .a, ~, .swp
> +
> +  * Eventually user-configurable via static list or user-supplied config.
> +
> +## How to use this metadata journal
> +
> + To use the journal one must reserve an 8 MiB space outside the ext2
> + filesystem, but within its partition and write the journaling hints
> + into the ext2 superblock.
> +
> +This means the journal will live immediately after ext2 stops on disk.
> +
> +1. Shrink the ext2 filesystem by 8 MiB
> +
> +We’ll work directly on the image, so make a backup first.
> +First, find the ext2 partition start offset.
> +
> +    $ parted -sm debian-hurd.img unit B print
> +
> +Example output:
> +
> +    2:1000341504B:4194303999B:3193962496B:ext2::;
> +
> +The first number after 2: is the byte offset where the ext2 partition starts (1000341504 here).
> +
> +- Attach the ext2 partition as a loop device
> +
> +    # losetup -o 1000341504 --show -f debian-hurd.img
> +
> +This prints something like `/dev/loop0` (use whatever it returns).
> +Check current block count (these are 4 KiB ext2 blocks)
> +
> +    # tune2fs -l /dev/loop0 | grep 'Block count'
> +
> +Example output :
> +
> +    Block count: 1035776
> +
> +Shrink by 8 MiB
> +
> +    8 MiB = 8192 KiB → 8192 / 4 = 2048 ext2 blocks
> +
> +    New block count = 1035776 − 2048 = 1033728
> +
> +    # e2fsck -f /dev/loop0 (accept everything it asks)
> +    # resize2fs /dev/loop0 1033728
> +
> +Replace `1033728` with your calculated value.
> +Verify
> +
> +    # tune2fs -l /dev/loop0 | grep 'Block count'
> +
> +The number should be exactly 2048 less than the original.
> +Detach loop device
> +
> +    # losetup -d /dev/loop0
> +
> +2 Write the journaling hint to the superblock
> +
> +The ext2 superblock is 1024 bytes from the start of the partition.
> +The journaling hint is at offset 264 bytes from the start of the superblock.
> +
> +You can verify ext2 magic first (0x53ef) like so:
> +
> +    $ xxd -g1 -s $((1000342528 + 0x38)) -l 2 debian-hurd.img
> +
> +(needs to print "53 ef")
> +
> +Instead of doing all the byte math manually, use the attached script:
> +Show current hint
> +
> +    $ ./journal-hint.sh debian-hurd.img show
> +
> +enable journaling hint:
> +
> +    $ ./journal-hint.sh debian-hurd.img on
> +
> +(This assumes the journal lives in the last 8 MiB of partition 2 (safe after the shrink))
> +Disable journaling hint
> +
> +    $ ./journal-hint.sh debian-hurd.img off
> +
> +The script verifies ext2 magic before touching anything.
> +If the magic doesn’t match, it bails to prevent corruption.
> +
> +Safety first: Always work on a copy of your disk image. If the script
> +writes incorrect offsets, the low-level writer will overwrite whatever
> +is there, potentially corrupting your system! Make sure the journal
> +location is outside the filesystem by following the shrink procedure
> +above.
> +
> +Status:
> +
> +* `debian-hurd-20230608.img` — tested and works great.
> +* `debian-hurd-20250622.img` — tested and works great.
> --
> 2.50.1
>

--
Joshua Branson
Sent from the Hurd
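
For anyone following the how-to in the quoted patch with a different image, the block-count and byte-offset arithmetic can be sketched as below. This is only a sketch under the assumptions stated in the patch (8 MiB journal area, superblock 1024 bytes into the partition, magic at 0x38, hint at 264); the function names are hypothetical and are not part of the patch or of `journal-hint.sh`.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of the patch
# or of journal-hint.sh. Example values come from the quoted how-to.

# New ext2 block count after carving out an 8 MiB journal area.
# $1 = current block count, $2 = ext2 block size in bytes
shrunk_block_count() {
    echo $(( $1 - (8 * 1024 * 1024) / $2 ))
}

# Absolute byte offset, within the image, of a superblock field.
# $1 = partition start in bytes, $2 = field offset inside the superblock
superblock_field_offset() {
    echo $(( $1 + 1024 + $2 ))
}

shrunk_block_count 1035776 4096            # prints 1033728
superblock_field_offset 1000341504 0x38    # ext2 magic: prints 1000342584
superblock_field_offset 1000341504 264     # journaling hint: prints 1000342792
```

The first result matches the `resize2fs` argument in the how-to, and the second is the same position that the `xxd` example reaches with `$((1000342528 + 0x38))`.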