* hurd/libdiskfs.mdwn: add a short summary paragraph.
* hurd/libdiskfs/journal.mdwn: new file.
---
 hurd/libdiskfs.mdwn         |  10 +-
 hurd/libdiskfs/journal.mdwn | 238 ++++++++++++++++++++++++++++++++++++
 2 files changed, 247 insertions(+), 1 deletion(-)
 create mode 100644 hurd/libdiskfs/journal.mdwn
diff --git a/hurd/libdiskfs.mdwn b/hurd/libdiskfs.mdwn
index dd499785..c939905b 100644
--- a/hurd/libdiskfs.mdwn
+++ b/hurd/libdiskfs.mdwn
@@ -1,4 +1,4 @@
-[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
+[[!meta copyright="Copyright © 2011, 2025 Free Software Foundation, Inc."]]
 
 [[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
 id="license" text="Permission is granted to copy, distribute and/or modify this
@@ -8,6 +8,14 @@ Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
 is included in the section entitled [[GNU Free Documentation
 License|/fdl]]."]]"""]]
 
+Hurd developers use `libdiskfs` to write filesystems like
+[[translator/ext2fs]] and [[translator/fatfs]]. `libdiskfs` does
+suffer from [[locking
+issues|community/gsoc/project_ideas/libdiskfs_locking]]. In the
+summer of 2025, Milos Nikic began adding a metadata
+[[libdiskfs/journal]]. So far, the journal can only be used with ext2fs.
+It is not compatible with the ext3 or ext4 journal.
+
 
 # Paging
 
diff --git a/hurd/libdiskfs/journal.mdwn b/hurd/libdiskfs/journal.mdwn
new file mode 100644
index 00000000..f2bf70f5
--- /dev/null
+++ b/hurd/libdiskfs/journal.mdwn
@@ -0,0 +1,238 @@
+[[!meta copyright="Copyright © 2025 Free Software Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+In the summer of 2025, Milos Nikic began working on a metadata
+journaling subsystem for libdiskfs, which he started using with
+ext2fs.
+His prototype journal stores metadata changes to raw disk
+space outside of the ext2 filesystem but within the same partition.
+On boot, before fsck runs, the journal is replayed to fix
+inconsistencies. This journal should fix most metadata issues that hard
+shutdowns cause. The ASCII art below shows the on-disk layout.
+
+    |-------------+-------------+-------------|
+    | partition 1 | partition 2 | partition 3 |
+    |-------------+-------------+-------------|
+    | begin ext2  | begin ext2  |             |
+    | journal     | journal     |             |
+    | config data | config data |             |
+    |             |             |             |
+    | /           | /home       | swap        |
+    |             |             |             |
+    | end ext2    | end ext2    |             |
+    |-------------+-------------|             |
+    | journal in  | journal in  |             |
+    | raw disk    | raw disk    |             |
+    | space. 8MiB | space. 8MiB |             |
+    |-------------+-------------+-------------|
+
+The journal is *not* a replacement for fsck, checksumming, ext4-style
+transactions, or a strong consistency guarantee. It’s a *best-effort*,
+*do-no-harm* crash-recovery helper that complements fsck by restoring
+metadata and paths opportunistically. This journal is not compatible
+with the ext3 or ext4 journal.
+
+The journaling subsystem writes metadata changes to a reserved raw
+disk area outside the ext2-managed region. The location and size are
+discovered from `journal_hint` inside the ext2 superblock at boot.
+Entries are written in a compact binary format with CRC32 protection,
+stored in a circular buffer. Early-boot replay reads the journal,
+validates entries, and applies the most recent consistent metadata
+state to the filesystem, including restoration of deleted or modified
+files and directories. The subsystem has been stress-tested (git
+checkout, bulk deletions, crash/reboot loops) and successfully
+preserved and replayed metadata in those tests.
+
+Currently, the journal restores `mtime`, `ctime`, `st_mode`, and
+flags, i.e. metadata fields that can be restored without needing full
+path knowledge.
+
+The journaling system is structured around a single public entry point,
+`libdiskfs/journal.c` and `libdiskfs/journal.h`. All other components
+are internal to libdiskfs. Configuration data (offset, size, etc.) is
+written in four reserved fields in the ext2 superblock.
+
+The journal captures all the major file system operations, but replay
+does not yet use all of them.
+
+## Design details
+
+* Two write modes:
+
+    * Sync (default): blocking write; the caller waits for the journal flush.
+
+    * Async fallback: used only if writing fails (e.g., file temporarily
+      unavailable); entries are queued and flushed later.
+
+* Journal file format:
+
+    * Ring buffer
+
+    * Magic number and version are checked
+
+    * CRC32-protected header and entries
+
+* Boot-time replay:
+
+    * During early boot, pread/write are unavailable. Instead, the replay
+      code uses `_diskfs_rdwr_internal` to safely read the journal.
+
+    * Memory use during replay is controlled via fixed-size arenas.
+
+* Replay logic:
+
+    * Parsed entries are sorted and deduplicated via a graph.
+
+    * Metadata is only restored if the journaled update is newer than the
+      current inode `mtime`, and the values differ. Replay uses strong
+      fingerprinting to prevent misapplying updates after inode reuse.
+
+    * Replay is dual-path: inode-based first, falling back to path-based
+      when needed.
+
+    * “Best effort” file recreation under `/restore/[timestamp]` with
+      correct metadata when files vanish after a crash.
+
+* Noise filtering:
+
+    * A hardcoded inode range excludes `/dev/random`, `/dev/null`, and other
+      noisy devices that would otherwise spam the journal.
+
+    * A dedicated policy module filters out other noisy
+      events (`/tmp`, build outputs, etc.).
+
+*Two tricky problems took significant work:*
+
+ 1. *Path recovery:* `cred->po->path` often gives useful file paths, but
+    sometimes needs sanitizing or is imprecise. Combined with the current
+    name, it’s often enough to reconstruct missing files.
+    Replay now uses
+    path-based recovery when inode-based recovery fails.
+
+ 2. *Aggressive inode reuse in ext2:* After a deletion (at fsck time, or
+    at any other time), the same inode number may be reassigned to a completely
+    different file after reboot. Fingerprinting ensures we never apply stale
+    updates to the wrong file.
+
+## Testing & results
+
+- Survived repeated hard reboots under concurrent create/delete stress.
+
+- In chaos tests where fsck over-deleted files, journaling replay brought
+  them back as expected.
+
+## Future work ideas
+
+- Better path preservation to improve replay accuracy.
+
+- Per-node timelines for smarter change grouping.
+
+- Integration with ext tooling to support formatting with journaling fields
+  and an 8 MiB carve-out.
+
+- Exporting replay stats via a /proc-like interface.
+
+- Skipping metadata updates for files/directories matching patterns:
+
+    * Paths like `/.git/`, `/build/`, etc.
+
+    * Suffixes like `.o`, `.a`, `~`, `.swp`
+
+    * Eventually user-configurable via a static list or a user-supplied config.
+
+## How to use this metadata journal
+
+To use the journal, one must reserve an 8 MiB space outside the ext2
+filesystem but within its partition, and write the journaling hints
+into the ext2 superblock.
+
+This means the journal will live immediately after the end of the ext2 filesystem on disk.
+
+1. Shrink the ext2 filesystem by 8 MiB
+
+We’ll work directly on the image, so make a backup first.
+First, find the ext2 partition start offset:
+
+    $ parted -sm debian-hurd.img unit B print
+
+Example output:
+
+    2:1000341504B:4194303999B:3193962496B:ext2::;
+
+The first number after `2:` is the byte offset where the ext2 partition starts (1000341504 here).
+
+Attach the ext2 partition as a loop device:
+
+    # losetup -o 1000341504 --show -f debian-hurd.img
+
+This prints something like `/dev/loop0` (use whatever it returns).
+Check the current block count (these are 4 KiB ext2 blocks):
+
+    # tune2fs -l /dev/loop0 | grep 'Block count'
+
+Example output:
+
+    Block count: 1035776
+
+Shrink by 8 MiB:
+
+    8 MiB = 8192 KiB → 8192 / 4 = 2048 ext2 blocks
+
+    New block count = 1035776 − 2048 = 1033728
+
+    # e2fsck -f /dev/loop0    # accept every fix it proposes
+    # resize2fs /dev/loop0 1033728
+
+Replace `1033728` with your calculated value.
+Verify the new block count:
+
+    # tune2fs -l /dev/loop0 | grep 'Block count'
+
+The number should be exactly 2048 less than the original.
+Detach the loop device:
+
+    # losetup -d /dev/loop0
+
+2. Write the journaling hint to the superblock
+
+The ext2 superblock is 1024 bytes from the start of the partition.
+The journaling hint is at byte offset 264 from the start of the superblock.
+
+You can first verify the ext2 magic (0xEF53, stored little-endian as the bytes `53 ef`) like so:
+
+    $ xxd -g1 -s $((1000342528 + 0x38)) -l 2 debian-hurd.img
+
+(The output must show the bytes `53 ef`.)
+
+Instead of doing all the byte math manually, use the attached script.
+Show the current hint:
+
+    $ ./journal-hint.sh debian-hurd.img show
+
+Enable the journaling hint:
+
+    $ ./journal-hint.sh debian-hurd.img on
+
+This assumes the journal lives in the last 8 MiB of partition 2, which is safe after the shrink.
+Disable the journaling hint:
+
+    $ ./journal-hint.sh debian-hurd.img off
+
+The script verifies the ext2 magic before touching anything.
+If the magic doesn’t match, it bails out to prevent corruption.
+
+Safety first: Always work on a copy of your disk image. If the script
+writes incorrect offsets, the low-level writer will overwrite whatever
+is there, potentially corrupting your system! Make sure the journal
+location is outside the filesystem by following the shrink procedure
+above.
+
+Status:
+
+* `debian-hurd-20230608.img`: tested and works great.
+* `debian-hurd-20250622.img`: tested and works great.
-- 
2.50.1