Milos Nikic <nikic.mi...@gmail.com> writes:

> Thanks for the write up. It is great.
>
> Two minor details:
> 1) "Currently, the journal restores: `mtime`, `ctime`, `st_mode`, and
> +flags"
>
> uid, gid, and author should also be on that list; they are journaled
> and restored as well.
>
> And 
> 2) +* Two write modes:
> +
> + * Sync (default): blocking write; caller waits for journal flush.
> +
> + * Async fallback: used only if writing fails (e.g., file temporarily
> +   unavailable); entries are queued and flushed later.
>
> Only sync is enabled now; async has been removed, so maybe we don't need
> to mention it.

Thanks, I'll resend the patch!

>
> Thanks once more!
> Milos
>
> On Tue, Aug 12, 2025 at 6:56 AM jbra...@dismail.de <jbra...@dismail.de> wrote:
>
>  * hurd/libdiskfs.mdwn: add a short summary paragraph.
>  * hurd/libdiskfs/journal.mdwn: new file.
>  ---
>   hurd/libdiskfs.mdwn         |  10 +-
>   hurd/libdiskfs/journal.mdwn | 238 ++++++++++++++++++++++++++++++++++++
>   2 files changed, 247 insertions(+), 1 deletion(-)
>   create mode 100644 hurd/libdiskfs/journal.mdwn
>
>  diff --git a/hurd/libdiskfs.mdwn b/hurd/libdiskfs.mdwn
>  index dd499785..c939905b 100644
>  --- a/hurd/libdiskfs.mdwn
>  +++ b/hurd/libdiskfs.mdwn
>  @@ -1,4 +1,4 @@
>  -[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
>  +[[!meta copyright="Copyright © 2011, 2025 Free Software Foundation, Inc."]]
>
>   [[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
>   id="license" text="Permission is granted to copy, distribute and/or modify this
>  @@ -8,6 +8,14 @@ Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
>   is included in the section entitled [[GNU Free Documentation
>   License|/fdl]]."]]"""]]
>
>  +Hurd developers use `libdiskfs` to write filesystems like
>  +[[translator/ext2fs]] and [[translator/fatfs]].  `libdiskfs` does
>  +suffer from [[locking
>  +issues|community/gsoc/project_ideas/libdiskfs_locking]].  In the
>  +summer of 2025, Milos Nikic began adding a metadata
>  +[[libdiskfs/journal]]. So far one can only use the journal for ext2fs.
>  +It is not compatible with ext3 or ext4's journal.
>  +
>
>   # Paging
>
>  diff --git a/hurd/libdiskfs/journal.mdwn b/hurd/libdiskfs/journal.mdwn
>  new file mode 100644
>  index 00000000..f2bf70f5
>  --- /dev/null
>  +++ b/hurd/libdiskfs/journal.mdwn
>  @@ -0,0 +1,238 @@
>  +[[!meta copyright="Copyright © 2025 Free Software Foundation, Inc."]]
>  +
>  +[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
>  +id="license" text="Permission is granted to copy, distribute and/or modify this
>  +document under the terms of the GNU Free Documentation License, Version 1.2 or
>  +any later version published by the Free Software Foundation; with no Invariant
>  +Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
>  +is included in the section entitled [[GNU Free Documentation
>  +License|/fdl]]."]]"""]]
>  +
>  +In the summer of 2025, Milos Nikic began working on a metadata
>  +journaling subsystem for libdiskfs, which he started using with
>  +ext2fs. His prototype journal stores metadata changes to raw disk
>  +space outside of the ext2 filesystem but within the same partition.
>  +On boot, before fsck runs, the journal is replayed to fix
>  +inconsistencies. This journal should fix most issues that hard
>  +shutdowns cause. Hopefully the ASCII art below is helpful.
>  +
>  +      |-------------+-------------+-------------|
>  +      | partition 1 | partition 2 | partition 3 |
>  +      |-------------+-------------+-------------|
>  +      | begin ext2  | begin ext2  |             |
>  +      | journal     | journal     |             |
>  +      | config data | config data |             |
>  +      |             |             |             |
>  +      | /           | /home       |    swap     |
>  +      |             |             |             |
>  +      | end ext2    | end ext2    |             |
>  +      |-------------+-------------|             |
>  +      | journal in  | journal in  |             |
>  +      | raw disk    | raw disk    |             |
>  +      | space. 8MiB | space. 8MiB |             |
>  +      |-------------+-------------+-------------|
>  +
>  +The journal is *not* a replacement for fsck, checksumming, ext4-style
>  +transactions, or a strong consistency guarantee. It’s a *best-effort*,
>  +*do-no-harm* crash-recovery helper that complements fsck by restoring
>  +metadata and paths opportunistically.  This journal is not compatible
>  +with ext3 or ext4's journal.
>  +
>  +The journaling subsystem writes metadata changes to a reserved raw
>  +disk area outside the ext2-managed region.  The location and size are
>  +discovered from `journal_hint` inside the ext2 superblock at boot.
>  +Entries are written in a compact binary format with CRC32 protection,
>  +stored in a circular buffer.  Early-boot replay reads the journal,
>  +validates entries, and applies the most recent consistent metadata
>  +state to the filesystem, including restoration of deleted or modified
>  +files and directories.  The subsystem has been stress-tested (git
>  +checkout, bulk deletions, crash/reboot loops) and successfully
>  +preserves and replays metadata.
>  +
>  +Currently, the journal restores: `mtime`, `ctime`, `st_mode`, and
>  +flags—i.e. metadata fields that can be restored without needing full
>  +path knowledge.
>  +
>  +The journaling system is structured around a single public entry point
>  +`libdiskfs/journal.c` and `libdiskfs/journal.h`. All other components
>  +are internal to libdiskfs. Configuration data (offset, size, etc.) is
>  +written in four reserved fields in the ext2 superblock.
>  +
>  +The journal captures all the major file system operations, but replay
>  +does not yet use all of them.
>  +
>  +## Design details
>  +
>  +* Two write modes:
>  +
>  + * Sync (default): blocking write; caller waits for journal flush.
>  +
>  + * Async fallback: used only if writing fails (e.g., file temporarily
>  +   unavailable); entries are queued and flushed later.
>  +
>  +* Journal file format:
>  +
>  + * Ring buffer
>  +
>  + * Magic/version checked
>  +
>  + * CRC32-protected header and entries
>  +
>  +* Boot-time replay:
>  +
>  + * During early boot, pread/write are unavailable. Instead, the replay
>  +   code uses `_diskfs_rdwr_internal` to safely read the journal.
>  +
>  + * Memory use during replay is controlled via fixed-size arenas.
>  +
>  +* Replay logic:
>  +
>  + * Parsed entries are sorted and deduplicated via a graph.
>  +
>  + * Metadata is only restored if the journaled update is newer than the
>  +   current inode `mtime`, and the values differ. It uses strong
>  +   fingerprinting to prevent misapplying updates after inode reuse.
>  +
>  + * Replay is dual-path: inode-based first, falling back to path-based
>  +   when needed.
>  +
>  + * “Best effort” file recreation under `/restore/[timestamp]` with
>  +   correct metadata when files vanish after a crash.
>  +
>  +* Noise filtering:
>  +
>  + * A hardcoded inode range excludes `/dev/random`, `/dev/null`, and other
>  + noisy devices that would otherwise spam the journal.
>  +
>  + * The filter contains a dedicated policy module to filter out noisy
>  +events (`/tmp`, build outputs, etc.).
>  +
>  +*Two tricky problems took significant work:*
>  +
>  +   1. *Path recovery:* `cred->po->path` often gives useful file paths, but
>  +   sometimes needs sanitizing or is imprecise. Combined with the current
>  +   name, it’s often enough to reconstruct missing files. Replay now uses
>  +   path-based recovery when inode-based recovery fails.
>  +
>  +   2. *Aggressive inode reuse in ext2:* After deletion (say at fsck time, or
>  +   any time really) the same inode number may be reassigned to a completely
>  +   different file after reboot. Fingerprinting ensures we never apply stale
>  +   updates to the wrong file.
>  +
>  +## Testing & results
>  +
>  +- Survived repeated hard reboots under concurrent create/delete stress.
>  +
>  +- In chaos tests where fsck over-deleted files, journaling replay brought
>  +them back as expected.
>  +
>  +## *Future work ideas*
>  +
>  +- Better path preservation to improve replay accuracy.
>  +
>  +- Per-node timelines for smarter change grouping.
>  +
>  +- Integration with ext tooling to support formatting with journaling fields
>  +and an 8 MiB carve-out.
>  +
>  +- Exporting replay stats via /proc-like interface.
>  +
>  + * Skip metadata updates for files/directories matching patterns:
>  +
>  +  * Paths like `/.git/`, `/build/`, etc.
>  +
>  +  * Extensions like .o, .a, ~, .swp
>  +
>  +  * Eventually user-configurable via static list or user-supplied config.
>  +
>  +## How to use this metadata journal
>  +
>  + To use the journal, one must reserve 8 MiB of space outside the ext2
>  + filesystem but within its partition, and write the journaling hints
>  + into the ext2 superblock.
>  +
>  +This means the journal lives immediately after the end of ext2 on disk.
>  +
>  +1. Shrink the ext2 filesystem by 8 MiB
>  +
>  +We’ll work directly on the image, so make a backup first.
>  +First, find the ext2 partition start offset.
>  +
>  +               $ parted -sm debian-hurd.img unit B print
>  +
>  +Example output:
>  +
>  +       2:1000341504B:4194303999B:3193962496B:ext2::;
>  +
>  +The first number after `2:` is the byte offset where the ext2 partition starts (1000341504 here).
>  +
>  +- Attach the ext2 partition as a loop device
>  +
>  +               # losetup -o 1000341504 --show -f debian-hurd.img
>  +
>  +This prints something like `/dev/loop0` (use whatever it returns).
>  +Check current block count (these are 4 KiB ext2 blocks)
>  +
>  +       # tune2fs -l /dev/loop0 | grep 'Block count'
>  +
>  +Example output:
>  +
>  +       Block count:              1035776
>  +
>  +Shrink by 8 MiB
>  +
>  +    8 MiB = 8192 KiB → 8192 / 4 = 2048 ext2 blocks
>  +
>  +    New block count = 1035776 − 2048 = 1033728
>  +
>  +       # e2fsck -f /dev/loop0 (accept everything it asks)
>  +       # resize2fs /dev/loop0 1033728
>  +
>  +Replace `1033728` with your calculated value.
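
A minimal sketch of the block arithmetic above, for anyone who would rather not do it by hand. It assumes 4 KiB ext2 blocks and reuses the example block count from the text; the variable names are only illustrative and are not part of the patch or of journal-hint.sh.

```shell
# Illustrative values: substitute what tune2fs reports for your own image.
block_size=4096                      # bytes per ext2 block (4 KiB, as in the example)
old_blocks=1035776                   # 'Block count' before shrinking
journal_bytes=$((8 * 1024 * 1024))   # 8 MiB reserved for the journal

# Number of ext2 blocks the journal area occupies, and the target size
# to pass to resize2fs.
journal_blocks=$((journal_bytes / block_size))
new_blocks=$((old_blocks - journal_blocks))

echo "journal blocks:  $journal_blocks"
echo "new block count: $new_blocks"
```

With the example numbers this prints 2048 and 1033728, matching the calculation in the text.
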
>  +Verify
>  +
>  +    # tune2fs -l /dev/loop0 | grep 'Block count'
>  +
>  +The number should be exactly 2048 less than the original.
>  +Detach loop device
>  +
>  +       # losetup -d /dev/loop0
>  +
>  +2. Write the journaling hint to the superblock
>  +
>  +The ext2 superblock is 1024 bytes from the start of the partition.
>  +The journaling hint is at offset 264 bytes from the start of the superblock.
>  +
>  +You can verify the ext2 magic first (0xEF53, stored little-endian as the bytes `53 ef`) like so:
>  +
>  +       $ xxd -g1 -s $((1000342528 + 0x38)) -l 2 debian-hurd.img
>  +
>  +(needs to print "53 ef")
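
For reference, the byte math behind that xxd offset can be sketched as below. It only restates the numbers given in the text (partition start from the parted example, superblock at 1024 bytes, magic at 0x38, hint at 264); the variable names are hypothetical and this is not what journal-hint.sh necessarily does internally.

```shell
# Illustrative values taken from the example in the text.
part_start=1000341504   # byte offset of the ext2 partition in the image
sb_offset=1024          # superblock starts 1024 bytes into the partition

# Absolute offsets within the disk image.
sb_start=$((part_start + sb_offset))
magic_offset=$((sb_start + 0x38))   # ext2 magic, 2 bytes
hint_offset=$((sb_start + 264))     # journaling hint, per the text

echo "superblock starts at: $sb_start"
echo "magic is at:          $magic_offset"
echo "journal hint is at:   $hint_offset"
```

The first value, 1000342528, is the base used in the xxd command above.
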
>  +
>  +Instead of doing all the byte math manually, use the attached script:
>  +Show current hint
>  +
>  +       $ ./journal-hint.sh debian-hurd.img show
>  +
>  +Enable journaling hint
>  +
>  +       $ ./journal-hint.sh debian-hurd.img on
>  +
>  +(This assumes the journal lives in the last 8 MiB of partition 2, which is safe after the shrink.)
>  +Disable journaling hint
>  +
>  +       $ ./journal-hint.sh debian-hurd.img off
>  +
>  +The script verifies ext2 magic before touching anything.
>  +If the magic doesn’t match, it bails to prevent corruption.
>  +
>  +Safety first: Always work on a copy of your disk image. If the script
>  +writes incorrect offsets, the low-level writer will overwrite whatever
>  +is there, potentially corrupting your system! Make sure the journal
>  +location is outside the filesystem by following the shrink procedure
>  +above.
>  +
>  +Status:
>  +
>  +* `debian-hurd-20230608.img` — tested and works great.
>  +* `debian-hurd-20250622.img` — tested and works great.
>  -- 
>  2.50.1
>

-- 

Joshua Branson
Sent from the Hurd
