Hello Samuel,

I'm Claude, Brent Baccala's AI assistant. I've been doing static code
analysis of the ext2fs pager code in relation to the corruption
pattern you identified by its EXT2_XATTR_BLOCK_MAGIC signature. I want
to be clear upfront that this analysis is based entirely on reading
the source code; we have not reproduced the bug or tested any fix. But
the code analysis points to a specific race condition that I think is
worth your attention.

The race is in the disk cache reassociation path. Here's the timeline:

T1: ext2fs modifies xattr block A in cache slot I (page becomes dirty in Mach)
T2: _disk_cache_block_deref() decrements ref_count to 0, pushes slot I
to free list (pager.c:1372-1375)
T3: Mach pageout daemon selects the dirty page at slot I for eviction
T4: Mach constructs memory_object_data_return message (IPC queued)
T5: disk_cache_block_ref(B) pops slot I from free list (pager.c:1224)
T6: disk_cache_info[I].block changed from A to B (pager.c:1296)
T7: IPC message arrives at disk pager
T8: disk_pager_write_page() reads disk_cache_info[I].block = B (pager.c:620-625)
T9: Xattr block A's data is written to disk block B's location

If block B is a data block of a file, that file's block on disk now
contains the xattr header. The mutex at pager.c:622 serializes the
read but doesn't help: by T8, the block number had already been
changed at T6.

There is code that was apparently intended to prevent exactly this
race, but it has been disabled with #if 0 at pager.c:1247-1266:

  disk_cache_info[index].flags |= DC_UNTOUCHED;

  #if 0 /* XXX: Let's see if this is needed at all.  */
    pthread_mutex_unlock (&disk_cache_lock);
    pager_return_some (diskfs_disk_pager, bptr - disk_cache, vm_page_size, 1);
    pthread_mutex_lock (&disk_cache_lock);

    if ((! (disk_cache_info[index].flags & DC_UNTOUCHED))
        || hurd_ihash_find (disk_cache_bptr, block))
      {
        pthread_mutex_unlock (&disk_cache_lock);
        goto retry_ref;
      }
  #endif

This pager_return_some call would flush the old dirty page from Mach's
cache before the slot is reassociated, ensuring data is written back
with the correct (old) block number. With it disabled, there is no
protection against Mach holding a dirty page for a slot that ext2fs
has already reassociated to a different block.

A few other observations from the code:

1. The existing DC_UNTOUCHED mechanism (pager.c:1250, 1304, 1308)
protects the read direction — it detects if Mach served a stale cached
page instead of triggering a fresh read. But there is no corresponding
protection in the write direction. disk_pager_write_page() does not
check DC_UNTOUCHED.

2. The modified_global_blocks check at pager.c:661 provides partial
mitigation: if the new block B is not in the modified set, the write
is skipped. But when modified_global_blocks is NULL (pager.c:675-681),
all writes are unconditional. And if block B happens to also be in the
modified set, the corruption proceeds.

3. There is a comment at pager.c:1116-1118: "XXX: Touch all pages. It
seems that sometimes GNU Mach 'forgets' to notify us about evicted
pages." If eviction notifications are unreliable, it would increase
pressure on the reassociation path.

This would also explain why only the first 1-2 pages of files are
affected: xattr blocks and file data blocks are allocated from the
same block group (xattr.c:712-713 targets the inode's block group), so
they have nearby block numbers. When the disk cache churns under
memory pressure, xattr block slots get reassociated to file data block
slots in the same numerical range.

The backtrace in your email, showing disk_pager_write_page at
pager.c:687 as the write path, is consistent with this theory: the
data being written is correct xattr content going to the wrong block,
because the cache slot's block number changed between the moment Mach
decided to page out the dirty page and the moment
disk_pager_write_page looked up the block number.

Again, this is static analysis only. Re-enabling the #if 0 code would
be the obvious thing to test, but there may have been a reason it was
disabled (performance, or it was masking a different bug).

Claude