On 1 Apr 2026, at 15:15, David Hildenbrand (Arm) wrote:

> On 4/1/26 17:32, Zi Yan wrote:
>> On 1 Apr 2026, at 10:35, David Hildenbrand (Arm) wrote:
>>
>>> On 3/27/26 16:05, Zi Yan wrote:
>>>>
>>>>
>>>> But I added
>>>>
>>>> +  if (!is_shmem && inode_is_open_for_write(mapping->host))
>>>> +          result = SCAN_FAIL;
>>>>
>>>> That keeps the original bail out, right?
>>>
>>> Independent of that, are we sure that the possible race we allow is ok?
>>
>> Let me think.
>>
>> do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
>> inode->i_writecount atomically and it turns inode_is_open_for_write()
>> to true. Then, do_dentry_open() also truncates all pages
>> if filemap_nr_thps() is not zero. This pairs with khugepaged’s first
>> filemap_nr_thps_inc() then inode_is_open_for_write() to prevent opening
>> a fd with write when there is a read-only THP.
>>
>> After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
>> on FSes with large folio support (to be precise THP support). If a fd
>> is opened for write before inode_is_open_for_write() check, khugepaged
>> will stop. It is fine. But if a fd is opened for write after
>> inode_is_open_for_write() check, khugepaged will try to collapse a read-only
>> THP and the fd can be written at the same time.
>
> Exactly, that's the race I mean.
>
>>
>> I notice that fd write requires locking the to-be-written folio first
>> (I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
>> f_ops->write() has the same locking requirement) and khugepaged has already
>> locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
>> fd is opened for write after inode_is_open_for_write() check, its write
>> will wait for khugepaged collapse and see a new THP. Since the FS
>> supports THP, writing to the new THP should be fine.
>>
>> Let me know if my analysis above makes sense. If yes, I will add it
>> to the commit message and add a succinct comment about it before
>> inode_is_open_for_write().
>
> khugepaged code is the only code that replaces folios in the pagecache
> by other folios. So my main concern is if that is problematic on
> concurrent write access.

folio_split() does it too, although it replaces a large folio with
a bunch of after-split folios. It is a kinda reverse process of
collapse_file().


>
> You argue that the folio lock is sufficient. That's certainly true for
> individual folios, but I am more concerned about the replacement part.

For the replacement part, both old and new folios are locked during
the process. A parallel writer uses filemap_get_entry() to get the folio
from mapping, but all of them check folio->mapping after acquiring the
folio lock, except mincore_page() which is a reader. A writer can see
either old folio or new folio during the process, but

1. if it sees the old one, it waits on the old folio lock. After
it acquires the lock, it sees old_folio->mapping is NULL, no longer
matches the original mapping. The writer will try again.

2. if it sees the new one, it waits on the new folio lock. After
it acquires the lock, it sees new_folio->mapping matches the
original mapping and proceeds to its writes.

3. if khugepaged needs to do a rollback, the old folio will stay
the same and the writer will see the old one after it gets the old
folio lock.

>
> I don't have anything concrete, primarily just pointing out that this is
> a change that might unlock some code paths that could not have been
> triggered before.

Yes, the concern makes sense.

BTW, Claude is trying to convince me that even inode_is_open_for_write()
is unecessary since 1) folio_test_dirty() before it has
made sure the folio is clean, 2) try_to_unmap() and the locked folio prevents
further writes.

But then we find a hole between folio_test_dirty() and
try_to_unmap() where a write via a writable mmap PTE can dirty the folio
after folio_test_dirty() and try_to_unmap(). To remove that hole,
the “if (!is_shmem && (folio_test_dirty(...) || folio_test_writeback(...))”
needs to be moved after try_to_unmap(). With that, all to-be-collapsed
folios will be clean, unmapped, and locked, where unmapped means
writes via mmap need to fault and take the folio lock, locked means
writes via mmap and write() need to wait until the folio is unlocked.

Let me know if my reasoning makes sense. It is definitely worth the time
and effort to ensure this patchset does not introduce any unexpected race
condition or issue.

Best Regards,
Yan, Zi

Reply via email to