On 1 Apr 2026, at 10:35, David Hildenbrand (Arm) wrote:

> On 3/27/26 16:05, Zi Yan wrote:
>> On 27 Mar 2026, at 10:23, Lorenzo Stoakes (Oracle) wrote:
>>
>>> On Fri, Mar 27, 2026 at 02:58:12PM +0100, David Hildenbrand (Arm) wrote:
>>>>
>>>> There could now be a race between collapsing and the file getting opened
>>>> r/w.
>>>>
>>>> Are we sure that all code can really deal with that?
>>>>
>>>> IOW, "they already had to handle it separately" -- is that true?
>>>> khugepaged would have never collapse in writable files, so I wonder if
>>>> all code paths are prepared for that.
>>>
>>> OK I guess I overlooked a part of this code... :) see below.
>>>
>>> This is fine and would be a no-op anyway
>>>
>>> -       if (f->f_mode & FMODE_WRITE) {
>>> -               /*
>>> -                * Depends on full fence from get_write_access() to 
>>> synchronize
>>> -                * against collapse_file() regarding i_writecount and 
>>> nr_thps
>>> -                * updates. Ensures subsequent insertion of THPs into the 
>>> page
>>> -                * cache will fail.
>>> -                */
>>> -               if (filemap_nr_thps(inode->i_mapping)) {
>>>
>>> But this:
>>>
>>> -       if (!is_shmem) {
>>> -               filemap_nr_thps_inc(mapping);
>>> -               /*
>>> -                * Paired with the fence in do_dentry_open() -> 
>>> get_write_access()
>>> -                * to ensure i_writecount is up to date and the update to 
>>> nr_thps
>>> -                * is visible. Ensures the page cache will be truncated if 
>>> the
>>> -                * file is opened writable.
>>> -                */
>>> -               smp_mb();
>>>
>>> We can drop barrier
>>>
>>> -               if (inode_is_open_for_write(mapping->host)) {
>>> -                       result = SCAN_FAIL;
>>>
>>> But this is a functional change!
>>>
>>> Yup missed this.
>>
>> But I added
>>
>> +    if (!is_shmem && inode_is_open_for_write(mapping->host))
>> +            result = SCAN_FAIL;
>>
>> That keeps the original bail out, right?
>
> Independent of that, are we sure that the possible race we allow is ok?

Let me think.

do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
inode->i_writecount atomically and it turns inode_is_open_for_write()
to true. Then, do_dentry_open() also truncates all pages
if filemap_nr_thps() is not zero. This pairs with khugepaged’s first
filemap_nr_thps_inc() then inode_is_open_for_write() to prevent opening
a fd with write when there is a read-only THP.

After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
on FSes with large folio support (to be precise THP support). If a fd
is opened for write before inode_is_open_for_write() check, khugepaged
will stop. It is fine. But if a fd is opened for write after
inode_is_open_for_write() check, khugepaged will try to collapse a read-only
THP and the fd can be written at the same time.

I notice that fd write requires locking the to-be-written folio first
(I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
f_ops->write() has the same locking requirement) and khugepaged has already
locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
fd is opened for write after inode_is_open_for_write() check, its write
will wait for khugepaged collapse and see a new THP. Since the FS
supports THP, writing to the new THP should be fine.

Let me know if my analysis above makes sense. If yes, I will add it
to the commit message and add a succinct comment about it before
inode_is_open_for_write().


Best Regards,
Yan, Zi

Reply via email to