On 1 Apr 2026, at 10:35, David Hildenbrand (Arm) wrote:
> On 3/27/26 16:05, Zi Yan wrote:
>> On 27 Mar 2026, at 10:23, Lorenzo Stoakes (Oracle) wrote:
>>
>>> On Fri, Mar 27, 2026 at 02:58:12PM +0100, David Hildenbrand (Arm) wrote:
>>>>
>>>> There could now be a race between collapsing and the file getting opened
>>>> r/w.
>>>>
>>>> Are we sure that all code can really deal with that?
>>>>
>>>> IOW, "they already had to handle it separately" -- is that true?
>>>> khugepaged would have never collapse in writable files, so I wonder if
>>>> all code paths are prepared for that.
>>>
>>> OK I guess I overlooked a part of this code... :) see below.
>>>
>>> This is fine and would be a no-op anyway
>>>
>>> - if (f->f_mode & FMODE_WRITE) {
>>> - /*
>>> - * Depends on full fence from get_write_access() to
>>> synchronize
>>> - * against collapse_file() regarding i_writecount and
>>> nr_thps
>>> - * updates. Ensures subsequent insertion of THPs into the
>>> page
>>> - * cache will fail.
>>> - */
>>> - if (filemap_nr_thps(inode->i_mapping)) {
>>>
>>> But this:
>>>
>>> - if (!is_shmem) {
>>> - filemap_nr_thps_inc(mapping);
>>> - /*
>>> - * Paired with the fence in do_dentry_open() ->
>>> get_write_access()
>>> - * to ensure i_writecount is up to date and the update to
>>> nr_thps
>>> - * is visible. Ensures the page cache will be truncated if
>>> the
>>> - * file is opened writable.
>>> - */
>>> - smp_mb();
>>>
>>> We can drop barrier
>>>
>>> - if (inode_is_open_for_write(mapping->host)) {
>>> - result = SCAN_FAIL;
>>>
>>> But this is a functional change!
>>>
>>> Yup missed this.
>>
>> But I added
>>
>> + if (!is_shmem && inode_is_open_for_write(mapping->host))
>> + result = SCAN_FAIL;
>>
>> That keeps the original bail out, right?
>
> Independent of that, are we sure that the possible race we allow is ok?
Let me think.
do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
inode->i_writecount atomically and it turns inode_is_open_for_write()
to true. Then, do_dentry_open() also truncates all pages
if filemap_nr_thps() is not zero. This pairs with khugepaged’s first
filemap_nr_thps_inc() then inode_is_open_for_write() to prevent opening
a fd with write when there is a read-only THP.
After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
on FSes with large folio support (to be precise THP support). If a fd
is opened for write before inode_is_open_for_write() check, khugepaged
will stop. It is fine. But if a fd is opened for write after
inode_is_open_for_write() check, khugepaged will try to collapse a read-only
THP and the fd can be written at the same time.
I notice that fd write requires locking the to-be-written folio first
(I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
f_ops->write() has the same locking requirement) and khugepaged has already
locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
fd is opened for write after inode_is_open_for_write() check, its write
will wait for khugepaged collapse and see a new THP. Since the FS
supports THP, writing to the new THP should be fine.
Let me know if my analysis above makes sense. If yes, I will add it
to the commit message and add a succinct comment about it before
inode_is_open_for_write().
Best Regards,
Yan, Zi