On 2 Apr 2026, at 10:35, David Hildenbrand (Arm) wrote: > On 4/1/26 22:33, Zi Yan wrote: >> On 1 Apr 2026, at 15:15, David Hildenbrand (Arm) wrote: >> >>> On 4/1/26 17:32, Zi Yan wrote: >>>> >>>> >>>> Let me think. >>>> >>>> do_dentry_open() -> file_get_write_access() -> get_write_access() bumps >>>> inode->i_writecount atomically and it turns inode_is_open_for_write() >>>> to true. Then, do_dentry_open() also truncates all pages >>>> if filemap_nr_thps() is not zero. This pairs with khugepaged’s first >>>> filemap_nr_thps_inc() then inode_is_open_for_write() to prevent opening >>>> a fd with write when there is a read-only THP. >>>> >>>> After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs >>>> on FSes with large folio support (to be precise THP support). If a fd >>>> is opened for write before inode_is_open_for_write() check, khugepaged >>>> will stop. It is fine. But if a fd is opened for write after >>>> inode_is_open_for_write() check, khugepaged will try to collapse a >>>> read-only >>>> THP and the fd can be written at the same time. >>> >>> Exactly, that's the race I mean. >>> >>>> >>>> I notice that fd write requires locking the to-be-written folio first >>>> (I see it from f_ops->write_iter() -> write_begin_get_folio() and assume >>>> f_ops->write() has the same locking requirement) and khugepaged has already >>>> locked the to-be-collapsed folio before inode_is_open_for_write(). So if >>>> the >>>> fd is opened for write after inode_is_open_for_write() check, its write >>>> will wait for khugepaged collapse and see a new THP. Since the FS >>>> supports THP, writing to the new THP should be fine. >>>> >>>> Let me know if my analysis above makes sense. If yes, I will add it >>>> to the commit message and add a succinct comment about it before >>>> inode_is_open_for_write(). >>> >>> khugepaged code is the only code that replaces folios in the pagecache >>> by other folios. So my main concern is if that is problematic on >>> concurrent write access. >> >> folio_split() does it too, although it replaces a large folio with >> a bunch of after-split folios. It is a kinda reverse process of >> collapse_file(). > > Right. You won't start looking at a small folio and suddenly there is > something larger. > >> >> >>> >>> You argue that the folio lock is sufficient. That's certainly true for >>> individual folios, but I am more concerned about the replacement part. >> >> For the replacement part, both old and new folios are locked during >> the process. A parallel writer uses filemap_get_entry() to get the folio >> from mapping, but all of them check folio->mapping after acquiring the >> folio lock, except mincore_page() which is a reader. A writer can see >> either old folio or new folio during the process, but >> >> 1. if it sees the old one, it waits on the old folio lock. After >> it acquires the lock, it sees old_folio->mapping is NULL, no longer >> matches the original mapping. The writer will try again. >> >> 2. if it sees the new one, it waits on the new folio lock. After >> it acquires the lock, it sees new_folio->mapping matches the >> original mapping and proceeds to its writes. >> >> 3. if khugepaged needs to do a rollback, the old folio will stay >> the same and the writer will see the old one after it gets the old >> folio lock. > > I am primarily wondering about what would happen if someone traverses > the pageache, and found+processed 3 small folios. Suddenly there is a > large folio that covers the 3 small folios processes before. > > I suspect that is fine, because the code likely had to deal with > concurrent truncation+population if relevant locks are dropped already. > > Just raising it. > >> >>> >>> I don't have anything concrete, primarily just pointing out that this is >>> a change that might unlock some code paths that could not have been >>> triggered before. >> >> Yes, the concern makes sense. >> >> BTW, Claude is trying to convince me that even inode_is_open_for_write() >> is unecessary since 1) folio_test_dirty() before it has >> made sure the folio is clean, 2) try_to_unmap() and the locked folio prevents >> further writes. >> >> But then we find a hole between folio_test_dirty() and >> try_to_unmap() where a write via a writable mmap PTE can dirty the folio >> after folio_test_dirty() and try_to_unmap(). To remove that hole, >> the “if (!is_shmem && (folio_test_dirty(...) || folio_test_writeback(...))” >> needs to be moved after try_to_unmap(). With that, all to-be-collapsed >> folios will be clean, unmapped, and locked, where unmapped means >> writes via mmap need to fault and take the folio lock, locked means >> writes via mmap and write() need to wait until the folio is unlocked. >> >> Let me know if my reasoning makes sense. It is definitely worth the time >> and effort to ensure this patchset does not introduce any unexpected race >> condition or issue. > > Makes sense. > > Please clearly spell out that there is a slight change now, where we > might be collapsing after the file has been opened for write. Then you > can document that the folio locks should be protecting us from that. > > Implying that collapsing in writable files could likely "easily" done in > the future.
Definitely. Thank you for all the inputs. :) Best Regards, Yan, Zi

