On 11.07.19 г. 22:52 ч., Chris Mason wrote:
> On 11 Jul 2019, at 12:00, Nikolay Borisov wrote:
>
>> On 10.07.19 г. 22:28 ч., Tejun Heo wrote:
>>> From: Chris Mason <c...@fb.com>
>>>
>>> The btrfs writepages function collects a large range of pages flagged
>>> for delayed allocation, and then sends them down through the COW code
>>> for processing. When compression is on, we allocate one async_cow
>>
>> nit: The code no longer uses async_cow to represent in-flight chunks
>> but the more aptly named async_chunk. Presumably this patchset
>> predates those changes.
>
> Not by much, but yes.
>
>>
>>>
>>> The end result is that cowA is completed and cleaned up before cowB
>>> even starts processing. This means we can free locked_page() and
>>> reuse it elsewhere. If we get really lucky, it'll have the same
>>> page->index in its new home as it did before.
>>>
>>> While we're processing cowB, we might decide we need to fall back to
>>> uncompressed IO, and so compress_file_range() will call
>>> __set_page_dirty_nobuffers() on cowB->locked_page.
>>>
>>> Without cgroups in use, this creates a phantom dirty page, which
>>> isn't great but isn't the end of the world. With cgroups in use, we
>>
>> Having a phantom dirty page is not great but not terrible without
>> cgroups, but apart from that, does it have any other implications?
>
> Best case, it'll go through the writepage fixup worker and go through
> the whole cow machinery again. Worst case we go to this code more than
> once:
>
> 	/*
> 	 * if page_started, cow_file_range inserted an
> 	 * inline extent and took care of all the unlocking
> 	 * and IO for us. Otherwise, we need to submit
> 	 * all those pages down to the drive.
> 	 */
> 	if (!page_started && !ret)
> 		extent_write_locked_range(inode,
> 				async_extent->start,
> 				async_extent->start +
> 				async_extent->ram_size - 1,
> 				WB_SYNC_ALL);
> 	else if (ret)
> 		unlock_page(async_chunk->locked_page);
>
> That never happened in production as far as I can tell, but it seems
> possible.
>
>>
>>> might crash in the accounting code because page->mapping->i_wb isn't
>>> set.
>>>
>>> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
>>> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
>>> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
>>> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
>>> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
>>> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
>>> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
>>> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
>>> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
>>> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
>>> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
>>> [ 8308.582520] FS: 00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
>>> [ 8308.585440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
>>> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [ 8308.594469] Call Trace:
>>> [ 8308.595149]  account_page_cleaned+0x15b/0x1f0
>>> [ 8308.596340]  __cancel_dirty_page+0x146/0x200
>>> [ 8308.599395]  truncate_cleanup_page+0x92/0xb0
>>> [ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
>>> [ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
>>> [ 8308.619108]  evict+0xc1/0x190
>>> [ 8308.620023]  do_unlinkat+0x176/0x280
>>> [ 8308.621202]  do_syscall_64+0x63/0x1a0
>>> [ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>
>>> The fix here is to make async_cow->locked_page NULL everywhere but
>>> the one async_cow struct that's allowed to do things to the locked
>>> page.
>>>
>>> Signed-off-by: Chris Mason <c...@fb.com>
>>> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
>>> Reviewed-by: Josef Bacik <jo...@toxicpanda.com>
>>> ---
>>>  fs/btrfs/extent_io.c |  2 +-
>>>  fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
>>>  2 files changed, 22 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 5106008f5e28..a31574df06aa 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
>>>  		if (page_ops & PAGE_SET_PRIVATE2)
>>>  			SetPagePrivate2(pages[i]);
>>>
>>> -		if (pages[i] == locked_page) {
>>> +		if (locked_page && pages[i] == locked_page) {
>>
>> Why not make the check just if (locked_page) and then clean it up,
>> since if __process_pages_contig is called from the owner of the page
>> then it's guaranteed that the page will fall within its range.
>
> I'm not convinced that every single caller of __process_pages_contig
> is making sure to only send locked_page for ranges that correspond to
> the locked_page. I'm not sure exactly what you're asking for though;
> it looks like it would require some larger changes to the flow of that
> loop.

What I meant is to simply factor out the code dealing with the locked
page outside of the loop, while still placing it inside
__process_pages_contig.
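To make the suggestion concrete, here is a rough sketch of that refactor using toy stand-in types (this struct page and the "processed" flag are simplifications for illustration, not the kernel's definitions): the locked-page special case is hoisted out of the per-page loop, which then only needs a cheap identity check to skip it.

```c
#include <stddef.h>

/* Toy stand-in for struct page; "processed" stands in for the real work. */
struct page {
	unsigned long index;
	int processed;
	int kept_locked;
};

/*
 * Sketch of __process_pages_contig with the locked_page handling
 * factored out: the owner's page is dealt with exactly once up front,
 * so the loop body no longer carries the special case.
 */
static void process_pages_contig(struct page **pages, size_t nr,
				 struct page *locked_page)
{
	if (locked_page) {
		/* Special-case work for the owner's page, done once. */
		locked_page->kept_locked = 1;
		locked_page->processed = 1;
	}

	for (size_t i = 0; i < nr; i++) {
		if (pages[i] == locked_page)
			continue;	/* already handled above */
		pages[i]->processed = 1;	/* normal per-page work */
	}
}
```

A NULL locked_page (as in the compress_file_range / submit_compressed_extents paths) simply skips the up-front block, matching the `if (locked_page && ...)` check the patch adds.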
Also looking at the way locked_page is passed across different call
chains I arrive at:

compress_file_range <-- locked page is NULL
    extent_clear_unlock_delalloc
        __process_pages_contig

submit_compressed_extents <-- locked page is NULL
    extent_clear_unlock_delalloc
        __process_pages_contig

btrfs_run_delalloc_range
    run_delalloc_nocow
    cow_file_range <-- [when called from btrfs_run_delalloc_range we
        are all fine and dandy because it always iterates a range
        which belongs to the page, so we can free the page and set it
        to NULL for subsequent passes of the loop.]

Looking at run_delalloc_nocow I see the page is used 5 times - 2 of
those, at the beginning and end of the function, are only used in
error cases. The other 2 are when cow_start is different from -1,
which happens if !nocow is true. I've yet to wrap my head around
run_delalloc_nocow but I think it should also be safe to pass the
locked page just once.

cow_file_range_async <-- always called with the correct locked page;
        in this case the function is called before any async chunks
        are going to be submitted.
    extent_clear_unlock_delalloc
        __process_pages_contig

btrfs_run_delalloc_range <-- this one is called with locked_page
        belonging to the passed delalloc range.
    run_delalloc_nocow
        extent_clear_unlock_delalloc
            __process_pages_contig

writepage_delalloc <-- calls find_lock_delalloc_range only if we
        aren't called from the compress path, and the start of the
        range always belongs to the page
    find_lock_delalloc_range <-- if the range is not delalloc it will
            retry, but that function is also called with the correct
            page.
        lock_delalloc_pages <-- skips locked_page; the range belongs
            to this page
        __unlock_for_delalloc <-- skips locked_page; the range belongs
            to this page

<snip>
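For completeness, the shape of the fix can be modeled roughly like this (a minimal sketch with toy types, not the real btrfs structs; the real patch hands locked_page to the one async_chunk whose range covers the page, which this sketch assumes is the first chunk): every other chunk gets NULL, so no async worker can redirty or unlock a page it does not own.

```c
#include <stddef.h>

struct page { unsigned long index; };

/* Toy stand-in for btrfs' async_chunk. */
struct async_chunk {
	unsigned long start;
	unsigned long end;
	struct page *locked_page;
};

/*
 * Split [start, end] into nr contiguous chunks.  Only the first chunk
 * keeps the locked_page pointer; all others get NULL, mirroring the
 * "NULL everywhere but the one struct that's allowed to touch the
 * locked page" rule from the commit message.
 */
static void setup_async_chunks(struct async_chunk *chunks, size_t nr,
			       unsigned long start, unsigned long end,
			       struct page *locked_page)
{
	unsigned long cur = start;
	unsigned long len = (end - start + 1) / nr;

	for (size_t i = 0; i < nr; i++) {
		chunks[i].start = cur;
		chunks[i].end = (i == nr - 1) ? end : cur + len - 1;
		chunks[i].locked_page = (i == 0) ? locked_page : NULL;
		cur = chunks[i].end + 1;
	}
}
```

With this invariant, any later pass over a chunk's pages (including the error paths quoted above) can only ever unlock or redirty locked_page through the single chunk that owns it.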