Re: Lockdep warning after `mdadm -S`

Nikolay Borisov Wed, 10 Jun 2020 04:00:38 -0700

On 10.06.20 at 10:19, Michał Mirosław wrote:
> Dear Developers,
> 
> I found a lockdep warning in dmesg some time after doing 'mdadm -S' while
> also having btrfs mounted (light to no I/O load).  Disks under MD and
> btrfs are unrelated.

Huhz, I think that's genuine, because btrfs and md share at least the bd_mutex
and the workqueue. So the scenario could go something along the lines of:

T1: initiates process of stopping md, holds bd_mutex, open_mutex,
    blocks on flushing md_misc_wq, callstack #6.

T2: should begin mddev_delayed_delete but never does because the workqueue
    is blocked due to T3 being blocked.

T3: wants to begin a new transaction but blocks on callstack #0 due to a
    running transaction commit in T4. NB: this happens from writeback
    context, i.e. same as T2 workqueue. bd_mutex held by T1.

T4: transaction commit blocked on device_list_mutex, callstack #3,
    holding sb_internal.

T5: user tries to (erroneously) mount mddev but blocks on callstack #4
    since T1 holds bd_open, and is holding device_list_mutex.

So T5 blocks T4, which blocks T3, which blocks the shared writeback workqueue.
This prevents T2 from running, which, when done, would cause T1 to unlock
bd_mutex, which would unblock T5. I think this is generally possible but
highly unlikely.
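
As a sanity check of that chain, here is a minimal sketch in Python (a toy
model, not kernel code). It records only who waits on whom, taken from the
scenario above. The `waits_on` mapping and the `find_wait_cycle` helper are
made-up names for illustration, and the model assumes each thread waits on
exactly one other thread.

```python
# Who each thread is waiting on, as described in the scenario above.
waits_on = {
    "T1": "T2",  # T1 flushes md_misc_wq, which needs T2's work to run
    "T2": "T3",  # T2 cannot start while T3 keeps the workqueue blocked
    "T3": "T4",  # T3 waits for the transaction commit running in T4
    "T4": "T5",  # T4 wants device_list_mutex, held by T5
    "T5": "T1",  # T5 wants the lock held by T1
}


def find_wait_cycle(waits_on, start):
    """Follow the waits-on chain from `start`; return the cycle or None."""
    path = []
    seen = set()
    cur = start
    while cur is not None and cur not in seen:
        seen.add(cur)
        path.append(cur)
        cur = waits_on.get(cur)
    if cur is None:
        # The chain ended at a thread that waits on nobody: no deadlock.
        return None
    # `cur` was visited before, so everything from its first visit is a cycle.
    return path[path.index(cur):] + [cur]


cycle = find_wait_cycle(waits_on, "T5")
print(" -> ".join(cycle))
```

If any single edge is removed from the mapping, the walk ends at a thread that
waits on nobody and the helper returns None, which matches the point that one
thread making progress would unblock the rest.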

Also, looking at the code in T5 (callstack #4 below), it seems that the same
could happen if a scan ioctl is sent for the mddev. Discussing this with
peterz, he proposed the following half-baked patch:
https://paste.debian.net/1151314/

The idea is to remove the md_open mutex, which would break the dependency
chain between #4->#6. What do mdraid people think?
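
To illustrate what breaking the chain means, here is a rough sketch in Python
(a toy model, not kernel code) of the lock classes #0..#8 from the report
quoted below. It assumes the recorded edges are exactly #0->#1->...->#8 and
that dropping the open_mutex class does not introduce a new direct edge from
bd_mutex to the md_misc workqueue. The short lock names and the helpers
`build_edges` and `reachable` are invented for the example.

```python
# Lock classes, indexed as in the lockdep report (#0 .. #8).
chain = [
    "sb_internal",        # 0
    "reloc_mutex",        # 1
    "tree_log_mutex",     # 2
    "device_list_mutex",  # 3
    "bd_mutex",           # 4
    "open_mutex",         # 5
    "md_misc wq",         # 6
    "del_work",           # 7
    "wb dwork",           # 8
]


def build_edges(chain, dropped=()):
    """Edge a -> b: b was acquired (or waited for) while a was held."""
    edges = {name: set() for name in chain if name not in dropped}
    for a, b in zip(chain, chain[1:]):
        if a in edges and b in edges:
            edges[a].add(b)
    return edges


def reachable(edges, src, dst):
    """Depth-first search: is there a dependency path from src to dst?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return False


# The warning fires because the worker holds "wb dwork" and wants
# "sb_internal" while a path sb_internal -> ... -> wb dwork already exists.
with_mutex = reachable(build_edges(chain), "sb_internal", "wb dwork")
without_mutex = reachable(
    build_edges(chain, dropped={"open_mutex"}), "sb_internal", "wb dwork"
)
print(with_mutex, without_mutex)
```

Under those assumptions the path exists with open_mutex present and disappears
once it is dropped. Whether the real patch achieves that depends on what
replaces the mutex, which this model does not cover.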
                                                                                
        
> 
> Best Regards,
> Michał Mirosław
> 
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.7.1mq+ #383 Tainted: G           O     
> ------------------------------------------------------
> kworker/u16:3/8175 is trying to acquire lock:
> ffff8882f19556a0 (sb_internal#3){.+.+}-{0:0}, at: start_transaction+0x37e/0x550 [btrfs]
> 
> but task is already holding lock:
> ffffc900087c7e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x235/0x620
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}:
>        __flush_work+0x331/0x490
>        wb_shutdown+0x8f/0xb0
>        bdi_unregister+0x72/0x1f0
>        del_gendisk+0x2b0/0x2c0
>        md_free+0x28/0x90
>        kobject_put+0xa6/0x1b0
>        process_one_work+0x2b6/0x620
>        worker_thread+0x35/0x3e0
>        kthread+0x143/0x160
>        ret_from_fork+0x3a/0x50
> 
> -> #7 ((work_completion)(&mddev->del_work)){+.+.}-{0:0}:
>        process_one_work+0x28d/0x620
>        worker_thread+0x35/0x3e0
>        kthread+0x143/0x160
>        ret_from_fork+0x3a/0x50
> 
> -> #6 ((wq_completion)md_misc){+.+.}-{0:0}:
>        flush_workqueue+0xa9/0x4e0
>        __md_stop_writes+0x18/0x100
>        do_md_stop+0x165/0x2d0
>        md_ioctl+0xa52/0x1d60
>        blkdev_ioctl+0x1cc/0x2a0
>        block_ioctl+0x3a/0x40
>        ksys_ioctl+0x81/0xc0
>        __x64_sys_ioctl+0x11/0x20
>        do_syscall_64+0x4f/0x210
>        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> 
> -> #5 (&mddev->open_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x93/0x9c0
>        md_open+0x43/0xc0
>        __blkdev_get+0xea/0x560
>        blkdev_get+0x60/0x130
>        do_dentry_open+0x147/0x3e0
>        path_openat+0x84f/0xa80
>        do_filp_open+0x8e/0x100
>        do_sys_openat2+0x225/0x2e0
>        do_sys_open+0x46/0x80
>        do_syscall_64+0x4f/0x210
>        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> 
> -> #4 (&bdev->bd_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x93/0x9c0
>        __blkdev_get+0x77/0x560
>        blkdev_get+0x60/0x130
>        blkdev_get_by_path+0x41/0x80
>        btrfs_get_bdev_and_sb+0x16/0xb0 [btrfs]
>        open_fs_devices+0x9d/0x240 [btrfs]
>        btrfs_open_devices+0x89/0x90 [btrfs]
>        btrfs_mount_root+0x26a/0x4b0 [btrfs]
>        legacy_get_tree+0x2b/0x50
>        vfs_get_tree+0x23/0xc0
>        fc_mount+0x9/0x40
>        vfs_kern_mount.part.40+0x57/0x80
>        btrfs_mount+0x148/0x3f0 [btrfs]
>        legacy_get_tree+0x2b/0x50
>        vfs_get_tree+0x23/0xc0
>        do_mount+0x712/0xa40
>        __x64_sys_mount+0xbf/0xe0
>        do_syscall_64+0x4f/0x210
>        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> 
> -> #3 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x93/0x9c0
>        btrfs_run_dev_stats+0x44/0x470 [btrfs]
>        commit_cowonly_roots+0xac/0x2a0 [btrfs]
>        btrfs_commit_transaction+0x511/0xa70 [btrfs]
>        transaction_kthread+0x13c/0x160 [btrfs]
>        kthread+0x143/0x160
>        ret_from_fork+0x3a/0x50
> 
> -> #2 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x93/0x9c0
>        btrfs_commit_transaction+0x4b6/0xa70 [btrfs]
>        transaction_kthread+0x13c/0x160 [btrfs]
>        kthread+0x143/0x160
>        ret_from_fork+0x3a/0x50
> 
> -> #1 (&fs_info->reloc_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x93/0x9c0
>        btrfs_record_root_in_trans+0x3e/0x60 [btrfs]
>        start_transaction+0xcb/0x550 [btrfs]
>        btrfs_mkdir+0x5c/0x1e0 [btrfs]
>        vfs_mkdir+0x107/0x1d0
>        do_mkdirat+0xe7/0x110
>        do_syscall_64+0x4f/0x210
>        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> 
> -> #0 (sb_internal#3){.+.+}-{0:0}:
>        __lock_acquire+0x11f9/0x1aa0
>        lock_acquire+0x9e/0x380
>        __sb_start_write+0x13a/0x270
>        start_transaction+0x37e/0x550 [btrfs]
>        cow_file_range_inline.constprop.74+0xe4/0x640 [btrfs]
>        cow_file_range+0xe5/0x3f0 [btrfs]
>        btrfs_run_delalloc_range+0x128/0x620 [btrfs]
>        writepage_delalloc+0xe2/0x140 [btrfs]
>        __extent_writepage+0x1a3/0x370 [btrfs]
>        extent_write_cache_pages+0x2b8/0x470 [btrfs]
>        extent_writepages+0x3f/0x90 [btrfs]
>        do_writepages+0x3c/0xe0
>        __writeback_single_inode+0x4f/0x650
>        writeback_sb_inodes+0x1f7/0x560
>        __writeback_inodes_wb+0x58/0xa0
>        wb_writeback+0x33b/0x4b0
>        wb_workfn+0x428/0x5b0
>        process_one_work+0x2b6/0x620
>        worker_thread+0x35/0x3e0
>        kthread+0x143/0x160
>        ret_from_fork+0x3a/0x50
> 
> other info that might help us debug this:
> 
> Chain exists of:
>   sb_internal#3 --> (work_completion)(&mddev->del_work) --> (work_completion)(&(&wb->dwork)->work)
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock((work_completion)(&(&wb->dwork)->work));
>                                lock((work_completion)(&mddev->del_work));
>                                lock((work_completion)(&(&wb->dwork)->work));
>   lock(sb_internal#3);
> 
>  *** DEADLOCK ***
> 
> 3 locks held by kworker/u16:3/8175:
>  #0: ffff88840baa6948 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x235/0x620
>  #1: ffffc900087c7e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x235/0x620
>  #2: ffff8882f19550e8 (&type->s_umount_key#52){++++}-{3:3}, at: trylock_super+0x11/0x50
> 
> stack backtrace:
> CPU: 1 PID: 8175 Comm: kworker/u16:3 Tainted: G           O      5.7.1mq+ #383
> Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 3603 11/09/2012
> Workqueue: writeback wb_workfn (flush-btrfs-1)
> Call Trace:
>  dump_stack+0x71/0xa0
>  check_noncircular+0x165/0x180
>  ? stack_trace_save+0x46/0x70
>  __lock_acquire+0x11f9/0x1aa0
>  lock_acquire+0x9e/0x380
>  ? start_transaction+0x37e/0x550 [btrfs]
>  __sb_start_write+0x13a/0x270
>  ? start_transaction+0x37e/0x550 [btrfs]
>  start_transaction+0x37e/0x550 [btrfs]
>  ? kmem_cache_alloc+0x1b0/0x2c0
>  cow_file_range_inline.constprop.74+0xe4/0x640 [btrfs]
>  ? lock_acquire+0x9e/0x380
>  ? test_range_bit+0x3d/0x130 [btrfs]
>  cow_file_range+0xe5/0x3f0 [btrfs]
>  btrfs_run_delalloc_range+0x128/0x620 [btrfs]
>  ? find_lock_delalloc_range+0x1f3/0x220 [btrfs]
>  writepage_delalloc+0xe2/0x140 [btrfs]
>  __extent_writepage+0x1a3/0x370 [btrfs]
>  extent_write_cache_pages+0x2b8/0x470 [btrfs]
>  ? __lock_acquire+0x3fc/0x1aa0
>  extent_writepages+0x3f/0x90 [btrfs]
>  do_writepages+0x3c/0xe0
>  ? find_held_lock+0x2d/0x90
>  __writeback_single_inode+0x4f/0x650
>  writeback_sb_inodes+0x1f7/0x560
>  __writeback_inodes_wb+0x58/0xa0
>  wb_writeback+0x33b/0x4b0
>  wb_workfn+0x428/0x5b0
>  ? sched_clock_cpu+0xe/0xd0
>  process_one_work+0x2b6/0x620
>  ? worker_thread+0xc7/0x3e0
>  worker_thread+0x35/0x3e0
>  ? process_one_work+0x620/0x620
>  kthread+0x143/0x160
>  ? kthread_park+0x80/0x80
>  ret_from_fork+0x3a/0x50
>