On Sat, 23 Nov 2013 11:22:05 +0800 fengguang...@intel.com wrote:

> Shaohua,
> 
> FYI, we are still seeing this bug.. dmesg attached.

Thanks for the report.  However the dmesg you attached doesn't mention:

     kernel BUG at drivers/md/raid5.c:693!

at all.  It is quite different.

The "BUG" it reports is that a spinlock is blocking for too long - over
99msecs.
As far as I can tell, this is a scheduling issue, possibly specific to qemu,
rather than a bug in raid5.

md/raid5 calls md_wakeup_thread while holding a spinlock, and that call seems
to get lost in the scheduler for long enough that a different thread which
also wants the spinlock starts to complain.
Calling 'wake_up' from under a spinlock has always been safe before.
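
Roughly, the code in question has this shape (an illustrative sketch of the
call sequence, not a verbatim quote of raid5.c; exactly which raid5 spinlock
is held at the time doesn't matter for the question):

    /* in raid5, e.g. around raid5_wakeup_stripe_thread(): */
    spin_lock_irq(&conf->device_lock);      /* or one of the new hash locks */
    md_wakeup_thread(mddev->thread);
    spin_unlock_irq(&conf->device_lock);

    /* md_wakeup_thread() itself does roughly: */
    void md_wakeup_thread(struct md_thread *thread)
    {
            if (thread) {
                    set_bit(THREAD_WAKEUP, &thread->flags);
                    /* __wake_up() takes wqueue->lock and calls
                     * try_to_wake_up() on the md thread */
                    wake_up(&thread->wqueue);
            }
    }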

Ingo/Peter: is it considered OK to call wake_up while holding a spinlock?
Could "sleeping spinlocks" affect this at all? (some sample stack traces are
below).

Thanks,
NeilBrown

> 
> 566c09c53455d7c4f1130928ef8071da1a24ea65 is the first bad commit
> commit 566c09c53455d7c4f1130928ef8071da1a24ea65
> Author: Shaohua Li <s...@kernel.org>
> Date:   Thu Nov 14 15:16:17 2013 +1100
> 
>     raid5: relieve lock contention in get_active_stripe()
>     
>     get_active_stripe() is the last place we have lock contention. It has
>     two paths: one where the stripe isn't found and a new stripe is
>     allocated, and one where the stripe is found.
>
>     The first path basically calls __find_stripe and init_stripe. It
>     accesses conf->generation, conf->previous_raid_disks, conf->raid_disks,
>     conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
>     conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list.
>     Except for stripe_hashtbl and inactive_list, these fields change very
>     rarely.
>
>     With this patch, we split inactive_list and add new hash locks. Each
>     free stripe belongs to a specific inactive list, determined by the
>     stripe's lock_hash. Note that even a stripe without a sector assigned
>     has a lock_hash assigned. A stripe's inactive list is protected by a
>     hash lock, which is also determined by its lock_hash. The lock_hash is
>     derived from the current stripe_hashtbl hash, which guarantees that any
>     stripe_hashtbl list is assigned to a specific lock_hash, so we can use
>     the new hash locks to protect the stripe_hashtbl lists too. The goal of
>     the new hash locks is that the first path of get_active_stripe() only
>     needs to take the new locks. Since we have several hash locks, lock
>     contention is relieved significantly.
>
>     The first path of get_active_stripe() also accesses the other fields;
>     since they change rarely, changing them now requires taking
>     conf->device_lock and all hash locks. For a slow path, this isn't a
>     problem.
>
>     If we need both device_lock and a hash lock, we always take the hash
>     lock first. The tricky part is release_stripe and friends, which need
>     to take device_lock first. Neil's suggestion is to put inactive stripes
>     on a temporary list and re-add them to inactive_list after device_lock
>     is released. In this way, we add stripes to the temporary list with
>     device_lock held and remove stripes from the list with the hash lock
>     held. Since we don't allow concurrent access to the temporary list,
>     each participant of release_stripe needs its own temporary list.
>
>     One downside is that free stripes are kept on their own inactive list
>     and can't move between lists. By default, we have 256 stripes in total
>     and 8 lists, so each list will have 32 stripes. It's possible that one
>     list has a free stripe while another hasn't, but the chance should be
>     rare because stripe allocation is evenly distributed. And we can always
>     allocate more stripes for the cache; several megabytes of memory isn't
>     a big deal.
>
>     This completely removes the lock contention from the first path of
>     get_active_stripe(). It slows down the second path a little because we
>     now need to take two locks, but since the hash lock isn't contended,
>     the overhead should be quite small (a few atomic instructions). The
>     second path of get_active_stripe() (basically sequential writes, or
>     random writes with big request sizes) still has lock contention.
>     
>     Signed-off-by: Shaohua Li <s...@fusionio.com>
>     Signed-off-by: NeilBrown <ne...@suse.de>
> 
> :040000 040000 88fa28d1decc5454cf4d58421fa3eb12bc9ad524 5d1f104188f72b17d10cd569bf2924dab5d789cb M    drivers
> bisect run success
> 
> # bad: [02ffe4cc90dce5a1bbee5daae98a40a431c29c6d] Merge 'yuanhan/slub-experimental' into devel-hourly-2013112217
> # good: [5e01dc7b26d9f24f39abace5da98ccbd6a5ceb52] Linux 3.12
> git bisect start '02ffe4cc90dce5a1bbee5daae98a40a431c29c6d' '5e01dc7b26d9f24f39abace5da98ccbd6a5ceb52' '--'
> # good: [5cbb3d216e2041700231bcfc383ee5f8b7fc8b74] Merge branch 'akpm' (patches from Andrew Morton)
> git bisect good 5cbb3d216e2041700231bcfc383ee5f8b7fc8b74
> # good: [d5ceede8dc86278d16dcad8f916ef323b5672bd8] drivers/rtc/rtc-hid-sensor-time.c: use dev_get_platdata()
> git bisect good d5ceede8dc86278d16dcad8f916ef323b5672bd8
> # good: [ffd3c0260aeeb1fd4d36378d2e06e6410661dd0f] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
> git bisect good ffd3c0260aeeb1fd4d36378d2e06e6410661dd0f
> # good: [1ee2dcc2245340cf4ac94b99c4d00efbeba61824] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> git bisect good 1ee2dcc2245340cf4ac94b99c4d00efbeba61824
> # bad: [a50f09550a7dd58b4514630fec1676855084ec1f] Merge 'vireshk/cpufreq-next' into devel-hourly-2013112217
> git bisect bad a50f09550a7dd58b4514630fec1676855084ec1f
> # bad: [af2e2f328052082f58f041d574ed50c7f21c598f] Merge tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next
> git bisect bad af2e2f328052082f58f041d574ed50c7f21c598f
> # good: [df12a3178d340319b1955be6b973a4eb84aff754] Merge commit 'dmaengine-3.13-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine
> git bisect good df12a3178d340319b1955be6b973a4eb84aff754
> # good: [ed6a82546d2e8f6b5902269541733814d4adacc2] Merge branch 'acpi-hotplug'
> git bisect good ed6a82546d2e8f6b5902269541733814d4adacc2
> # bad: [6d6e352c80f22c446d933ca8103e02bac1f09129] Merge tag 'md/3.13' of git://neil.brown.name/md
> git bisect bad 6d6e352c80f22c446d933ca8103e02bac1f09129
> # bad: [30b8feb730f9b9b3c5de02580897da03f59b6b16] md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
> git bisect bad 30b8feb730f9b9b3c5de02580897da03f59b6b16
> # bad: [566c09c53455d7c4f1130928ef8071da1a24ea65] raid5: relieve lock contention in get_active_stripe()
> git bisect bad 566c09c53455d7c4f1130928ef8071da1a24ea65
> # good: [02e5f5c0a0f726e66e3d8506ea1691e344277969] md: fix calculation of stacking limits on level change.
> git bisect good 02e5f5c0a0f726e66e3d8506ea1691e344277969
> # good: [82e06c811163c4d853ed335d56c3378088bc89cc] wait: add wait_event_cmd()
> git bisect good 82e06c811163c4d853ed335d56c3378088bc89cc
> # first bad commit: [566c09c53455d7c4f1130928ef8071da1a24ea65] raid5: relieve lock contention in get_active_stripe()
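
For context, the locking scheme the quoted commit introduces looks roughly
like this (a simplified sketch with approximate names; the real code is in
drivers/md/raid5.c and raid5.h):

    /* per-hash free-stripe lists, each protected by its own lock
     * (8 lists/locks by default) */
    spinlock_t       hash_locks[NR_STRIPE_HASH_LOCKS];
    struct list_head inactive_list[NR_STRIPE_HASH_LOCKS];

    /* fast path of get_active_stripe(): only one hash lock is taken */
    hash = stripe_hash_locks_hash(sector);
    spin_lock_irq(&conf->hash_locks[hash]);
    sh = __find_stripe(conf, sector, conf->generation);
    if (!sh)
            sh = get_free_stripe(conf, hash);  /* from inactive_list[hash] */
    spin_unlock_irq(&conf->hash_locks[hash]);

    /* lock ordering when both are needed: hash lock, then device_lock.
     * Rarely-changed fields are updated under device_lock plus all of
     * the hash locks. */
    spin_lock_irq(&conf->hash_locks[hash]);
    spin_lock(&conf->device_lock);
    /* ... */
    spin_unlock(&conf->device_lock);
    spin_unlock_irq(&conf->hash_locks[hash]);

    /* release_stripe and friends take device_lock first, so freed
     * stripes go onto a per-caller temporary list and are only moved
     * back to inactive_list[hash] after device_lock is dropped. */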


This process waited too long for a spinlock:


[   82.440547] BUG: spinlock lockup suspected on CPU#0, btrfs-submit-1/2938
[   82.440547]  lock: 0xffff88007f0aed20, .magic: dead4ead, .owner: md0_raid5/2906, .owner_cpu: 2
[   82.440547] CPU: 0 PID: 2938 Comm: btrfs-submit-1 Not tainted 3.12.0-11088-gc710863 #1767
[   82.440547] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   82.440547]  0000000000000001 ffff8803cd1e3a10 ffffffff81a415f8 ffff8804081e8000
[   82.440547]  ffff8803cd1e3a30 ffffffff81a3a102 ffff88007f0aed20 00000000a08d1ad0
[   82.440547]  ffff8803cd1e3a58 ffffffff811055d4 ffff88007f0aed38 ffff88007f0aed20
[   82.440547] Call Trace:
[   82.440547]  [<ffffffff81a415f8>] dump_stack+0x4d/0x66
[   82.440547]  [<ffffffff81a3a102>] spin_dump+0x91/0x96
[   82.440547]  [<ffffffff811055d4>] do_raw_spin_lock+0xc9/0x10e
[   82.440547]  [<ffffffff81a4c7b3>] _raw_spin_lock+0x5b/0x6d
[   82.440547]  [<ffffffff8189569b>] ? get_active_stripe+0x605/0x6a9
[   82.440547]  [<ffffffff8189569b>] get_active_stripe+0x605/0x6a9
[   82.440547]  [<ffffffff810fc068>] ? prepare_to_wait+0x6f/0x77
[   82.440547]  [<ffffffff81898fd9>] make_request+0x69a/0xac5
[   82.440547]  [<ffffffff818b58d1>] ? md_make_request+0x18b/0x2ce
[   82.440547]  [<ffffffff810fc321>] ? __wake_up_sync+0x12/0x12
[   82.440547]  [<ffffffff818b58d1>] md_make_request+0x18b/0x2ce
[   82.440547]  [<ffffffff818b57ae>] ? md_make_request+0x68/0x2ce
[   82.440547]  [<ffffffff818906e5>] ? raid5_unplug+0x111/0x120
[   82.440547]  [<ffffffff814e68bf>] generic_make_request+0x9f/0xe0
[   82.440547]  [<ffffffff814e69ea>] submit_bio+0xea/0x12a
[   82.440547]  [<ffffffff8145a449>] run_scheduled_bios+0x26c/0x455
[   82.440547]  [<ffffffff8145a647>] pending_bios_fn+0x15/0x17
[   82.440547]  [<ffffffff81461925>] worker_loop+0x15c/0x4ee
[   82.440547]  [<ffffffff814617c9>] ? btrfs_queue_worker+0x280/0x280
[   82.440547]  [<ffffffff810e328d>] kthread+0xf9/0x101
[   82.440547]  [<ffffffff810f21ae>] ? local_clock+0x41/0x5a
[   82.440547]  [<ffffffff810e3194>] ? kthread_create_on_node+0x1a4/0x1a4
[   82.440547]  [<ffffffff81a5487c>] ret_from_fork+0x7c/0xb0
[   82.440547]  [<ffffffff810e3194>] ? kthread_create_on_node+0x1a4/0x1a4


This process was holding it:

[   78.658626] NMI backtrace for cpu 2
[   78.658626] CPU: 2 PID: 2906 Comm: md0_raid5 Not tainted 3.12.0-11088-gc710863 #1767
[   78.658626] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   78.658626] task: ffff8804081e8000 ti: ffff8803cd1e6000 task.ti: ffff8803cd1e6000
[   78.658626] RIP: 0010:[<ffffffff81067501>]  [<ffffffff81067501>] pvclock_clocksource_read+0x10/0xa7
[   78.658626] RSP: 0018:ffff8803cd1e7b90  EFLAGS: 00000082
[   78.658626] RAX: 0000000000000002 RBX: ffff88040ebd4ac0 RCX: ffff8803cfba57d8
[   78.658626] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff88041f7e9080
[   78.658626] RBP: ffff8803cd1e7ba8 R08: 0000000000000000 R09: ffff8803cfba57d8
[   78.658626] R10: 0000000000000008 R11: ffffffff810fbf46 R12: 0000000000000000
[   78.658626] R13: 0000000000000002 R14: ffff8803cfba57a8 R15: 0000000000000046
[   78.658626] FS:  0000000000000000(0000) GS:ffff88040ea00000(0000) knlGS:0000000000000000
[   78.658626] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   78.658626] CR2: 00000000006f86e4 CR3: 000000007f17e000 CR4: 00000000000006e0
[   78.658626] Stack:
[   78.658626]  ffff8803cfba57d8 ffff88040ebd4ac0 0000000000000000 ffff8803cd1e7bb8
[   78.658626]  ffffffff81066a05 ffff8803cd1e7bc8 ffffffff8103f9ce ffff8803cd1e7be8
[   78.658626]  ffffffff810f1f6d ffff88040ebd4ac0 00000000001d4ac0 ffff8803cd1e7c10
[   78.658626] Call Trace:
[   78.658626]  [<ffffffff81066a05>] kvm_clock_read+0x27/0x31
[   78.658626]  [<ffffffff8103f9ce>] sched_clock+0x9/0xd
[   78.658626]  [<ffffffff810f1f6d>] sched_clock_local+0x12/0x72
[   78.658626]  [<ffffffff810f20f5>] sched_clock_cpu+0xc1/0xcc
[   78.658626]  [<ffffffff810f2197>] local_clock+0x2a/0x5a
[   78.658626]  [<ffffffff81101202>] lock_release_holdtime.part.7+0xf/0xd9
[   78.658626]  [<ffffffff810fbf68>] ? __wake_up+0x44/0x4c
[   78.658626]  [<ffffffff811037b0>] lock_release+0x185/0x1ed
[   78.658626]  [<ffffffff81a4cb3e>] _raw_spin_unlock_irqrestore+0x24/0x5e
[   78.658626]  [<ffffffff810fbf68>] __wake_up+0x44/0x4c
[   78.658626]  [<ffffffff818b86f8>] md_wakeup_thread+0x55/0x58
[   78.658626]  [<ffffffff8188e8eb>] raid5_wakeup_stripe_thread+0xab/0x169
[   78.658626]  [<ffffffff81897eba>] raid5d+0x1ac/0x4a8
[   78.658626]  [<ffffffff811012bf>] ? lock_release_holdtime.part.7+0xcc/0xd9
[   78.658626]  [<ffffffff818b8c9c>] md_thread+0x10f/0x125
[   78.658626]  [<ffffffff810fc321>] ? __wake_up_sync+0x12/0x12
[   78.658626]  [<ffffffff818b8b8d>] ? md_register_thread+0xd5/0xd5
[   78.658626]  [<ffffffff810e328d>] kthread+0xf9/0x101
[   78.658626]  [<ffffffff810f21ae>] ? local_clock+0x41/0x5a
[   78.658626]  [<ffffffff810e3194>] ? kthread_create_on_node+0x1a4/0x1a4
[   78.658626]  [<ffffffff81a5487c>] ret_from_fork+0x7c/0xb0
[   78.658626]  [<ffffffff810e3194>] ? kthread_create_on_node+0x1a4/0x1a4
[   78.658626] Code: 47 1d 66 66 90 0f ae e8 8b 17 f6 c2 01 75 e6 39 d1 75 e2 22 05 ad b8 2e 01 5d c3 55 48 89 e5 41 54 53 51 8b 37 66 66 90 0f ae e8 <0f> 31 48 c1 e2 20 89 c0 0f be 4f 1c 48 09 c2 48 2b 57 08 44 8b
[   78.658626] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 99.903 msecs
