On 01/09/16 11:48, Alexey Kardashevskiy wrote:
> On 31/08/16 17:28, Peter Zijlstra wrote:
>> On Wed, Aug 31, 2016 at 01:41:33PM +1000, Balbir Singh wrote:
>>> On 30/08/16 22:19, Peter Zijlstra wrote:
>>>> On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
>>>>>
>>>>>
>>>>> The origin of the issue I've seen seems to be related to
>>>>> rwsem spin lock stealing. Basically I see the system deadlocked in the
>>>>> following state
>>>>
>>>> As Nick says (good to see you're back, Nick!), this is unrelated to
>>>> rwsems.
>>>>
>>>> This is true for pretty much every blocking wait loop out there, they
>>>> all do:
>>>>
>>>>	for (;;) {
>>>>		current->state = UNINTERRUPTIBLE;
>>>>		smp_mb();
>>>>		if (cond)
>>>>			break;
>>>>		schedule();
>>>>	}
>>>>	current->state = RUNNING;
>>>>
>>>> Which, if the wakeup is spurious, is just the pattern you need.
>>>
>>> Yes, true! My bad; Alexey had seen the same basic pattern, I should have
>>> been clearer in my commit log. Should I resend the patch?
>>
>> Yes please.
>>
>>>> There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
>>>> to PPC, is _not_ MB. It is however sufficient for this case.
>>>>
>>>
>>> The MB comes from the __switch_to() in schedule(). Ben mentioned it in a
>>> different thread.
>>
>> Right, although even without that, there is sufficient ordering, as the
>> rq unlock from the wakeup, coupled with the rq lock from the schedule,
>> already forms a load-store barrier.
>>
>>>> Now, this has been present for a fair while, I suspect ever since we
>>>> reworked the wakeup path to not use rq->lock twice. Curious you only now
>>>> hit it.
>>>>
>>>
>>> Yes, I just hit it a week or two back and I needed to collect data to
>>> explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
>>> I needed a system with a large number of threads and less memory running
>>> stress-ng. Reproducing the problem takes an unpredictable amount of time.
>>
>> What hardware do you see this on? Is it shiny new Power8 chips which
>> have never before seen deep queues or something, or is it 'regular' old
>> Power7-like stuff?
>
> I am seeing it on POWER8 with KVM and 2 guests, each having 3 virtio-net
> devices with vhost enabled. All virtio-net devices are connected to the
> same virtual bridge on the host (via /dev/tap*) and are doing lots of
> traffic, just between these 2 guests.
>
> I remember doing the same test on POWER7 more than 2 years ago and finding
> missing barriers in virtio, but nothing like this one. POWER7 is seriously
> slower than POWER8, though, so it seems that nobody bothered with loading
> it that much.
>
> I wonder how to reproduce the bug quicker, as sometimes it runs for days
> with no fault but sometimes it fails within the first 30 minutes (the
> backtraces from the 2 stuck CPUs are the same each time, though). Does
> anyone have an idea (kernel hacks, taskset, type of traffic, etc.)? As Nick
> suggested, I changed cpus_share_cache() to return "false" so ttwu_queue()
> would always go via the ttwu_queue_remote() path, but this did not make any
> difference.
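For reference, the generic wait loop Peter quotes above, written with the
helper that already includes the required barrier (just a sketch of the
pattern, not the actual rwsem code; set_current_state() expands to
smp_store_mb(), so the condition load cannot be reordered before the state
store):

	for (;;) {
		/* store to ->state plus a full smp_mb() */
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (cond)
			break;
		schedule();
	}
	/* plain store, no barrier needed once we are running again */
	__set_current_state(TASK_RUNNING);

And the hack mentioned above, in case anyone wants to try the same thing
(it just forces the ttwu_queue_remote() path; roughly this change in
kernel/sched/core.c):

	bool cpus_share_cache(int this_cpu, int that_cpu)
	{
		/*
		 * Debugging hack only: pretend no two CPUs share a cache,
		 * instead of comparing the per-cpu sd_llc_id values.
		 */
		return false;
	}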
Below are the backtraces of the 2 stuck CPUs (they could not be stopped in
xmon like the others; I got these via the BMC). I am also adding Michael in
cc:, just in case...

0:mon> t 0xc000000fd4ab79b0
[c000000fd4ab79b0] c0000000001330c4 do_raw_spin_lock+0x1f4/0x260
[c000000fd4ab79f0] c000000000a368e8 _raw_spin_lock_irqsave+0x98/0xd0
[c000000fd4ab7a30] c00000000011ecbc remove_wait_queue+0x2c/0x70
[c000000fd4ab7a70] d0000000166e0364 vhost_poll_stop+0x34/0x60 [vhost]
[c000000fd4ab7aa0] d0000000167a08c8 handle_rx+0x168/0x8d0 [vhost_net]
[c000000fd4ab7c80] d0000000166e04e0 vhost_worker+0x120/0x1d0 [vhost]
[c000000fd4ab7d00] c0000000000ebad8 kthread+0x118/0x140
[c000000fd4ab7e30] c0000000000098e8 ret_from_kernel_thread+0x5c/0x74

0:mon> t 0xc000001fff41b5a0
[c000001fff41b5a0] c0000000000fc3ac try_to_wake_up+0x5c/0x6c0
[c000001fff41b630] d0000000166e0fcc vhost_work_queue+0x6c/0x90 [vhost]
[c000001fff41b660] d0000000166e205c vhost_poll_wakeup+0x3c/0x50 [vhost]
[c000001fff41b680] c00000000011e6d4 __wake_up_common+0x84/0xf0
[c000001fff41b6e0] c00000000011ee98 __wake_up_sync_key+0x68/0xa0
[c000001fff41b730] c0000000008defd0 sock_def_readable+0x80/0x180
[c000001fff41b760] d0000000167438ac tun_net_xmit+0x52c/0x5d0 [tun]
[c000001fff41b7b0] c000000000906dd8 dev_hard_start_xmit+0x178/0x450
[c000001fff41b870] c000000000938d6c sch_direct_xmit+0x11c/0x250
[c000001fff41b910] c000000000907840 __dev_queue_xmit+0x560/0x8c0
[c000001fff41b9c0] d000000016553c80 br_dev_queue_push_xmit+0xb0/0x250 [bridge]
[c000001fff41ba00] d000000016553e54 br_forward_finish+0x34/0xe0 [bridge]
[c000001fff41ba80] d000000016554044 __br_forward.isra.0+0x144/0x220 [bridge]
[c000001fff41baf0] d000000016555e78 br_handle_frame_finish+0x148/0x570 [bridge]
[c000001fff41bb80] d00000001655654c br_handle_frame+0x24c/0x400 [bridge]
[c000001fff41bc10] c0000000008fed28 __netif_receive_skb_core+0x498/0xc60
[c000001fff41bcf0] c000000000904314 process_backlog+0x114/0x220
[c000001fff41bd60] c00000000090343c net_rx_action+0x36c/0x520
[c000001fff41be70] c0000000000c5068 __do_softirq+0x258/0x5b0
[c000001fff41bf90] c000000000028bd0 call_do_softirq+0x14/0x24
[c000000fc675b950] c0000000000155f8 do_softirq_own_stack+0x58/0xa0
[c000000fc675b990] c0000000000c4cb4 do_softirq.part.3+0x74/0xa0
[c000000fc675b9c0] c0000000008fe674 netif_rx_ni+0x1d4/0x230
[c000000fc675ba00] d0000000167415f8 tun_get_user+0x408/0xbc0 [tun]
[c000000fc675baf0] d000000016741e04 tun_sendmsg+0x54/0x90 [tun]
[c000000fc675bb30] d0000000167a1f68 handle_tx+0x298/0x5f0 [vhost_net]
[c000000fc675bc80] d0000000166e04e0 vhost_worker+0x120/0x1d0 [vhost]
[c000000fc675bd00] c0000000000ebad8 kthread+0x118/0x140
[c000000fc675be30] c0000000000098e8 ret_from_kernel_thread+0x5c/0x74
0:mon>

--
Alexey