On Tue, Mar 05, 2013 at 05:02:37PM +0800, Daniel J Blueman wrote:
> On 02/26/2013 12:32 AM, Paul E. McKenney wrote:
> >On Wed, Feb 20, 2013 at 11:35:57AM +0800, Daniel J Blueman wrote:
> >>On 20/02/2013 02:16, Paul E. McKenney wrote:
> >>>On Wed, Feb 20, 2013 at 12:34:12AM +0800, Daniel J Blueman wrote:
> >>>>Hi Paul,
> >>>>
> >>>>On some of our larger servers with many hundreds of cores and when
> >>>>under high duress, we can see scheduler RCU stall warnings [1], so we
> >>>>find we have to increase the hardcoded RCU_STALL_RAT_DELAY up from 2
> >>>>and RCU_JIFFIES_TILL_FORCE_QS up from 3.
> >
> >Disabling RCU_FAST_NO_HZ will likely remove the need to adjust
> >RCU_JIFFIES_TILL_FORCE_QS. Changes in my -rcu tree will likely remove
> >the need to adjust these two in 3.10 or 3.11, depending on how testing
> >goes.
> >
> >>>>Is there a more sustainable way to account for this and avoid
> >>>>hard-coding it, such as making it and the dependent timeouts a
> >>>>fraction of CONFIG_RCU_CPU_STALL_TIMEOUT?
> >
> >Maybe... But what this means is that your system is so heavily loaded
> >that the CPU in question is failing to make it to RCU's softirq handler
> >within two jiffies' worth of time. This is a function of workload
> >rather than of the number of CPUs.
> >
> >>>>On the other hand, perhaps this is just caused by clock jitter (e.g.,
> >>>>due to distance from a contended clock source)? If so, increasing
> >>>>these a bit may be adequate in general...
> >>>
> >>>Hmmm... What version of the kernel are you running?
> >>
> >>The example below occurs with v3.8, but we see the same with previous
> >>kernels, e.g., v3.5.
> >
> >There is always the rcutree.rcu_cpu_stall_timeout parameter, which sets
> >the stall timeout in seconds. This may be specified at boot time or via
> >sysfs at runtime. The default is now 21 seconds.
> >
> >>Of course, when using the local TSC, you'd see no jitter relative to
> >>coherent transactions (e.g., memory writes), but when the HPET is used
> >>across a large system, coherent transactions to distant cores are just
> >>so much faster, as there's massive congestion on the path to the shared
> >>HPET behind various HT and PCIe bridges. This could be where the jitter
> >>arises, if jitter is indeed the problem here.
> >
> >Agreed, timing jitter could cause problems. That said, the code uses
> >the jiffies counter to compute these timings. Are you seeing similar
> >memory contention on the jiffies counter itself?
>
> The contention we see in general arises when cores contend for a
> spinlock and when there are lots of concurrent HPET reads (Opterons
> allow only 4 outstanding reads to the IO hub).
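(For reference, both knobs discussed above are reachable from userspace.
Below is a minimal sketch in C, under two assumptions worth verifying:
that on a 3.8-era kernel the stall timeout, being a module parameter of
the built-in rcutree code, appears under /sys/module/rcutree/parameters/,
and that the boot-time equivalent is rcutree.rcu_cpu_stall_timeout=60 on
the kernel command line. The write requires root, of course.)

#include <stdio.h>

#define CLOCKSOURCE \
	"/sys/devices/system/clocksource/clocksource0/current_clocksource"
#define STALL_TIMEOUT \
	"/sys/module/rcutree/parameters/rcu_cpu_stall_timeout"

int main(void)
{
	char buf[64];
	FILE *f;

	/* Report whether this system is running on the TSC or on the
	 * (potentially contended) HPET. */
	f = fopen(CLOCKSOURCE, "r");
	if (f) {
		if (fgets(buf, sizeof(buf), f))
			printf("current clocksource: %s", buf);
		fclose(f);
	}

	/* Raise the stall timeout from the default 21s to 60s. */
	f = fopen(STALL_TIMEOUT, "w");
	if (!f) {
		perror(STALL_TIMEOUT);
		return 1;
	}
	fprintf(f, "60\n");
	fclose(f);
	return 0;
}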
Seems to me that the RCU CPU stalls are correctly detecting that your
platform has a problem, namely excessive contention on your HPET
hardware.

> It's probably possible to reproduce rcu_sched stalls on a quad-socket
> box with 64 cores and the right workload with the TSC disabled.
>
> In 3.9-rc1 with RCU_FAST_NO_HZ disabled, we've seen stalls of 4
> jiffies [2], but without the "Stall ended" message. This is with a
> workload that allocates ~256GB of memory over 192 cores.

Right, the "Stall ended" messages appear only when one CPU detects
another CPU's stall. In your case below, the stall is self-detected,
and that path does not perform the check. So I could add a similar
check in the self-detect case, if that is what you are getting at.

							Thanx, Paul

> Thanks,
>   Daniel
>
> >>>>--- [1]
> >>>>
> >>>>[ 3939.010085] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> >>>>(detected by 1, t=29662 jiffies, g=3053, c=3052, q=598)
> >>>>[ 3939.020008] INFO: Stall ended before state dump start
>
> --- [2]
>
> [10660.110620] INFO: rcu_sched self-detected stall on CPU { 39}
> (t=4 jiffies g=1169 c=1168 q=8)
> [10660.110620] Pid: 11747, comm: sp.B Not tainted 3.9.0-rc1-advanced #6
> [10660.110620] Call Trace:
> [10660.110620] <IRQ> [<ffffffff810b2a42>] ? rcu_check_callbacks+0x2d2/0x5f0
> [10660.110620] [<ffffffff8107d94a>] ? run_posix_cpu_timers+0x3a/0x790
> [10660.110620] [<ffffffff8106c86f>] ? update_process_times+0x3f/0x80
> [10660.110620] [<ffffffff81098280>] ? tick_sched_handle.isra.8+0x30/0x40
> [10660.110620] [<ffffffff810983b2>] ? tick_sched_timer+0x42/0x70
> [10660.110620] [<ffffffff8107e66a>] ? __run_hrtimer.isra.30+0x4a/0xe0
> [10660.110620] [<ffffffff8107ef45>] ? hrtimer_interrupt+0xe5/0x220
> [10660.110620] [<ffffffff8104c5a3>] ? smp_apic_timer_interrupt+0x63/0xa0
> [10660.110620] [<ffffffff8186e887>] ? apic_timer_interrupt+0x67/0x70
> --
> Daniel J Blueman
> Principal Software Engineer, Numascale Asia
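For concreteness, a rough standalone mock of the check being discussed
(plain C, not actual kernel code: the struct fields and function names
loosely follow 3.9-era kernel/rcutree.c, and the grace-period numbers
are taken from the log above):

#include <stdio.h>
#include <stdbool.h>

/* Mock of the relevant RCU state: a grace period is in progress when
 * the current grace-period number is ahead of the last completed one. */
struct rcu_state {
	unsigned long gpnum;		/* current grace-period number */
	unsigned long completed;	/* last completed grace period */
};

static bool rcu_gp_in_progress(struct rcu_state *rsp)
{
	return rsp->completed != rsp->gpnum;
}

/* The suggested check: in the self-detect path, notice that the grace
 * period has already ended and print the same "Stall ended" message
 * the other-CPU detection path prints, instead of dumping state. */
static void print_cpu_stall(struct rcu_state *rsp)
{
	if (!rcu_gp_in_progress(rsp)) {
		printf("INFO: Stall ended before state dump start\n");
		return;
	}
	printf("INFO: rcu_sched self-detected stall on CPU ...\n");
	/* ... stack dump would follow here ... */
}

int main(void)
{
	struct rcu_state over = { .gpnum = 1169, .completed = 1169 };
	struct rcu_state stalled = { .gpnum = 1169, .completed = 1168 };

	print_cpu_stall(&over);    /* stall already ended: short message */
	print_cpu_stall(&stalled); /* still stalled: full report */
	return 0;
}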