> If you're crashing the box my guess would be there's a bug in the cake
> qdisc somewhere. What happens if you run SQM with fq_codel instead?
I switched over to fq_codel + simple.qos two days ago. At first, the errors that had been appearing frequently were gone and everything seemed fine. However, after ~30 h and ~4 GB of traffic, I got a single event again:

> [140116.830000] INFO: rcu_sched self-detected stall on CPU
> [140116.830000] INFO: rcu_sched detected stalls on CPUs/tasks:
> [140116.830000] 0-...: (1 GPs behind) idle=101/2/0 softirq=766722/766730 fqs=0
> [140116.830000]
> [140116.830000] (detected by 1, t=14267 jiffies, g=285924, c=285923, q=448)
> [140116.830000] Task dump for CPU 0:
> [140116.830000] swapper/0 R
> [140116.830000] running task 0 0 0 0x00100000
> Stack :
> [140116.830000] 804affe0 00000400 00000000 7ac136ec 00007f6e ffffffff 00007009 771202c0
> [140116.830000] 804bb48c 00000001 8045bca0 804c0000 00000001 8ffc39dc 00000000 00000000
> [140116.830000] 00000000 8000c1cc 11000403 00000003 804ae000 804afea8 bfbf0000 80062e74
> [140116.830000] 11000403 00000003 00000001 804c0000 d0800400 8000c1e4 80520000 804c0000
> [140116.830000] 80520000 803c5fcc 80520000 804c0000 80520000 80505ce4 80520000 804dfbe4
> [140116.830000] ...Call Trace:
> [140116.830000] [<803c7c98>] __schedule+0x5d4/0x7a4
> [140116.830000] [<8000c1cc>] r4k_wait_irqoff+0x0/0x20
> [140116.830000] rcu_sched kthread starved for 14267 jiffies! g285924 c285923 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
> [140116.830000] rcu_sched S
> [140116.830000] 0 7 2 0x00100000
> Stack :
> [140116.830000] 804bb5f4 8fc52340 81235bc0 00000000 81235bc0 00000000 81235bc0 8050fbc0
> [140116.830000] 8121c320 00d52039 8121c320 8fc6be50 804c0000 00000001 804c0000 804c0000
> [140116.830000] 804c35b0 803c7ed4 00d52039 804c0000 8fc6be50 8121c320 00d52039 803ca838
> [140116.830000] 804bb5f4 00000001 804c3480 804c35b0 804c0000 00000001 00000000 8121c460
> [140116.830000] 00d52039 8007b964 8fc52340 0e800001 804c3480 00000001 804c0000 00000000
> [140116.830000] ...Call Trace:
> [140116.830000] [<803c7c98>] __schedule+0x5d4/0x7a4
> [140116.830000] [<803c7ed4>] schedule+0x6c/0x84
> [140116.830000] [<803ca838>] schedule_timeout+0x160/0x19c
> [140116.830000] [<80078ea0>] rcu_gp_kthread+0x7f4/0x7fc
> [140116.830000] [<80044b98>] kthread+0xd8/0xec
> [140116.830000] [<8000a318>] ret_from_kernel_thread+0x14/0x1c
> [140116.830000] 0-...: (1 GPs behind) idle=101/2/0 softirq=766722/766730 fqs=1
> [140116.830000] (t=14267 jiffies g=285924 c=285923 q=448)
> [140116.830000] Task dump for CPU 0:
> [140116.830000] swapper/0 R running task 0 0 0 0x00100004
> [140116.830000] Stack : 00000000 800694d0 00000000 00000000 00000000 800694d0 0000001d 00000006
> [140116.830000] 00000006 804c0000 00000000 00000000 00000000 00000000 00000000 80520000
> [140116.830000] 00000000 804bdea0 804bb490 804c0000 804bb490 00000000 00000000 00000000
> [140116.830000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [140116.830000] 00000000 00000000 00000000 00000000 00000000 0004091f 00000000 804bdea0
> [140116.830000] ...
> [140116.830000] Call Trace:
> [140116.830000] [<8000f640>] show_stack+0x50/0x84
> [140116.830000] [<800a40b0>] rcu_dump_cpu_stacks+0xdc/0x110
> [140116.830000] [<8007981c>] rcu_check_callbacks+0x2cc/0x7c4
> [140116.830000] [<8007bd60>] update_process_times+0x34/0x70
> [140116.830000] [<8008c6a8>] tick_sched_timer+0x238/0x2a0
> [140116.830000] [<8007cbec>] __hrtimer_run_queues+0x10c/0x1d4
> [140116.830000] [<8007ce3c>] hrtimer_interrupt+0xec/0x2ac
> [140116.830000] [<802afd5c>] gic_compare_interrupt+0x2c/0x40
> [140116.830000] [<8006fa90>] handle_percpu_devid_irq+0xc4/0x18c
> [140116.830000] [<8006a7bc>] generic_handle_irq+0x24/0x3c
> [140116.830000] [<802039d8>] gic_handle_local_int+0x94/0xd4
> [140116.830000] [<80203b94>] gic_irq_dispatch+0x10/0x20
> [140116.830000] [<8006a7bc>] generic_handle_irq+0x24/0x3c
> [140116.830000] [<8000c2c8>] do_IRQ+0x1c/0x34
> [140116.830000] [<80202c80>] plat_irq_dispatch+0xb4/0xdc
> [140116.830000] [<8000a820>] except_vec_vi_end+0xb4/0xc0

@Paul: Yes, FS#764 really does seem to be related; it involves the same CPU.

From what I've seen on the device here, both 4.4.71 and 4.9.31 are affected. With cake it happens rather frequently; with fq_codel, about once every two days. I'll keep it running with fq_codel and see when the next error is triggered.

Regards,
P. Wassi

>
> -Toke

_______________________________________________
Lede-dev mailing list
Lede-dev@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/lede-dev