On 10/26/20 at 6:54 PM, Dario Faggioli wrote:
On Mon, 2020-10-26 at 17:11 +0100, Frédéric Pierret wrote:
On 10/26/20 at 2:54 PM, Andrew Cooper wrote:
If anyone has any idea of what's going on, that would be very much
appreciated. Thank you.

Does booting Xen with `sched=credit` make a difference?

~Andrew

Thank you, Andrew. Since your mail I have been testing this in
production, and it is clearly more stable than this morning. I will not
say it is solved yet, because yesterday I also had a few hours of
stability, but it is clearly encouraging, because this morning it
was just hell every 15 to 30 minutes.

Ok, yes, let us know if the credit scheduler seems to not suffer from
the issue.


Unfortunately, I had a few hours of stability, but it eventually ended up with:

```
[15883.967829] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[15883.967868] rcu:     12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=14879
[15883.967884]  (detected by 0, t=60002 jiffies, g=460221, q=89)
[15883.967901] Sending NMI from CPU 0 to CPUs 12:
[15893.970590] rcu: rcu_sched kthread starved for 9994 jiffies! g460221 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=9
[15893.970622] rcu: RCU grace-period kthread stack dump:
[15893.970631] rcu_sched       R  running task        0    10      2 0x80004008
[15893.970645] Call Trace:
[15893.970658]  ? xen_hypercall_xen_version+0xa/0x20
[15893.970670]  ? xen_force_evtchn_callback+0x9/0x10
[15893.970679]  ? check_events+0x12/0x20
[15893.970687]  ? xen_restore_fl_direct+0x1f/0x20
[15893.970697]  ? _raw_spin_unlock_irqrestore+0x14/0x20
[15893.970708]  ? force_qs_rnp+0x6f/0x170
[15893.970715]  ? rcu_nocb_unlock_irqrestore+0x30/0x30
[15893.970724]  ? rcu_gp_fqs_loop+0x234/0x2a0
[15893.970732]  ? rcu_gp_kthread+0xb5/0x140
[15893.970740]  ? rcu_gp_init+0x470/0x470
[15893.970748]  ? kthread+0x115/0x140
[15893.970756]  ? __kthread_bind_mask+0x60/0x60
[15893.970764]  ? ret_from_fork+0x35/0x40
[16063.972793] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16063.972825] rcu:     12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=57364
[16063.972840]  (detected by 5, t=240007 jiffies, g=460221, q=6439)
[16063.972855] Sending NMI from CPU 5 to CPUs 12:
[16243.977769] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16243.977802] rcu:     12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=99504
[16243.977817]  (detected by 11, t=420012 jiffies, g=460221, q=6710)
[16243.977830] Sending NMI from CPU 11 to CPUs 12:
[16253.980496] rcu: rcu_sched kthread starved for 10001 jiffies! g460221 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=9
[16253.980528] rcu: RCU grace-period kthread stack dump:
[16253.980537] rcu_sched       R  running task        0    10      2 0x80004008
[16253.980550] Call Trace:
[16253.980563]  ? xen_hypercall_xen_version+0xa/0x20
[16253.980575]  ? xen_force_evtchn_callback+0x9/0x10
[16253.980584]  ? check_events+0x12/0x20
[16253.980592]  ? xen_restore_fl_direct+0x1f/0x20
[16253.980602]  ? _raw_spin_unlock_irqrestore+0x14/0x20
[16253.980613]  ? force_qs_rnp+0x6f/0x170
[16253.980620]  ? rcu_nocb_unlock_irqrestore+0x30/0x30
[16253.980629]  ? rcu_gp_fqs_loop+0x234/0x2a0
[16253.980637]  ? rcu_gp_kthread+0xb5/0x140
[16253.980645]  ? rcu_gp_init+0x470/0x470
[16253.980653]  ? kthread+0x115/0x140
[16253.980661]  ? __kthread_bind_mask+0x60/0x60
[16253.980669]  ? ret_from_fork+0x35/0x40
[16423.982735] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16423.982789] rcu:     12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=139435
[16423.982820]  (detected by 10, t=600017 jiffies, g=460221, q=7354)
[16423.982842] Sending NMI from CPU 10 to CPUs 12:
[16433.984844] rcu: rcu_sched kthread starved for 10001 jiffies! g460221 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=3
[16433.984875] rcu: RCU grace-period kthread stack dump:
[16433.984885] rcu_sched       R  running task        0    10      2 0x80004000
[16433.984897] Call Trace:
[16433.984910]  ? xen_hypercall_xen_version+0xa/0x20
[16433.984922]  ? xen_force_evtchn_callback+0x9/0x10
[16433.984931]  ? check_events+0x12/0x20
[16433.984939]  ? xen_restore_fl_direct+0x1f/0x20
[16433.984949]  ? _raw_spin_unlock_irqrestore+0x14/0x20
[16433.984960]  ? force_qs_rnp+0x6f/0x170
[16433.984967]  ? rcu_nocb_unlock_irqrestore+0x30/0x30
[16433.984976]  ? rcu_gp_fqs_loop+0x234/0x2a0
[16433.984984]  ? rcu_gp_kthread+0xb5/0x140
[16433.984992]  ? rcu_gp_init+0x470/0x470
[16433.985000]  ? kthread+0x115/0x140
[16433.985007]  ? __kthread_bind_mask+0x60/0x60
[16433.985015]  ? ret_from_fork+0x35/0x40
[16603.987677] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16603.987710] rcu:     12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=179313
[16603.987725]  (detected by 0, t=780022 jiffies, g=460221, q=7869)
[16603.987740] Sending NMI from CPU 0 to CPUs 12:
[16783.992658] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16783.992710] rcu:     12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=219106
[16783.992741]  (detected by 13, t=960027 jiffies, g=460221, q=8300)
[16783.992768] Sending NMI from CPU 13 to CPUs 12:
[16793.995873] rcu: rcu_sched kthread starved for 10000 jiffies! g460221 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=4
[16793.995906] rcu: RCU grace-period kthread stack dump:
[16793.995915] rcu_sched       R  running task        0    10      2 0x80004000
[16793.995930] Call Trace:
[16793.995948]  ? xen_hypercall_xen_version+0xa/0x20
[16793.995963]  ? xen_force_evtchn_callback+0x9/0x10
[16793.995972]  ? check_events+0x12/0x20
[16793.995979]  ? xen_restore_fl_direct+0x1f/0x20
[16793.995992]  ? _raw_spin_unlock_irqrestore+0x14/0x20
[16793.996004]  ? force_qs_rnp+0x6f/0x170
[16793.996012]  ? rcu_nocb_unlock_irqrestore+0x30/0x30
[16793.996021]  ? rcu_gp_fqs_loop+0x234/0x2a0
[16793.996029]  ? rcu_gp_kthread+0xb5/0x140
[16793.996037]  ? rcu_gp_init+0x470/0x470
[16793.996046]  ? kthread+0x115/0x140
[16793.996054]  ? __kthread_bind_mask+0x60/0x60
[16793.996062]  ? ret_from_fork+0x35/0x40
```
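
For anyone wanting to compare such reports across boots, here is a minimal Python sketch that pulls the key fields out of the "detected by" and "Sending NMI" lines of a log like the one above. The function name `summarize_stalls` and the dictionary keys are only illustrative; the regular expressions are written against the line shapes shown in this mail and may need adjusting for other kernel versions.

```python
import re

# Matches e.g. "[15883.967884]  (detected by 0, t=60002 jiffies, g=460221, q=89)"
DETECTED = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\(detected by (?P<by>\d+), "
    r"t=(?P<t>\d+) jiffies, g=(?P<g>\d+), q=(?P<q>\d+)\)"
)
# Matches e.g. "[15883.967901] Sending NMI from CPU 0 to CPUs 12:"
NMI = re.compile(r"Sending NMI from CPU \d+ to CPUs (?P<cpus>[\d,\-]+):")


def summarize_stalls(log_text):
    """Return one dict per stall report found in log_text (hypothetical helper)."""
    reports = []
    for line in log_text.splitlines():
        m = DETECTED.search(line)
        if m:
            reports.append({
                "timestamp": float(m.group("ts")),
                "detected_by": int(m.group("by")),
                "stall_jiffies": int(m.group("t")),
                "grace_period": int(m.group("g")),
                "stalled_cpus": None,  # filled in from the following NMI line
            })
            continue
        m = NMI.search(line)
        if m and reports and reports[-1]["stalled_cpus"] is None:
            reports[-1]["stalled_cpus"] = m.group("cpus")
    return reports


# Two reports copied from the log above.
sample = """\
[15883.967884]  (detected by 0, t=60002 jiffies, g=460221, q=89)
[15883.967901] Sending NMI from CPU 0 to CPUs 12:
[16063.972840]  (detected by 5, t=240007 jiffies, g=460221, q=6439)
[16063.972855] Sending NMI from CPU 5 to CPUs 12:
"""

for report in summarize_stalls(sample):
    print(report)
```

Run over the full log above, this kind of summary makes the pattern easy to see: every report names CPU 12 and the same grace period (g=460221), while t grows from 60002 to 960027 jiffies.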

I'm curious about another thing, though. You mentioned, in your
previous email (and in the subject :-)) that this is a 4.13 -> 4.14
issue for you?

This has indeed been happening since I updated from Xen 4.13 to 4.14; 4.13 was
totally stable for me. The server ran for months without any issue.

Does that mean that the problem was not there on 4.13?

I'm asking because Credit2 was already the default scheduler in 4.13.

So, unless you were configuring things differently, you were already
using it there.
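
One way to confirm which scheduler a given host is actually running is to look at the `xen_scheduler` line printed by `xl info` in dom0. A minimal Python sketch follows, assuming `xl` is on the PATH and the script runs with sufficient privileges. The helper names are hypothetical, and the sample text is only an illustration of the `key : value` layout, not output captured from the machine discussed here.

```python
import subprocess


def scheduler_from_xl_info(xl_info_text):
    """Extract the xen_scheduler value from `xl info` output, or None."""
    for line in xl_info_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "xen_scheduler":
            return value.strip()
    return None


def current_scheduler():
    """Run `xl info` (dom0 only) and return the active scheduler name."""
    out = subprocess.run(
        ["xl", "info"], capture_output=True, text=True, check=True
    ).stdout
    return scheduler_from_xl_info(out)


# Illustrative input only; real `xl info` output has many more fields.
sample = "xen_major              : 4\nxen_scheduler          : credit2\n"
print(scheduler_from_xl_info(sample))
```

Comparing this value on the 4.13 and 4.14 installations would show whether the scheduler itself differs between the two setups.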

As far as I know, there is a new custom patch for S3 resume from Marek (in CC);
he is much better placed than I am to detail any specific changes with
respect to 4.13.

If this is the case, it would hint at the fact that something that
changed between .13 and .14 could be the cause.

Regards


Thank you again for your help.

