On 21.08.2022 18:14, Marek Marczykowski-Górecki wrote:
> On Sat, Oct 09, 2021 at 06:28:17PM +0200, Marek Marczykowski-Górecki wrote:
>> On Sun, Jan 31, 2021 at 03:15:30AM +0100, Marek Marczykowski-Górecki wrote:
>>> On Tue, Sep 29, 2020 at 05:27:48PM +0200, Jürgen Groß wrote:
>>>> On 29.09.20 17:16, Marek Marczykowski-Górecki wrote:
>>>>> On Tue, Sep 29, 2020 at 05:07:11PM +0200, Jürgen Groß wrote:
>>>>>> On 29.09.20 16:27, Marek Marczykowski-Górecki wrote:
>>>>>>> On Mon, Mar 23, 2020 at 01:09:49AM +0100, Marek Marczykowski-Górecki wrote:
>>>>>>>> On Thu, Mar 19, 2020 at 01:28:10AM +0100, Dario Faggioli wrote:
>>>>>>>>> [Adding Juergen]
>>>>>>>>>
>>>>>>>>> On Wed, 2020-03-18 at 23:10 +0100, Marek Marczykowski-Górecki wrote:
>>>>>>>>>> On Wed, Mar 18, 2020 at 02:50:52PM +0000, Andrew Cooper wrote:
>>>>>>>>>>> On 18/03/2020 14:16, Marek Marczykowski-Górecki wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> In my test setup (inside KVM with nested virt enabled), I rather
>>>>>>>>>>>> frequently get Xen crash on resume from S3. Full message below.
>>>>>>>>>>>>
>>>>>>>>>>>> This is Xen 4.13.0, with some patches, including "sched: fix
>>>>>>>>>>>> resuming from S3 with smt=0".
>>>>>>>>>>>>
>>>>>>>>>>>> Contrary to the previous issue, this one does not happen always - I
>>>>>>>>>>>> would say in about 40% cases on this setup, but very rarely on
>>>>>>>>>>>> physical setup.
>>>>>>>>>>>>
>>>>>>>>>>>> This is _without_ core scheduling enabled, and also with smt=off.
>>>>>>>>>>>>
>>>>>>>>>>>> Do you think it would be any different on xen-unstable? I cat try,
>>>>>>>>>>>> but it isn't trivial in this setup, so I'd ask first.
>>>>>>>>>>>>
>>>>>>>>> Well, Juergen has fixed quite a few issues.
>>>>>>>>>
>>>>>>>>> Most of them where triggering with core-scheduling enabled, and I don't
>>>>>>>>> recall any of them which looked similar or related to this.
>>>>>>>>>
>>>>>>>>> Still, it's possible that the same issue causes different symptoms, and
>>>>>>>>> hence that maybe one of the patches would fix this too.
>>>>>>>>
>>>>>>>> I've tested on master (d094e95fb7c), and reproduced exactly the same crash
>>>>>>>> (pasted below for the completeness).
>>>>>>>> But there is more: additionally, in most (all?) cases after resume I've got
>>>>>>>> soft lockup in Linux dom0 in smp_call_function_single() - see below. It
>>>>>>>> didn't happened before and the only change was Xen 4.13 -> master.
>>>>>>>>
>>>>>>>> Xen crash:
>>>>>>>>
>>>>>>>> (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at credit2.c:2133
>>>>>>>
>>>>>>> Juergen, any idea about this one? This is also happening on the current
>>>>>>> stable-4.14 (28855ebcdbfa).
>>>>>>>
>>>>>>
>>>>>> Oh, sorry I didn't come back to this issue.
>>>>>>
>>>>>> I suspect this is related to stop_machine_run() being called during
>>>>>> suspend(), as I'm seeing very sporadic issues when offlining and then
>>>>>> onlining cpus with core scheduling being active (it seems as if the
>>>>>> dom0 vcpu doing the cpu online activity sometimes is using an old
>>>>>> vcpu state).
>>>>>
>>>>> Note this is default Xen 4.14 start, so core scheduling is _not_ active:
>>>>
>>>> The similarity in the two failure cases is that multiple cpus are
>>>> affected by the operations during stop_machine_run().
>>>>
>>>>>
>>>>> (XEN) Brought up 2 CPUs
>>>>> (XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
>>>>> (XEN) Adding cpu 0 to runqueue 0
>>>>> (XEN) First cpu on runqueue, activating
>>>>> (XEN) Adding cpu 1 to runqueue 1
>>>>> (XEN) First cpu on runqueue, activating
>>>>>
>>>>>> I wasn't able to catch the real problem despite of having tried lots
>>>>>> of approaches using debug patches.
>>>>>>
>>>>>> Recently I suspected the whole problem could be somehow related to
>>>>>> RCU handling, as stop_machine_run() is relying on tasklets which are
>>>>>> executing in idle context, and RCU handling is done in idle context,
>>>>>> too. So there might be some kind of use after free scenario in case
>>>>>> some memory is freed via RCU despite it still being used by a tasklet.
>>>>>
>>>>> That sounds plausible, even though I don't really know this area of Xen.
>>>>>
>>>>>> I "just" need to find some time to verify this suspicion. Any help doing
>>>>>> this would be appreciated. :-)
>>>>>
>>>>> I do have a setup where I can easily-ish reproduce the issue. If there
>>>>> is some debug patch you'd like me to try, I can do that.
>>>>
>>>> Thanks. I might come back to that offer as you are seeing a crash which
>>>> will be much easier to analyze. Catching my error case is much harder as
>>>> it surfaces some time after the real problem in a non destructive way
>>>> (usually I'm seeing a failure to load a library in the program which
>>>> just did its job via exactly the library claiming not being loadable).
>>>
>>> Hi,
>>>
>>> I'm resurrecting this thread as it was recently mentioned elsewhere. I
>>> can still reproduce the issue on the recent staging branch (9dc687f155).
>>>
>>> It fails after the first resume (not always, but frequent enough to
>>> debug it). At least one guest needs to be running - with just (PV) dom0
>>> the crash doesn't happen (at least for the ~8 times in a row I tried).
>>> If the first resume works, the second (almost?) always will fail but
>>> with a different symptoms - dom0 kernel lockups (at least some of its
>>> vcpus). I haven't debugged this one yet at all.
>>>
>>> Any help will be appreciated, I can apply some debug patches, change
>>> configuration etc.
>>
>> This still happens on 4.14.3. Maybe it is related to freeing percpu
>> areas, as it caused other issues with suspend too? Just a thought...
>
> I have reproduced this on current staging(*). And I can reproduce it
> reliably. And also, I got (I believe) closely related crash with credit1
> scheduler.
>
> (*) It isn't plain staging, it's one with my xhci console patches on
> top, including attempt to make it survive S3. I believe the only
> relevant part there is sticking set_timer() into console resume path (or
> just having a timer with rather short delay registered). The actual tree
> at https://github.com/marmarek/xen/tree/master-xue2-debug, including
> quite a lot of debug prints and debug hacks.
>
> Specific crash with credit2:

Are you sure this is Credit2? Both ...

> (XEN) Assertion 'sched_unit_master(currunit) == cpu' failed at common/sched/credit.c:928

... here and ...

> (XEN) ----[ Xen-4.17-unstable x86_64 debug=y Tainted: C ]----
> (XEN) CPU: 0
> (XEN) RIP: e008:[<ffff82d0402434bf>] credit.c#csched_tick+0x2d4/0x494
> (XEN) RFLAGS: 0000000000010202 CONTEXT: hypervisor (d0v4)
> (XEN) rax: ffff82d0405c4298 rbx: 0000000000000002 rcx: 0000000000000002
> (XEN) rdx: ffff8302517f64d0 rsi: ffff8302515c0fc0 rdi: 0000000000000002
> (XEN) rbp: ffff830256227e38 rsp: ffff830256227de0 r8: 0000000000000004
> (XEN) r9: ffff8302517ac820 r10: ffff830251745068 r11: 00000088cb734887
> (XEN) r12: ffff83025174de50 r13: ffff8302515c0fa0 r14: ffff83025174df40
> (XEN) r15: ffff8302515c0cc0 cr0: 0000000080050033 cr4: 0000000000372660
> (XEN) cr3: 00000001bacbd000 cr2: 000077e5ec02a318
> (XEN) fsb: 000077e5fe533700 gsb: ffff888255700000 gss: 0000000000000000
> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
> (XEN) Xen code around <ffff82d0402434bf> (credit.c#csched_tick+0x2d4/0x494):
> (XEN) 01 00 00 e9 2a 01 00 00 <0f> 0b 0f 0b 0f 0b 48 8b 41 20 0f b7 00 89 45 cc
> (XEN) Xen stack trace from rsp=ffff830256227de0:
> (XEN) ffff830256227fff 0000000000000000 0000000256227e10 ffff82d04035be90
> (XEN) ffff830256227ef8 ffff830251745000 ffff82d0405c3280 ffff82d0402431eb
> (XEN) 0000000000000002 00000088c9ba9534 0000000000000000 ffff830256227e60
> (XEN) ffff82d04022ee53 ffff82d0405c3280 ffff8302963e1320 ffff8302515c0fc0
> (XEN) ffff830256227ea0 ffff82d04022f73f ffff830256227e80 ffff82d0405c9f00
> (XEN) ffffffffffffffff ffff82d0405c9f00 ffff830256227fff 0000000000000000
> (XEN) ffff830256227ed8 ffff82d04022d26c ffff830251745000 0000000000000000
> (XEN) 0000000000000000 ffff830256227fff 0000000000000000 ffff830256227ee8
> (XEN) ffff82d04022d2ff 00007cfda9dd80e7 ffff82d0402f03c6 ffff88810c005c00
> (XEN) 0000000000000031 0000000000000100 00000000fffffe00 0000000000000031
> (XEN) 0000000000000031 ffffffff82d45d28 0000000000000e2e 0000000000000000
> (XEN) 0000000000000032 00000000ffffef31 0000000000000000 ffff88812244a700
> (XEN) 0000000000000005 ffff88812244a780 000000fa00000000 ffffffff818db55f
> (XEN) 000000000000e033 0000000000000246 ffffc900409b7c50 000000000000e02b
> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) 0000e01000000000 ffff830251745000 0000000000000000 0000000000372660
> (XEN) 0000000000000000 800000025620b002 000e030300000001 0000000000000000
> (XEN) Xen call trace:
> (XEN) [<ffff82d0402434bf>] R credit.c#csched_tick+0x2d4/0x494
> (XEN) [<ffff82d04022ee53>] F timer.c#execute_timer+0x45/0x5c
> (XEN) [<ffff82d04022f73f>] F timer.c#timer_softirq_action+0x71/0x278
> (XEN) [<ffff82d04022d26c>] F softirq.c#__do_softirq+0x94/0xbe
> (XEN) [<ffff82d04022d2ff>] F do_softirq+0x13/0x15
> (XEN) [<ffff82d0402f03c6>] F x86_64/entry.S#process_softirqs+0x6/0x20

... here the only references are to credit.c, i.e. Credit1 code.

Jan