On Thu, Jul 7, 2022 at 11:24 PM G.R. <firemet...@users.sourceforge.net> wrote:
>
> On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeul...@suse.com> wrote:
> >
> > > Should I expect a debug build of the Xen hypervisor to give better
> > > diagnostic messages, without the debug patch that Roger mentioned?
> >
> > Well, "expect" is perhaps too much to say, but with problems like
> > yours (and even more so with multiple ones) using a debug
> > hypervisor (or kernel, if such a build mode existed) is imo
> > always a good idea. As is using as up-to-date a version as
> > possible.
>
> I built both 4.14.3 debug version and 4.16.1 release version for
> testing purposes.
> Unfortunately they gave me absolutely zero information, since neither of
> them is able to get past issue #1,
> the FLR-related DPC / AER issue.
> With the 4.16.1 release build, it actually survives the 'xl
> pci-assignable-add' that triggers the first AER failure.
> But 'xl pci-assignable-remove' then leads to an xl segmentation fault:
> >[  655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
> >[  655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
> Since I'll need a couple of pci-assignable-add &&
> pci-assignable-remove to get to a seemingly normal state, I cannot
> proceed from here.
>
> With 4.14.3 debug build, the hypervisor / dom0 reboots on 'xl
> pci-assignable-add'.
>
> [  574.623143] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
> [  574.623203] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
> [  574.623204] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
> [  574.623209] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> [  574.623240] pcieport 0000:00:1d.0:   device [8086:a330] error status/mask=00200000/00010000
> [  574.623261] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
> [  575.855026] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
> [  576.895015] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
> [  579.028311] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
> [  583.294910] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
> [  591.614965] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
> [  609.534502] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
> [  643.667069] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
> //<=======The reboot happens somewhere here, not immediately, but
> after a while...
> //Maybe I can get something from xl dmesg if I was quick enough and
> have connected from a second terminal...

Unfortunately, I didn't see anything in xl dmesg.
I wish 'xl dmesg' supported a follow mode like the Linux 'dmesg -w'.
As it is, I have to rerun the command manually. The machine suddenly
freezes after the 'giving up' message is printed.
I see nothing special in the log. Maybe I'm just not lucky enough to
catch the output; I'm not sure.
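
For reference, a follow mode can be roughly approximated by polling. The
sketch below is only a workaround under stated assumptions, not an xl
feature: it assumes 'xl dmesg' is on PATH and prints the whole console ring
on each call, and the helper names (new_lines, read_xl_dmesg, follow) and the
0.2 s interval are made up for illustration. Because the ring can wrap,
each new snapshot is matched against the tail of the previous one, and only
the lines after the longest overlap are printed. Repeated identical lines
can make that match ambiguous, and anything logged between the last poll and
the freeze is still lost.

```python
#!/usr/bin/env python3
"""Poll `xl dmesg` and print only unseen lines (a rough `dmesg -w`)."""
import subprocess
import time


def new_lines(prev, curr):
    """Return the lines of curr that follow its longest overlap with the tail of prev."""
    for k in range(len(prev) + 1):
        tail = prev[k:]
        # Smallest k first, so the longest matching tail wins.
        if curr[:len(tail)] == tail:
            return curr[len(tail):]
    return curr  # not reached: the empty tail (k == len(prev)) always matches


def read_xl_dmesg():
    """Run `xl dmesg` once and return its output as a list of lines."""
    out = subprocess.run(["xl", "dmesg"], capture_output=True, text=True, check=True)
    return out.stdout.splitlines()


def follow(interval=0.2):
    """Print new hypervisor log lines until interrupted (interval is an arbitrary choice)."""
    prev = []
    while True:
        curr = read_xl_dmesg()
        for line in new_lines(prev, curr):
            # Flush each line so it leaves this process immediately.
            print(line, flush=True)
        prev = curr
        time.sleep(interval)
```

To use it, append a call to follow() at the end of the file and run it as
root in dom0. Running it through ssh from another machine, so that the output
ends up on a host that does not freeze, may improve the chance of keeping the
last lines.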
