Update some findings with extra triage effort...
Detailed log could be found in the attachments.
1. Confirm stock Debian 11.2 kernel (5.10) shares the same syndrome..
2. With loglvl=all, it reveals why the mapping failure happens, looks
like it comes from some duplicated mapping..
(XEN) memory_map:add: dom1 gfn=f3074 mfn=a2610 nr=2
(XEN) memory_map:add: dom1 gfn=f3077 mfn=a2615 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2616 nr=1 <===========Here
(XEN) ioport_map:add: dom1 gport=c140 mport=4060 nr=20
(XEN) ioport_map:add: dom1 gport=c170 mport=4090 nr=8
(XEN) ioport_map:add: dom1 gport=c178 mport=4080 nr=4
(XEN) memory_map:add: dom1 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom1 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2504 nr=1 <===========Here
(XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
(XEN) domain_crash called from p2m.c:1301
(XEN) Domain 1 reported crashed by domain 0 on cpu#2:
(XEN) memory_map:fail: dom1 gfn=f3078 mfn=a2504 nr=1 ret:-1
(XEN) memory_map:remove: dom1 gfn=f3078 mfn=a2504 nr=1

3. Recompiled kernel with DEBUG enabled for xen_pciback driver and
play with xl pci-assignable-XXX with it
3.1 It's confirmed that the DPC / AER error log happens only when
xen_pciback attempts to seize && release the device
3.1.1 It only happens on each of the first add / remove operations.
3.2 There is still a 'MSI-X preparation failed' message later-on, but
otherwise it appears to be successful to add / remove the device after
the 1st attempt.
3.3 Not necessarily related, but the DPC / AER log looks similar to
this report [1]


[1]: 
https://patchwork.kernel.org/project/linux-pci/patch/20220127025418.1989642-1-kai.heng.f...@canonical.com/#24713767
PS: Attempting to fix the line-wrapping issue below... Have no idea
what happened about the formatting....

On Sun, Jul 3, 2022 at 1:43 AM G.R. <firemet...@users.sourceforge.net> wrote:
>
> Hi everybody,
>
> I run into problems passing through a SN570 NVME SSD to a HVM guest.
> So far I have no idea if the problem is with this specific SSD or with
> the CPU + motherboard combination or the SW stack.
> Looking for some suggestions on troubleshooting.
>
> List of build info:
> CPU+motherboard: E-2146G + Gigabyte C246N-WU2
> XEN version: 4.14.3
> Dom0: Linux Kernel 5.10 (built from Debian 11.2 kernel source package)
> The SN570 SSD sits here in the PCI tree:
>            +-1d.0-[05]----00.0  Sandisk Corp Device 501a
>
> Syndromes observed:
> With ASPM enabled, pciback has problem seizing the device.
>
> Jul  2 00:36:54 gaia kernel: [    1.648270] pciback 0000:05:00.0: 
> xen_pciback: seizing device
> ...
> Jul  2 00:36:54 gaia kernel: [    1.768646] pcieport 0000:00:1d.0: AER: 
> enabled with IRQ 150
> Jul  2 00:36:54 gaia kernel: [    1.768716] pcieport 0000:00:1d.0: DPC: 
> enabled with IRQ 150
> Jul  2 00:36:54 gaia kernel: [    1.768717] pcieport 0000:00:1d.0: DPC: error 
> containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO 
> Log 4, DL_ActiveErr+
> ...
> Jul  2 00:36:54 gaia kernel: [    1.770039] xen: registering gsi 16 
> triggering 0 polarity 1
> Jul  2 00:36:54 gaia kernel: [    1.770041] Already setup the GSI :16
> Jul  2 00:36:54 gaia kernel: [    1.770314] pcieport 0000:00:1d.0: DPC: 
> containment event, status:0x1f11 source:0x0000
> Jul  2 00:36:54 gaia kernel: [    1.770315] pcieport 0000:00:1d.0: DPC: 
> unmasked uncorrectable error detected
> Jul  2 00:36:54 gaia kernel: [    1.770320] pcieport 0000:00:1d.0: PCIe Bus 
> Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> Jul  2 00:36:54 gaia kernel: [    1.770371] pcieport 0000:00:1d.0: device 
> [8086:a330] error status/mask=00200000/00010000
> Jul  2 00:36:54 gaia kernel: [    1.770413] pcieport 0000:00:1d.0: [21] 
> ACSViol                (First)
> Jul  2 00:36:54 gaia kernel: [    1.770466] pciback 0000:05:00.0: 
> xen_pciback: device is not found/assigned
> Jul  2 00:36:54 gaia kernel: [    1.920195] pciback 0000:05:00.0: 
> xen_pciback: device is not found/assigned
> Jul  2 00:36:54 gaia kernel: [    1.920260] pcieport 0000:00:1d.0: AER: 
> device recovery successful
> Jul  2 00:36:54 gaia kernel: [    1.920263] pcieport 0000:00:1d.0: DPC: 
> containment event, status:0x1f01 source:0x0000
> Jul  2 00:36:54 gaia kernel: [    1.920264] pcieport 0000:00:1d.0: DPC: 
> unmasked uncorrectable error detected
> Jul  2 00:36:54 gaia kernel: [    1.920267] pciback 0000:05:00.0: 
> xen_pciback: device is not found/assigned
> Jul  2 00:36:54 gaia kernel: [    1.938406] xen: registering gsi 16 
> triggering 0 polarity 1
> Jul  2 00:36:54 gaia kernel: [    1.938408] Already setup the GSI :16
> Jul  2 00:36:54 gaia kernel: [    1.938666] xen_pciback: backend is vpci
> ...
> Jul  2 00:43:48 gaia kernel: [  420.231955] pcieport 0000:00:1d.0: DPC: 
> containment event, status:0x1f01 source:0x0000
> Jul  2 00:43:48 gaia kernel: [  420.231961] pcieport 0000:00:1d.0: DPC: 
> unmasked uncorrectable error detected
> Jul  2 00:43:48 gaia kernel: [  420.231993] pcieport 0000:00:1d.0: PCIe Bus 
> Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester 
> ID)
> Jul  2 00:43:48 gaia kernel: [  420.235775] pcieport 0000:00:1d.0: device 
> [8086:a330] error status/mask=00100000/00010000
> Jul  2 00:43:48 gaia kernel: [  420.235779] pcieport 0000:00:1d.0: [20] 
> UnsupReq               (First)
> Jul  2 00:43:48 gaia kernel: [  420.235783] pcieport 0000:00:1d.0: AER:   TLP 
> Header: 34000000 05000010 00000000 88458845
> Jul  2 00:43:48 gaia kernel: [  420.235819] pci 0000:05:00.0: AER: can't 
> recover (no error_detected callback)
> Jul  2 00:43:48 gaia kernel: [  420.384349] pcieport 0000:00:1d.0: AER: 
> device recovery successful
> ... // The following might relate to an attempt to assign the device to 
> guest, not very sure...
> Jul  2 00:46:06 gaia kernel: [  559.147333] pciback 0000:05:00.0: 
> xen_pciback: seizing device
> Jul  2 00:46:06 gaia kernel: [  559.147435] pciback 0000:05:00.0: enabling 
> device (0000 -> 0002)
> Jul  2 00:46:06 gaia kernel: [  559.147508] xen: registering gsi 16 
> triggering 0 polarity 1
> Jul  2 00:46:06 gaia kernel: [  559.147511] Already setup the GSI :16
> Jul  2 00:46:06 gaia kernel: [  559.147558] pciback 0000:05:00.0: 
> xen_pciback: MSI-X preparation failed (-6)
>
>
> With pcie_aspm=off, the error log related to pciback goes away.
> But I suspect there are still some problems hidden -- since I don't
> see any AER enabled messages so errors may be hidden.
> I have the xen_pciback built directly into the kernel and assigned the
> SSD to it in the kernel command-line.
> However, the result from pci-assignable-xxx commands are not very consistent:
>
> root@gaia:~# xl pci-assignable-list
> 0000:00:17.0
> 0000:05:00.0
> root@gaia:~# xl pci-assignable-remove 05:00.0
> libxl: error: libxl_pci.c:853:libxl__device_pci_assignable_remove: failed to 
> de-quarantine 0000:05:00.0 <===== Here!!!
> root@gaia:~# xl pci-assignable-add 05:00.0
> libxl: warning: libxl_pci.c:794:libxl__device_pci_assignable_add: 
> 0000:05:00.0 already assigned to pciback <==== Here!!!
> root@gaia:~# xl pci-assignable-remove 05:00.0
> root@gaia:~# xl pci-assignable-list 0000:00:17.0
> root@gaia:~# xl pci-assignable-add 05:00.0
> libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add: 
> 0000:05:00.0 not bound to a driver, will not be rebound.
> root@gaia:~# xl pci-assignable-list
> 0000:00:17.0
> 0000:05:00.0
>
>
> After the 'xl pci-assignable-list' appears to be self-consistent, creating VM 
> with the SSD assigned still leads to a guest crash:
> From qemu log:
> [00:06.0] xen_pt_region_update: Error: create new mem mapping failed! (err: 1)
> qemu-system-i386: terminating on signal 1 from pid 1192 (xl)
>
> From the 'xl dmesg' output:
> (XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
> (XEN) domain_crash called from p2m.c:1301
> (XEN) Domain 1 reported crashed by domain 0 on cpu#4:
> (XEN) memory_map:fail: dom1 gfn=f3078 mfn=a2504 nr=1 ret:-1
>
>
> Which of the three syndromes are more fundamental?
> 1. The DPC / AER error log
> 2. The inconsistency in 'xl pci-assignable-list' state tracking
> 3. The GFN mapping failure on guest setup
>
> Any suggestions for the next step?
>
>
> Thanks,
> G.R.
(XEN) HVM d1v0 save: CPU
(XEN) HVM d1v1 save: CPU
(XEN) HVM d1 save: PIC
(XEN) HVM d1 save: IOAPIC
(XEN) HVM d1v0 save: LAPIC
(XEN) HVM d1v1 save: LAPIC
(XEN) HVM d1v0 save: LAPIC_REGS
(XEN) HVM d1v1 save: LAPIC_REGS
(XEN) HVM d1 save: PCI_IRQ
(XEN) HVM d1 save: ISA_IRQ
(XEN) HVM d1 save: PCI_LINK
(XEN) HVM d1 save: PIT
(XEN) HVM d1 save: RTC
(XEN) HVM d1 save: HPET
(XEN) HVM d1 save: PMTIMER
(XEN) HVM d1v0 save: MTRR
(XEN) HVM d1v1 save: MTRR
(XEN) HVM d1 save: VIRIDIAN_DOMAIN
(XEN) HVM d1v0 save: CPU_XSAVE
(XEN) HVM d1v1 save: CPU_XSAVE
(XEN) HVM d1v0 save: VIRIDIAN_VCPU
(XEN) HVM d1v1 save: VIRIDIAN_VCPU
(XEN) HVM d1v0 save: VMCE_VCPU
(XEN) HVM d1v1 save: VMCE_VCPU
(XEN) HVM d1v0 save: TSC_ADJUST
(XEN) HVM d1v1 save: TSC_ADJUST
(XEN) HVM d1v0 save: CPU_MSR
(XEN) HVM d1v1 save: CPU_MSR
(XEN) HVM1 restore: CPU 0
(d1) HVM Loader
(d1) Detected Xen v4.14.3
(d1) Xenbus rings @0xfeffc000, event channel 1
(d1) System requested SeaBIOS
(d1) CPU speed is 3505 MHz
(d1) Relocating guest memory for lowmem MMIO space disabled
(d1) PCI-ISA link 0 routed to IRQ5
(d1) PCI-ISA link 1 routed to IRQ10
(d1) PCI-ISA link 2 routed to IRQ11
(d1) PCI-ISA link 3 routed to IRQ5
(d1) pci dev 01:3 INTA->IRQ10
(d1) pci dev 02:0 INTA->IRQ11
(d1) pci dev 04:0 INTA->IRQ5
(d1) pci dev 05:0 INTA->IRQ10
(d1) pci dev 06:0 INTA->IRQ11
(d1) RAM in high memory; setting high_mem resource base to 40f800000
(d1) pci dev 03:0 bar 10 size 002000000: 0f0000008
(d1) pci dev 02:0 bar 14 size 001000000: 0f2000008
(d1) pci dev 04:0 bar 30 size 000040000: 0f3000000
(d1) pci dev 04:0 bar 10 size 000020000: 0f3040000
(d1) pci dev 03:0 bar 30 size 000010000: 0f3060000
(d1) pci dev 06:0 bar 10 size 000004000: 0f3070004
(d1) pci dev 05:0 bar 10 size 000002000: 0f3074000
(d1) pci dev 03:0 bar 14 size 000001000: 0f3076000
(d1) pci dev 05:0 bar 24 size 000000800: 0f3077000
(d1) pci dev 02:0 bar 10 size 000000100: 00000c001
(d1) pci dev 05:0 bar 14 size 000000100: 0f3077800
(d1) pci dev 06:0 bar 20 size 000000100: 0f3077904
(d1) pci dev 04:0 bar 14 size 000000040: 00000c101
(d1) pci dev 05:0 bar 20 size 000000020: 00000c141
(d1) pci dev 01:1 bar 20 size 000000010: 00000c161
(d1) pci dev 05:0 bar 18 size 000000008: 00000c171
(d1) pci dev 05:0 bar 1c size 000000004: 00000c179
(XEN) memory_map:add: dom1 gfn=f3074 mfn=a2610 nr=2
(XEN) memory_map:add: dom1 gfn=f3077 mfn=a2615 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2616 nr=1
(XEN) ioport_map:add: dom1 gport=c140 mport=4060 nr=20
(XEN) ioport_map:add: dom1 gport=c170 mport=4090 nr=8
(XEN) ioport_map:add: dom1 gport=c178 mport=4080 nr=4
(XEN) memory_map:add: dom1 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom1 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2504 nr=1
(XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
(XEN) domain_crash called from p2m.c:1301
(XEN) Domain 1 reported crashed by domain 0 on cpu#2:
(XEN) memory_map:fail: dom1 gfn=f3078 mfn=a2504 nr=1 ret:-1
(XEN) memory_map:remove: dom1 gfn=f3078 mfn=a2504 nr=1

root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add: 0000:05:00.0 not bound to a driver, will not be rebound.

[  323.448115] xen_pciback: wants to seize 0000:05:00.0
[  323.448136] pciback 0000:05:00.0: xen_pciback: probing...
[  323.448137] pciback 0000:05:00.0: xen_pciback: seizing device
[  323.448162] pciback 0000:05:00.0: xen_pciback: pcistub_device_alloc
[  323.448162] pciback 0000:05:00.0: xen_pciback: initializing...
[  323.448163] pciback 0000:05:00.0: xen_pciback: initializing config
[  323.448344] pciback 0000:05:00.0: xen_pciback: enabling device
[  323.448425] xen: registering gsi 16 triggering 0 polarity 1
[  323.448428] Already setup the GSI :16
[  323.448497] pciback 0000:05:00.0: xen_pciback: save state of device
[  323.448642] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[  323.448707] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
[  323.448730] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
[  323.448760] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[  323.448786] pcieport 0000:00:1d.0:   device [8086:a330] error status/mask=00200000/00010000
[  323.448813] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
[  324.690979] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
[  325.730706] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
[  327.997638] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
[  332.264251] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
[  340.584320] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
[  357.010896] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
[  391.143951] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
[  392.249252] pciback 0000:05:00.0: xen_pciback: reset device
[  392.249392] pciback 0000:05:00.0: xen_pciback: xen_pcibk_error_detected(bus:5,devfn:0)
[  392.249393] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  392.397074] pciback 0000:05:00.0: xen_pciback: xen_pcibk_error_resume(bus:5,devfn:0)
[  392.397080] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  392.397284] pcieport 0000:00:1d.0: AER: device recovery successful

libxl: error: libxl_pci.c:835:libxl__device_pci_assignable_add: failed to quarantine 0000:05:00.0

root@gaia:~# xl pci-assignable-remove 05:00.0
libxl: error: libxl_pci.c:853:libxl__device_pci_assignable_remove: failed to de-quarantine 0000:05:00.0
root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:794:libxl__device_pci_assignable_add: 0000:05:00.0 already assigned to pciback
root@gaia:~# xl pci-assignable-remove 05:00.0
[  603.928039] pciback 0000:05:00.0: xen_pciback: removing
[  603.928041] pciback 0000:05:00.0: xen_pciback: found device to remove 
[  603.928042] pciback 0000:05:00.0: xen_pciback: pcistub_device_release
[  604.033372] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
[  604.033512] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
[  604.033631] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[  604.033758] pcieport 0000:00:1d.0:   device [8086:a330] error status/mask=00100000/00010000
[  604.033856] pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
[  604.033939] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 05000010 00000000 88458845
[  604.034059] pci 0000:05:00.0: AER: can't recover (no error_detected callback)
[  604.034421] xen_pciback: removed 0000:05:00.0 from seize list
[  604.182597] pcieport 0000:00:1d.0: AER: device recovery successful

root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add: 0000:05:00.0 not bound to a driver, will not be rebound.
[  667.582051] xen_pciback: wants to seize 0000:05:00.0
[  667.582130] pciback 0000:05:00.0: xen_pciback: probing...
[  667.582134] pciback 0000:05:00.0: xen_pciback: seizing device
[  667.582228] pciback 0000:05:00.0: xen_pciback: pcistub_device_alloc
[  667.582231] pciback 0000:05:00.0: xen_pciback: initializing...
[  667.582235] pciback 0000:05:00.0: xen_pciback: initializing config
[  667.582548] pciback 0000:05:00.0: xen_pciback: enabling device
[  667.582599] pciback 0000:05:00.0: enabling device (0000 -> 0002)
[  667.582912] xen: registering gsi 16 triggering 0 polarity 1
[  667.582923] Already setup the GSI :16
[  667.583061] pciback 0000:05:00.0: xen_pciback: MSI-X preparation failed (-6)
[  667.583148] pciback 0000:05:00.0: xen_pciback: save state of device
[  667.583569] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[  667.689656] pciback 0000:05:00.0: xen_pciback: reset device

root@gaia:~# xl pci-assignable-remove 05:00.0
[  720.957988] pciback 0000:05:00.0: xen_pciback: removing
[  720.957996] pciback 0000:05:00.0: xen_pciback: found device to remove 
[  720.957999] pciback 0000:05:00.0: xen_pciback: pcistub_device_release
[  721.065222] pciback 0000:05:00.0: xen_pciback: MSI-X release failed (-16)
[  721.065667] xen_pciback: removed 0000:05:00.0 from seize list

root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add: 0000:05:00.0 not bound to a driver, will not be rebound.

[  763.888631] xen_pciback: wants to seize 0000:05:00.0
[  763.888690] pciback 0000:05:00.0: xen_pciback: probing...
[  763.888691] pciback 0000:05:00.0: xen_pciback: seizing device
[  763.888716] pciback 0000:05:00.0: xen_pciback: pcistub_device_alloc
[  763.888717] pciback 0000:05:00.0: xen_pciback: initializing...
[  763.888717] pciback 0000:05:00.0: xen_pciback: initializing config
[  763.888804] pciback 0000:05:00.0: xen_pciback: enabling device
[  763.888885] xen: registering gsi 16 triggering 0 polarity 1
[  763.888889] Already setup the GSI :16
[  763.888949] pciback 0000:05:00.0: xen_pciback: MSI-X preparation failed (-6)
[  763.888977] pciback 0000:05:00.0: xen_pciback: save state of device
[  763.889126] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[  763.994206] pciback 0000:05:00.0: xen_pciback: reset device

root@gaia:~# xl pci-assignable-remove 05:00.0
[  819.491000] pciback 0000:05:00.0: xen_pciback: removing
[  819.491002] pciback 0000:05:00.0: xen_pciback: found device to remove 
[  819.491003] pciback 0000:05:00.0: xen_pciback: pcistub_device_release
[  819.596113] pciback 0000:05:00.0: xen_pciback: MSI-X release failed (-16)
[  819.596466] xen_pciback: removed 0000:05:00.0 from seize list

Reply via email to