Hi Yuquan,

For your test, the first logging will come from the AER driver if everything is working correctly.
You may want to check whether the upstream PCI bridge's AER UIE/CIE masks are set. This could prevent the error from being handled by the OS's AER driver.

Regards,
Terry

On 3/6/24 11:12, Terry Bowman wrote:
> Hi Yuquan and Jon,
>
> I added responses inline below.
>
> On 3/6/24 07:23, Jonathan Cameron wrote:
>> On Wed, 6 Mar 2024 19:27:07 +0800
>> Yuquan Wang <wangyuquan1...@phytium.com.cn> wrote:
>>
>>> Hello, Jonathan
>>>
>>> Recently I met some problems on CXL RAS tests.
>>>
>>> I tried to use "cxl-inject-uncorrectable-errors" and "cxl-inject-correctable-error"
>>> qmp to inject CXL errors, however, there was no kernel print output in
>>> my qemu machine. And the qmp connection was unstable, which made the machine
>>> always "terminating on signal 2".
>>
>> The qmp connection being unstable is odd - might be related to the CXL code, but
>> I'm not sure how..
>>
>>>
>>> In addition, I successfully used the hmp "pcie_aer_inject_error" in the same conditions.
>>> The kernel showed relevant print information.
>>
>> IIRC the AER paths print under all circumstances whereas CXL errors do not, they simply
>> trigger tracepoints - but you should have seen device resets.
>>
>> However I spun up a test and I think the issue is more straightforward.
>> The uncorrectable internal error and correctable internal errors are masked on the device.
>> I thought we changed the default on this in linux but maybe not :(
>>
>
> Device AER UIE/CIE mask can be set and still expect to handle device AER errors.
> The device reports AER UIE/CIE to the root port/RCEC on behalf of device AER CRC,
> TLP, etc errors.
>
> In earlier changes we added logic to clear the RCEC UIE/CIE mask in order to properly
> receive AER UIE/CIE notifications from devices and RCH dports.
>
> "CXL Protocol and Link errors detected by components that are part of a CXL VH are
> escalated and reported using standard PCIe error reporting mechanisms over CXL.io as
> UIEs and/or CIEs. See PCIe Base Specification for details."[1]
>
> [1] CXL3.1 12.2.1 - Protocol and Link Layer Error Reporting
>
>> Hack is to find the relevant device with lspci -tv and then use
>> setpci -s 0d:00.0 0x208.l=0
>> to clear all the mask bits for uncorrectable errors.
>>
>> Note I tested this on a convenient arm64 setup so always possible there is yet
>> another problem on x86.
>>
>> Robert / Terry, I tracked down the patch where you enabled this for RCHs and there was
>> some discussion on walking out on VH as well to enable this, but seems it
>> never happened. Can you remember why? Just kicked back for a future occasion?
>>
>> Jonathan
>>
>
> I tested (qemu x86) using the aer-inject tool and found it to work. Below shows the
> endpoint CIE is masked (0xe000 @ AER+0x14) and the injected error is properly handled
> with root port logging and cxl_pci handler trace logs.
>
> # lspci | grep -i cxl
> 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
>
> # lspci -s 0d:00.0 -vvv | grep Advanced
>         Capabilities: [200 v2] Advanced Error Reporting
>
> # setpci -s 0d:00.0 0x208.l
> 02400000
>
> # setpci -s 0d:00.0 0x214.l
> 0000e000
>
> # cat aer-input.txt
> # Inject a correctable bad TLP error into the device with header log
> # words 0 1 2 3.
> #
> # Either specify the PCI id on the command-line option or uncomment and edit
> # the PCI_ID line below using the correct PCI ID.
> #
> # Note that system firmware/BIOS may mask certain errors and/or not report
> # header log words.
> #
> AER
> #PCI_ID 0000:0C.00.0
> COR_STATUS BAD_TLP
> HEADER_LOG 0 1 2 3
>
> # ./aer-inject -s 0000:0d:00.0 aer-input.txt
> [ 72.850686] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0d:00.0
> [ 72.851784] pcieport 0000:0c:00.0: AER: Corrected error received: 0000:0d:00.0
> [ 72.852594] cxl_pci 0000:0d:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
> [ 72.853591] cxl_pci 0000:0d:00.0: device [8086:0d93] error status/mask=00000040/0000e000
> # [ 72.854277] cxl_pci 0000:0d:00.0: [ 6] BadTLP
>
> I have not tried to use cxl-inject-uncorrectable-errors or
> cxl-inject-correctable-error.
>
> Regards,
> Terry
>
>>>
>>> Question:
>>> 1) Are my CXL RAS test operations standard?
>>> 2) The error injected by "pcie_aer_inject_error" is "protocol & link errors" of cxl.io?
>>> The error injected by "cxl-inject-uncorrectable-errors" or
>>> "cxl-inject-correctable-error" is "protocol & link errors" of cxl.cachemem?
>>>
>>> Hope I can get some help here, any help will be greatly appreciated.
>>>
>>>
>>> My qemu command line:
>>> qemu-system-x86_64 \
>>> -M q35,nvdimm=on,cxl=on \
>>> -m 4G \
>>> -smp 4 \
>>> -object memory-backend-ram,size=2G,id=mem0 \
>>> -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
>>> -object memory-backend-ram,size=2G,id=mem1 \
>>> -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
>>> -object memory-backend-ram,size=256M,id=cxl-mem0 \
>>> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>>> -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
>>> -device cxl-type3,bus=root_port0,volatile-memdev=cxl-mem0,id=cxl-mem0 \
>>> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k \
>>> -hda ../disk/ubuntu_x86_test_new.qcow2 \
>>> -nographic \
>>> -qmp tcp:127.0.0.1:4444,server,nowait \
>>>
>>> Qemu version: 8.2.50, the latest commit of branch cxl-2024-03-05 in
>>> "https://gitlab.com/jic23/qemu"
>>> Kernel version: 6.8.0-rc6
>>>
>>> My steps in the Qemu qmp:
>>> 1) telnet 127.0.0.1 4444
>>>
>>> result:
>>> Trying 127.0.0.1...
>>> Connected to 127.0.0.1.
>>> Escape character is '^]'.
>>> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 2, "major": 8},
>>> "package": "v6.2.0-19482-gccfb4fe221"}, "capabilities": ["oob"]}}
>>>
>>> 2) { "execute": "qmp_capabilities" }
>>>
>>> result:
>>> {"return": {}}
>>>
>>> 3) If inject correctable error:
>>> { "execute": "cxl-inject-correctable-error",
>>>   "arguments": {
>>>     "path": "/machine/peripheral/cxl-mem0",
>>>     "type": "physical"
>>> } }
>>>
>>> result:
>>> {"return": {}}
>>>
>>> 3) If inject uncorrectable error:
>>> { "execute": "cxl-inject-uncorrectable-errors",
>>>   "arguments": {
>>>     "path": "/machine/peripheral/cxl-mem0",
>>>     "errors": [
>>>       {
>>>         "type": "cache-address-parity",
>>>         "header": [ 3, 4]
>>>       },
>>>       {
>>>         "type": "cache-data-parity",
>>>         "header": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
>>>       },
>>>       {
>>>         "type": "internal",
>>>         "header": [ 1, 2, 4]
>>>       }
>>>     ]
>>> }}
>>>
>>> result:
>>> {"return": {}}
>>> {"timestamp": {"seconds": 1709721640, "microseconds": 275345}, "event":
>>> "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
>>>
>>> Many thanks
>>> Yuquan
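For anyone reading the setpci mask values in this thread, here is a rough sketch of a decoder for the AER Uncorrectable Error Mask (AER+0x08) and Correctable Error Mask (AER+0x14) registers. It is not part of any tool mentioned above; the function name decode_aer_mask and the two tables are made up for illustration. The bit positions are meant to match the PCI_ERR_UNC_* / PCI_ERR_COR_* definitions in the Linux pci_regs.h header, so please double-check them against the PCIe Base Specification for your revision.

```python
# Illustrative helper (hypothetical names): decode AER mask register values
# as printed by "setpci -s <bdf> <AER+0x08>.l" / "<AER+0x14>.l".

# Uncorrectable Error Mask bits (AER capability offset 0x08).
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
}

# Correctable Error Mask bits (AER capability offset 0x14).
COR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "Replay Num Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_aer_mask(value, names):
    """Return the names of all bits set in a 32-bit mask, lowest bit first."""
    return [
        names.get(bit, f"bit {bit} (unnamed)")
        for bit in range(32)
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    # Values taken from the setpci output quoted in this thread.
    print("UNCOR mask 0x02400000:", decode_aer_mask(0x02400000, UNCOR_BITS))
    print("COR mask   0x0000e000:", decode_aer_mask(0x0000E000, COR_BITS))
```

With the values from the thread, 0x02400000 has bits 22 and 25 set, so UIE is masked on the endpoint, and 0x0000e000 has bits 13 to 15 set, so CIE is masked as well. The injected 0x00000040 status is bit 6 (Bad TLP), which is not masked, consistent with the "[ 6] BadTLP" line in the log.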