On Wed, 6 Mar 2024 19:27:07 +0800 Yuquan Wang <wangyuquan1...@phytium.com.cn> wrote:
> Hello, Jonathan
>
> Recently I met some problems on CXL RAS tests.
>
> I tried to use "cxl-inject-uncorrectable-errors" and
> "cxl-inject-correctable-error" qmp to inject CXL errors, however, there
> was no any kernel printing information in my qemu machine. And the qmp
> connection was unstable that made the machine always "terminating on
> signal 2".

The qmp connection being unstable is odd - it might be related to the CXL
code, but I'm not sure how.

> In addition, I successfully used the hmp "pcie_aer_inject_error" in the
> same conditions.
> The kernel showed relevant print information.

IIRC the AER paths print under all circumstances whereas CXL errors do not;
they simply trigger tracepoints - but you should have seen device resets.

However, I spun up a test and I think the issue is more straightforward.
The uncorrectable internal error and correctable internal error are masked
on the device. I thought we changed the default on this in Linux, but maybe
not :(

The hack is to find the relevant device with
  lspci -tv
and then use
  setpci -s 0d:00.0 0x208.l=0
to clear all the mask bits for uncorrectable errors.

Note I tested this on a convenient arm64 setup, so it is always possible
there is yet another problem on x86.

Robert / Terry, I tracked down the patch where you enabled this for RCHs
and there was some discussion on walking out on VH as well to enable
this, but it seems it never happened. Can you remember why? Just kicked
back for a future occasion?

Jonathan

> Question:
> 1) Is my CXL RAS test operations standard?
> 2) The error injected by "pcie_aer_inject_error" is "protocol & link
> errors" of cxl.io?
> The error injected by "cxl-inject-uncorrectable-errors" or
> "cxl-inject-correctable-error" is "protocol & link errors" of
> cxl.cachemem?
>
> Hope I can get some helps here, any help will be greatly appreciated.
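On the setpci workaround: if zeroing the whole mask register is more than
you want, a narrower variant might look like the sketch below. Treat it as
a sketch under assumptions, not something I have run in this form. It
reuses the BDF 0d:00.0 and the mask register offset 0x208 from my test
setup (check yours with lspci), it assumes bit 22 of the AER Uncorrectable
Error Mask register is the Uncorrectable Internal Error mask bit, and the
function names are made up for illustration.

```shell
#!/bin/sh
# Assumptions: BDF and register offset are taken from the test setup
# described in this thread and will likely differ on other machines.
BDF="${BDF:-0d:00.0}"
UNCOR_MASK_OFF="${UNCOR_MASK_OFF:-0x208}"

# clear_bit VALUE BIT
# Print VALUE with bit number BIT cleared, as 8 lowercase hex digits.
# VALUE must carry a 0x prefix so the shell parses it as hexadecimal.
clear_bit() {
    printf '%08x\n' $(( $1 & ~(1 << $2) ))
}

# unmask_internal_error (hypothetical helper name)
# Read the current mask with setpci, clear bit 22 only, write it back.
# setpci prints register values as bare hex, so the 0x prefix is added here.
unmask_internal_error() {
    cur=$(setpci -s "$BDF" "${UNCOR_MASK_OFF}.l") || return 1
    new=$(clear_bit "0x$cur" 22)
    setpci -s "$BDF" "${UNCOR_MASK_OFF}.l=$new"
}
```

Source the file as root in the guest and call unmask_internal_error; the
other mask bits keep whatever values they had before.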
>
> My qemu command line:
> qemu-system-x86_64 \
> -M q35,nvdimm=on,cxl=on \
> -m 4G \
> -smp 4 \
> -object memory-backend-ram,size=2G,id=mem0 \
> -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
> -object memory-backend-ram,size=2G,id=mem1 \
> -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
> -object memory-backend-ram,size=256M,id=cxl-mem0 \
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
> -device cxl-type3,bus=root_port0,volatile-memdev=cxl-mem0,id=cxl-mem0 \
> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k \
> -hda ../disk/ubuntu_x86_test_new.qcow2 \
> -nographic \
> -qmp tcp:127.0.0.1:4444,server,nowait \
>
> Qemu version: 8.2.50, the lastest commit of branch cxl-2024-03-05 in
> "https://gitlab.com/jic23/qemu"
> Kernel version: 6.8.0-rc6
>
> My steps in the Qemu qmp:
> 1) telnet 127.0.0.1 4444
>
> result:
> Trying 127.0.0.1...
> Connected to 127.0.0.1.
> Escape character is '^]'.
> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 2, "major": 8},
> "package": "v6.2.0-19482-gccfb4fe221"}, "capabilities": ["oob"]}}
>
> 2) { "execute": "qmp_capabilities" }
>
> result:
> {"return": {}}
>
> 3) If inject correctable error:
> { "execute": "cxl-inject-correctable-error",
>   "arguments": {
>     "path": "/machine/peripheral/cxl-mem0",
>     "type": "physical"
>   } }
>
> result:
> {"return": {}}
>
> 3) If inject uncorrectable error:
> { "execute": "cxl-inject-uncorrectable-errors",
>   "arguments": {
>     "path": "/machine/peripheral/cxl-mem0",
>     "errors": [
>       {
>         "type": "cache-address-parity",
>         "header": [ 3, 4]
>       },
>       {
>         "type": "cache-data-parity",
>         "header": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
>       },
>       {
>         "type": "internal",
>         "header": [ 1, 2, 4]
>       }
>     ]
>   }}
>
> result:
> {"return": {}}
> {"timestamp": {"seconds": 1709721640, "microseconds": 275345}, "event":
> "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
>
> Many thanks
> Yuquan
>