Re: Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]

Terry Bowman Wed, 06 Mar 2024 11:07:52 -0800

Hi Yuquan,

For your test, the first logging will come from the AER driver if 
everything is working correctly.


You may want to check whether the upstream PCI bridge's AER UIE/CIE
mask bits are set. That could prevent the error from being handled by the
OS's AER driver.
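
As a rough aid for reading those mask registers, the following is a minimal
Python sketch that decodes raw values such as the ones setpci returns at
AER+0x08 (Uncorrectable Error Mask) and AER+0x14 (Correctable Error Mask).
The bit names are assumed to follow the PCIe AER register layout (the same
bit values as the PCI_ERR_UNC_* and PCI_ERR_COR_* definitions in Linux's
pci_regs.h); the function names are hypothetical and exist only for
illustration. The sample values are the endpoint (0d:00.0) values quoted
further down in this thread; the bridge's registers have to be read
separately.

```python
# Sketch only: decode AER mask register values read with setpci.
# Bit positions are assumed from the PCIe AER capability layout.
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
}

COR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

UIE_BIT = 22  # Uncorrectable Internal Error
CIE_BIT = 14  # Corrected Internal Error


def decode_mask(value, names):
    """Return the names of all bits set in a 32-bit mask value."""
    return [
        names.get(bit, f"bit {bit} (not named in this sketch)")
        for bit in range(32)
        if value & (1 << bit)
    ]


def uie_masked(uncor_mask):
    """True if the Uncorrectable Internal Error bit is set in the mask."""
    return bool(uncor_mask & (1 << UIE_BIT))


def cie_masked(cor_mask):
    """True if the Corrected Internal Error bit is set in the mask."""
    return bool(cor_mask & (1 << CIE_BIT))


if __name__ == "__main__":
    # Values from "setpci -s 0d:00.0 0x208.l" and "0x214.l" in this thread.
    print("UNCOR mask:", decode_mask(0x02400000, UNCOR_BITS))
    print("COR mask:  ", decode_mask(0x0000E000, COR_BITS))
    print("UIE masked:", uie_masked(0x02400000))
    print("CIE masked:", cie_masked(0x0000E000))
```

Running the same decode on the values read from the upstream bridge
(0c:00.0 in this setup) would show whether UIE/CIE are masked there as well.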

Regards,
Terry

On 3/6/24 11:12, Terry Bowman wrote:
> Hi Yuquan and Jon,
> 
> I added responses inline below.
> 
> On 3/6/24 07:23, Jonathan Cameron wrote:
>> On Wed, 6 Mar 2024 19:27:07 +0800
>> Yuquan Wang <wangyuquan1...@phytium.com.cn> wrote:
>>
>>> Hello, Jonathan
>>>
>>> Recently I ran into some problems with CXL RAS tests.
>>>
>>> I tried to use the "cxl-inject-uncorrectable-errors" and
>>> "cxl-inject-correctable-error" qmp commands to inject CXL errors; however,
>>> the kernel printed nothing in my qemu machine. The qmp connection was also
>>> unstable, which made the machine always end with "terminating on signal 2".
>>
>> The qmp connection being unstable is odd - might be related to the CXL code,
>> but I'm not sure how.
>>
>>>
>>> In addition, I successfully used the hmp "pcie_aer_inject_error" command
>>> under the same conditions. The kernel printed the relevant information.
>>
>> IIRC the AER paths print under all circumstances whereas CXL errors do not,
>> they simply trigger tracepoints - but you should have seen device resets.
>>
>> However I spun up a test and I think the issue is more straightforward.
>> The uncorrectable internal error and correctable internal error are masked
>> on the device.
>> I thought we changed the default on this in Linux but maybe not :(
>>
> 
> The device's AER UIE/CIE mask bits can be set and device AER errors can
> still be handled. The device reports AER UIE/CIE to the root port/RCEC on
> behalf of device AER CRC, TLP, etc. errors.
>
> In earlier changes we added logic to clear the RCEC UIE/CIE mask in order
> to properly receive AER UIE/CIE notifications from devices and RCH dports.
> 
> "CXL Protocol and Link errors detected by components that are part of a CXL
> VH are escalated and reported using standard PCIe error reporting mechanisms
> over CXL.io as UIEs and/or CIEs. See PCIe Base Specification for details."[1]
> 
> [1] CXL3.1 12.2.1 - Protocol and Link Layer Error Reporting
> 
>> The hack is to find the relevant device with lspci -tv and then use
>> setpci -s 0d:00.0 0x208.l=0
>> to clear all the mask bits for uncorrectable errors.
>>
>> Note I tested this on a convenient arm64 setup so it's always possible there
>> is yet another problem on x86.
>>
>> Robert / Terry, I tracked down the patch where you enabled this for RCHs and
>> there was some discussion on walking out on VH as well to enable this, but it
>> seems it never happened. Can you remember why?  Just kicked back for a future
>> occasion?
>>
>> Jonathan
>>
>>
> 
> I tested (qemu x86) using the aer-inject tool and found it to work. The
> output below shows that the endpoint CIE is masked (0xe000 @ AER+0x14) and
> that the injected error is still properly handled, with root port logging
> and cxl_pci handler trace logs.
> 
>     # lspci | grep -i cxl
>     0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
>
>     # lspci -s 0d:00.0 -vvv | grep Advanced
>     Capabilities: [200 v2] Advanced Error Reporting
>
>     # setpci -s 0d:00.0 0x208.l
>     02400000
>
>     # setpci -s 0d:00.0 0x214.l
>     0000e000
>
>     # cat aer-input.txt
>     # Inject a correctable bad TLP error into the device with header log
>     # words 0 1 2 3.
>     #
>     # Either specify the PCI id on the command-line option or uncomment and edit
>     # the PCI_ID line below using the correct PCI ID.
>     #
>     # Note that system firmware/BIOS may mask certain errors and/or not report
>     # header log words.
>     #
>     AER
>     #PCI_ID 0000:0C.00.0
>     COR_STATUS BAD_TLP
>     HEADER_LOG 0 1 2 3
>
>     # ./aer-inject -s 0000:0d:00.0 aer-input.txt
>     [   72.850686] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0d:00.0
>     [   72.851784] pcieport 0000:0c:00.0: AER: Corrected error received: 0000:0d:00.0
>     [   72.852594] cxl_pci 0000:0d:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
>     [   72.853591] cxl_pci 0000:0d:00.0:   device [8086:0d93] error status/mask=00000040/0000e000
>     [   72.854277] cxl_pci 0000:0d:00.0:    [ 6] BadTLP
> 
> I have not tried to use cxl-inject-uncorrectable-errors or
> cxl-inject-correctable-error.
> 
> Regards,
> Terry
> 
>>>
>>> Questions:
>>> 1) Are my CXL RAS test operations standard?
>>> 2) Is the error injected by "pcie_aer_inject_error" a "protocol & link
>>>    error" of cxl.io?
>>>    Is the error injected by "cxl-inject-uncorrectable-errors" or
>>>    "cxl-inject-correctable-error" a "protocol & link error" of cxl.cachemem?
>>>
>>> I hope I can get some help here; any help will be greatly appreciated.
>>>
>>>
>>> My qemu command line:
>>> qemu-system-x86_64 \
>>> -M q35,nvdimm=on,cxl=on \
>>> -m 4G \
>>> -smp 4 \
>>> -object memory-backend-ram,size=2G,id=mem0 \
>>> -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
>>> -object memory-backend-ram,size=2G,id=mem1 \
>>> -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
>>> -object memory-backend-ram,size=256M,id=cxl-mem0 \
>>> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>>> -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
>>> -device cxl-type3,bus=root_port0,volatile-memdev=cxl-mem0,id=cxl-mem0 \
>>> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k \
>>> -hda ../disk/ubuntu_x86_test_new.qcow2 \
>>> -nographic \
>>> -qmp tcp:127.0.0.1:4444,server,nowait \
>>>
>>> Qemu version: 8.2.50, the latest commit of branch cxl-2024-03-05 in
>>> "https://gitlab.com/jic23/qemu"
>>> Kernel version: 6.8.0-rc6
>>>
>>> My steps in the Qemu qmp:
>>> 1) telnet 127.0.0.1 4444
>>>
>>> result:
>>> Trying 127.0.0.1...
>>> Connected to 127.0.0.1.
>>> Escape character is '^]'.
>>> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 2, "major": 8}, "package": "v6.2.0-19482-gccfb4fe221"}, "capabilities": ["oob"]}}
>>>
>>> 2) { "execute": "qmp_capabilities" }
>>>
>>> result:
>>> {"return": {}}
>>>
>>> 3) If injecting a correctable error:
>>> { "execute": "cxl-inject-correctable-error",
>>>     "arguments": {
>>>         "path": "/machine/peripheral/cxl-mem0",
>>>         "type": "physical"
>>>     } }
>>>
>>> result:
>>> {"return": {}}
>>>
>>> 3) If injecting uncorrectable errors:
>>> { "execute": "cxl-inject-uncorrectable-errors",
>>>   "arguments": {
>>>     "path": "/machine/peripheral/cxl-mem0",
>>>     "errors": [
>>>         {
>>>             "type": "cache-address-parity",
>>>             "header": [ 3, 4]
>>>         },
>>>         {
>>>             "type": "cache-data-parity",
>>>             "header": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
>>>         },
>>>         {
>>>             "type": "internal",
>>>             "header": [ 1, 2, 4]
>>>         }
>>>         ]
>>>   }}
>>>
>>> result:
>>> {"return": {}}
>>> {"timestamp": {"seconds": 1709721640, "microseconds": 275345}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
>>>
>>> Many thanks
>>> Yuquan
>>>
>>
