Re: [PATCH v4 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-04-24 Thread Shuai Xue
On 2025/3/17 14:02, Shuai Xue wrote: On 2025/3/3 12:33, Shuai Xue wrote: On 2025/3/3 11:43, Sathyanarayanan Kuppuswamy wrote: On 2/16/25 6:42 PM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error

Re: [PATCH v4 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-03-16 Thread Shuai Xue
On 2025/3/3 12:33, Shuai Xue wrote: On 2025/3/3 11:43, Sathyanarayanan Kuppuswamy wrote: On 2/16/25 6:42 PM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device

Re: [PATCH v4 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-03-02 Thread Shuai Xue
On 2025/3/3 11:43, Sathyanarayanan Kuppuswamy wrote: On 2/16/25 6:42 PM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. Consequently, when a fatal error

Re: [PATCH v4 2/3] PCI/DPC: Run recovery on device that detected the error

2025-03-02 Thread Shuai Xue
On 2025/3/3 11:36, Sathyanarayanan Kuppuswamy wrote: On 2/16/25 6:42 PM, Shuai Xue wrote: The current implementation of pcie_do_recovery() assumes that the recovery process is executed on the device that detected the error. However, the DPC driver currently passes the error port that

Re: [PATCH v4 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-03-02 Thread Shuai Xue
On 2025/2/17 10:42, Shuai Xue wrote: changes since v3: - squash patch 1 and 2 into one patch per Sathyanarayanan - add comments note for dpc_process_error per Sathyanarayanan - pick up Reviewed-by tag from Sathyanarayanan changes since v2: - moving the "err_port" rename to a separate

[PATCH v4 2/3] PCI/DPC: Run recovery on device that detected the error

2025-02-16 Thread Shuai Xue
-off-by: Shuai Xue --- drivers/pci/pci.h | 2 +- drivers/pci/pcie/dpc.c | 28 drivers/pci/pcie/edr.c | 7 --- 3 files changed, 29 insertions(+), 8 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 01e51db8d285..870d2fbd6ff2 100644 --- a

[PATCH v4 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-02-16 Thread Shuai Xue
Link Layer, (Receiver ID) nvme :34:00.0: device [144d:a804] error status/mask=0010/00504000 nvme :34:00.0:[ 4] DLP(First) pcieport :30:03.0: AER: broadcast slot_reset message Shuai Xue (3): PCI/DPC: Clarify naming for error port in DPC Handling PC

[PATCH v4 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-02-16 Thread Shuai Xue
:34:00.0: device [144d:a804] error status/mask=0010/00504000 nvme :34:00.0:[ 4] DLP(First) pcieport :30:03.0: AER: broadcast slot_reset message Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 3 ++- drivers/pci/pcie/aer.c | 11

[PATCH v4 1/3] PCI/DPC: Clarify naming for error port in DPC Handling

2025-02-16 Thread Shuai Xue
. Signed-off-by: Shuai Xue Reviewed-by: Kuppuswamy Sathyanarayanan --- drivers/pci/pcie/dpc.c | 10 +- drivers/pci/pcie/edr.c | 34 +- 2 files changed, 22 insertions(+), 22 deletions(-) diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index

Re: [PATCH v3 3/4] PCI/DPC: Run recovery on device that detected the error

2025-02-11 Thread Shuai Xue
On 2025/2/12 05:23, Sathyanarayanan Kuppuswamy wrote: On 2/7/25 1:34 AM, Shuai Xue wrote: The current implementation of pcie_do_recovery() assumes that the recovery process is executed on the device that detected the error. However, the DPC driver currently passes the error port that

Re: [PATCH v3 1/4] PCI/DPC: Rename pdev to err_port for dpc_handler

2025-02-11 Thread Shuai Xue
On 2025/2/12 05:17, Sathyanarayanan Kuppuswamy wrote: On 2/7/25 1:34 AM, Shuai Xue wrote: The irq handler is registered for error port which receives DPC interrupt. Rename pdev to err_port. No functional changes. Signed-off-by: Shuai Xue --- I think you can combine patch 1 & 2 into a si

[PATCH v3 1/4] PCI/DPC: Rename pdev to err_port for dpc_handler

2025-02-07 Thread Shuai Xue
The irq handler is registered for error port which receives DPC interrupt. Rename pdev to err_port. No functional changes. Signed-off-by: Shuai Xue --- drivers/pci/pcie/dpc.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie

[PATCH v3 0/4] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-02-07 Thread Shuai Xue
:03.0: AER: broadcast slot_reset message Shuai Xue (4): PCI/DPC: Rename pdev to err_port for dpc_handler PCI/EDR: Rename edev to err_port for edr_handle_event PCI/DPC: Run recovery on device that detected the error PCI/AER: Report fatal errors of RCiEP and EP if link recoverd drivers/pci/pci.h

[PATCH v3 4/4] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-02-07 Thread Shuai Xue
:34:00.0: device [144d:a804] error status/mask=0010/00504000 nvme :34:00.0:[ 4] DLP(First) pcieport :30:03.0: AER: broadcast slot_reset message Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 3 ++- drivers/pci/pcie/aer.c | 11

[PATCH v3 3/4] PCI/DPC: Run recovery on device that detected the error

2025-02-07 Thread Shuai Xue
device that detected the error. By passing this error device to pcie_do_recovery(), subsequent patches will be able to accurately access AER status of the error device. Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 2 +- drivers/pci/pcie/dpc.c | 25 + drivers/pci

[PATCH v3 2/4] PCI/EDR: Rename edev to err_port for edr_handle_event

2025-02-07 Thread Shuai Xue
acpi_dpc_port_get() locates the port that experienced the containment event. Rename edev to err_port for clarity so that a later patch avoids misusing err_port in pcie_do_recovery(). No functional changes. Suggested-by: Sathyanarayanan Kuppuswamy Signed-off-by: Shuai Xue --- drivers/pci/pcie

Re: [PATCH v2 2/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-01-23 Thread Shuai Xue
On 2025/1/24 04:10, Sathyanarayanan Kuppuswamy wrote: Hi, On 11/12/24 5:54 AM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. Consequently, when a fatal

Re: [PATCH v2 1/2] PCI/DPC: Run recovery on device that detected the error

2025-01-22 Thread Shuai Xue
On 2025/1/23 12:53, Sathyanarayanan Kuppuswamy wrote: On 11/12/24 5:54 AM, Shuai Xue wrote: The current implementation of pcie_do_recovery() assumes that the recovery process is executed on the device that detected the error. However, the DPC driver currently passes the error port that

Re: [PATCH v2 0/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-01-22 Thread Shuai Xue
Hi, all, Gentle ping. Best Regards, Shuai On 2024/12/24 19:03, Shuai Xue wrote: On 2024/11/12 21:54, Shuai Xue wrote: changes since v1: - rewrite commit log per Bjorn - refactor aer_get_device_error_info to reduce duplication per Keith - fix to avoid reporting fatal errors twice for root and

Re: [PATCH v2 0/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-12-24 Thread Shuai Xue
On 2024/11/12 21:54, Shuai Xue wrote: changes since v1: - rewrite commit log per Bjorn - refactor aer_get_device_error_info to reduce duplication per Keith - fix to avoid reporting fatal errors twice for root and downstream ports per Keith The AER driver has historically avoided reading the

Re: [PATCH v2 2/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-11-24 Thread Shuai Xue
On 2024/11/17 21:36, Shuai Xue wrote: On 2024/11/16 20:44, Shuai Xue wrote: On 2024/11/16 04:20, Bowman, Terry wrote: Hi Shuai, On 11/12/2024 7:54 AM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error

Re: [PATCH v2 2/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-11-17 Thread Shuai Xue
On 2024/11/16 20:44, Shuai Xue wrote: On 2024/11/16 04:20, Bowman, Terry wrote: Hi Shuai, On 11/12/2024 7:54 AM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device

Re: [PATCH v2 2/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-11-16 Thread Shuai Xue
On 2024/11/16 04:20, Bowman, Terry wrote: Hi Shuai, On 11/12/2024 7:54 AM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. Consequently, when a fatal

Re: [PATCH v2 2/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-11-15 Thread Shuai Xue
On 2024/11/15 17:06, Lukas Wunner wrote: On Tue, Nov 12, 2024 at 09:54:19PM +0800, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. It would be good if

[PATCH v2 1/2] PCI/DPC: Run recovery on device that detected the error

2024-11-12 Thread Shuai Xue
device that detected the error. By passing this error device to pcie_do_recovery(), subsequent patches will be able to accurately access AER status of the error device. Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 2 +- drivers/pci/pcie/dpc.c | 30 -- drivers

[PATCH v2 2/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-11-12 Thread Shuai Xue
:34:00.0: device [144d:a804] error status/mask=0010/00504000 nvme :34:00.0:[ 4] DLP(First) pcieport :30:03.0: AER: broadcast slot_reset message Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 3 ++- drivers/pci/pcie/aer.c | 11

[PATCH v2 0/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-11-12 Thread Shuai Xue
, (Receiver ID) nvme :34:00.0: device [144d:a804] error status/mask=0010/00504000 nvme :34:00.0:[ 4] DLP(First) pcieport :30:03.0: AER: broadcast slot_reset message Shuai Xue (2): PCI/DPC: Run recovery on device that detected the error PCI/AER

Re: [RFC PATCH v1 2/2] PCI/AER: report fatal errors of RCiEP and EP if link recoverd

2024-11-06 Thread Shuai Xue
On 2024/11/7 00:39, Keith Busch wrote: On Wed, Nov 06, 2024 at 05:03:39PM +0800, Shuai Xue wrote: +int aer_get_device_fatal_error_info(struct pci_dev *dev, struct aer_err_info *info) +{ + int type = pci_pcie_type(dev); + int aer = dev->aer_cap; + u32 aercc; + + pci_i

Re: [RFC PATCH v1 2/2] PCI/AER: report fatal errors of RCiEP and EP if link recoverd

2024-11-06 Thread Shuai Xue
On 2024/11/7 00:02, Bjorn Helgaas wrote: On Wed, Nov 06, 2024 at 05:03:39PM +0800, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. Consequently, when a

[RFC PATCH v1 0/2] PCI/AER: report fatal errors of RCiEP and EP if link recoverd

2024-11-06 Thread Shuai Xue
[144d:a804] error status/mask=0010/00504000 [ 414.815305] nvme :34:00.0:[ 4] DLP(First) [ 414.821768] pcieport :30:03.0: AER: broadcast slot_reset message Shuai Xue (2): PCI/AER: run recovery on device that detected the error PCI/AER: report fatal errors of

[RFC PATCH v1 2/2] PCI/AER: report fatal errors of RCiEP and EP if link recoverd

2024-11-06 Thread Shuai Xue
) [ 414.821768] pcieport :30:03.0: AER: broadcast slot_reset message Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 1 + drivers/pci/pcie/aer.c | 50 ++ drivers/pci/pcie/err.c | 6 + 3 files changed, 57 insertions(+) diff --git a/drivers/pci

[RFC PATCH v1 1/2] PCI/AER: run recovery on device that detected the error

2024-11-06 Thread Shuai Xue
device that detected the error. By passing this error-detecting device to pcie_do_recovery(), subsequent patches will be able to accurately access the AER error status. Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 2 +- drivers/pci/pcie/dpc.c | 30 -- drivers

Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-26 Thread Shuai Xue
On 2023/9/27 07:02, Bjorn Helgaas wrote: > On Fri, Sep 22, 2023 at 10:46:36AM +0800, Shuai Xue wrote: >> ... > >> Actually, this is a question from my colleague from firmware team. >> The original question is that: >> >> "Should I set CPER_SEV_FA

Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-24 Thread Shuai Xue
On 2023/9/21 21:20, David Laight wrote: > ... > I've got a target to generate AER errors by generating read cycles > that are inside the address range that the bridge forwards but > outside of any BAR because there are 2 different sized BARs. > (Pretty easy to setup.) > On the system I was using

Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-21 Thread Shuai Xue
+ @Rafael for the APEI/GHES part. On 2023/9/22 05:52, Bjorn Helgaas wrote: > On Thu, Sep 21, 2023 at 08:10:19PM +0800, Shuai Xue wrote: >> On 2023/9/21 07:02, Bjorn Helgaas wrote: >>> On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote: >> ... > >>

Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-21 Thread Shuai Xue
On 2023/9/21 07:02, Bjorn Helgaas wrote: > On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote: >> Hi, all folks, >> >> Error reporting and recovery are one of the important features of PCIe, and >> the kernel has been supporting them since version 2.6, 17 year

Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-18 Thread Shuai Xue
Hi, all folks, Error reporting and recovery are one of the important features of PCIe, and the kernel has been supporting them since version 2.6, 17 years ago. I am very curious about the expected behavior of the software. I first recap the error classification and then list my questions below it

Re: [PATCH v3 0/2] Copy-on-write poison recovery

2022-10-25 Thread Shuai Xue
On 2022/10/23 11:52 PM, Shuai Xue wrote: > > > On 2022/10/22 4:01 AM, Tony Luck wrote: >> Part 1 deals with the process that triggered the copy on write >> fault with a store to a shared read-only page. That process is >> sent a SIGBUS with the usual machine check decoratio

Re: [PATCH v3 0/2] Copy-on-write poison recovery

2022-10-23 Thread Shuai Xue
th the scope. > > Part 2 sets up to asynchronously take the page with the uncorrected > error offline to prevent additional machine check faults. H/t to > Miaohe Lin and Shuai Xue > for pointing me to the existing function to queue a call to > memory_failure(). > > On x

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-23 Thread Shuai Xue
On 2022/10/22 12:30 AM, Luck, Tony wrote: >>> But maybe it is some RMW instruction ... then, if all the above options >>> didn't happen ... we >>> could get another machine check from the same address. But then we just >>> follow the usual >>> recovery path. > > >> Let's assume the instruction that

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-21 Thread Shuai Xue
On 2022/10/21 12:41 PM, Luck, Tony wrote: >>> When we do return to user mode the task is going to be busy servicing >>> a SIGBUS ... so shouldn't try to touch the poison page before the >>> memory_failure() called by the worker thread cleans things up. >> >> What about an RT process on a busy system?

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue
On 2022/10/21 12:08 PM, Tony Luck wrote: > On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote: >> >> >> On 2022/10/21 4:05 AM, Tony Luck wrote: >>> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >>>> >>>> >>>> On 20

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue
On 2022/10/21 4:05 AM, Tony Luck wrote: > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >> >> >> On 2022/10/20 1:08 AM, Tony Luck wrote: >>> If the kernel is copying a page as the result of a copy-on-write >>> fault and runs into an uncorrectable erro

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue
On 2022/10/20 1:08 AM, Tony Luck wrote: > If the kernel is copying a page as the result of a copy-on-write > fault and runs into an uncorrectable error, Linux will crash because > it does not have recovery code for this case where poison is consumed > by the kernel. > > It is easy to set up a test c