Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-18 Thread Shuai Xue
Hi, all folks, Error reporting and recovery are one of the important features of PCIe, and the kernel has been supporting them since version 2.6, 17 years ago. I am very curious about the expected behavior of the software. I first recap the error classification and then list my questions below it

Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-21 Thread Shuai Xue
On 2023/9/21 07:02, Bjorn Helgaas wrote: > On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote: >> Hi, all folks, >> >> Error reporting and recovery are one of the important features of PCIe, and >> the kernel has been supporting them since version 2.6, 17 year

Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-21 Thread Shuai Xue
+ @Rafael for the APEI/GHES part. On 2023/9/22 05:52, Bjorn Helgaas wrote: > On Thu, Sep 21, 2023 at 08:10:19PM +0800, Shuai Xue wrote: >> On 2023/9/21 07:02, Bjorn Helgaas wrote: >>> On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote: >> ... > >>

Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-24 Thread Shuai Xue
On 2023/9/21 21:20, David Laight wrote: > ... > I've got a target to generate AER errors by generating read cycles > that are inside the address range that the bridge forwards but > outside of any BAR because there are 2 different sized BARs. > (Pretty easy to setup.) > On the system I was using

Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-26 Thread Shuai Xue
On 2023/9/27 07:02, Bjorn Helgaas wrote: > On Fri, Sep 22, 2023 at 10:46:36AM +0800, Shuai Xue wrote: >> ... > >> Actually, this is a question from my colleague from firmware team. >> The original question is that: >> >> "Should I set CPER_SEV_FA

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue
On 2022/10/20 AM1:08, Tony Luck wrote: > If the kernel is copying a page as the result of a copy-on-write > fault and runs into an uncorrectable error, Linux will crash because > it does not have recovery code for this case where poison is consumed > by the kernel. > > It is easy to set up a test c

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue
On 2022/10/21 AM4:05, Tony Luck wrote: > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >> >> >> On 2022/10/20 AM1:08, Tony Luck wrote: >>> If the kernel is copying a page as the result of a copy-on-write >>> fault and runs into an uncorrectable erro

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue
On 2022/10/21 PM12:08, Tony Luck wrote: > On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote: >> >> >> On 2022/10/21 AM4:05, Tony Luck wrote: >>> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >>>> >>>> >>>> On 20

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-21 Thread Shuai Xue
On 2022/10/21 PM12:41, Luck, Tony wrote: >>> When we do return to user mode the task is going to be busy servicing >>> a SIGBUS ... so shouldn't try to touch the poison page before the >>> memory_failure() called by the worker thread cleans things up. >> >> What about an RT process on a busy system?

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-23 Thread Shuai Xue
On 2022/10/22 AM12:30, Luck, Tony wrote: >>> But maybe it is some RMW instruction ... then, if all the above options >>> didn't happen ... we >>> could get another machine check from the same address. But then we just >>> follow the usual >>> recovery path. > > >> Let assume the instruction that

Re: [PATCH v3 0/2] Copy-on-write poison recovery

2022-10-23 Thread Shuai Xue
th the scope. > > Part 2 sets up to asynchronously take the page with the uncorrected > error offline to prevent additional machine check faults. H/t to > Miaohe Lin and Shuai Xue > for pointing me to the existing function to queue a call to > memory_failure(). > > On x

Re: [PATCH v3 0/2] Copy-on-write poison recovery

2022-10-25 Thread Shuai Xue
On 2022/10/23 PM11:52, Shuai Xue wrote: > > > On 2022/10/22 AM4:01, Tony Luck wrote: >> Part 1 deals with the process that triggered the copy on write >> fault with a store to a shared read-only page. That process is >> send a SIGBUS with the usual machine check decoratio

[PATCH v2 1/2] PCI/DPC: Run recovery on device that detected the error

2024-11-12 Thread Shuai Xue
device that detected the error. By passing this error device to pcie_do_recovery(), subsequent patches will be able to accurately access AER status of the error device. Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 2 +- drivers/pci/pcie/dpc.c | 30 -- drivers

[PATCH v2 2/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-11-12 Thread Shuai Xue
:34:00.0: device [144d:a804] error status/mask=0010/00504000 nvme :34:00.0:[ 4] DLP(First) pcieport :30:03.0: AER: broadcast slot_reset message Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 3 ++- drivers/pci/pcie/aer.c | 11

[PATCH v2 0/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-11-12 Thread Shuai Xue
, (Receiver ID) nvme :34:00.0: device [144d:a804] error status/mask=0010/00504000 nvme :34:00.0:[ 4] DLP(First) pcieport :30:03.0: AER: broadcast slot_reset message Shuai Xue (2): PCI/DPC: Run recovery on device that detected the error PCI/AER

[RFC PATCH v1 1/2] PCI/AER: run recovery on device that detected the error

2024-11-06 Thread Shuai Xue
device that detected the error. By passing this error-detecting device to pcie_do_recovery(), subsequent patches will be able to accurately access the AER error status. Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 2 +- drivers/pci/pcie/dpc.c | 30 -- drivers

[RFC PATCH v1 2/2] PCI/AER: report fatal errors of RCiEP and EP if link recoverd

2024-11-06 Thread Shuai Xue
) [ 414.821768] pcieport :30:03.0: AER: broadcast slot_reset message Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 1 + drivers/pci/pcie/aer.c | 50 ++ drivers/pci/pcie/err.c | 6 + 3 files changed, 57 insertions(+) diff --git a/drivers/pci

[RFC PATCH v1 0/2] PCI/AER: report fatal errors of RCiEP and EP if link recoverd

2024-11-06 Thread Shuai Xue
[144d:a804] error status/mask=0010/00504000 [ 414.815305] nvme :34:00.0:[ 4] DLP(First) [ 414.821768] pcieport :30:03.0: AER: broadcast slot_reset message Shuai Xue (2): PCI/AER: run recovery on device that detected the error PCI/AER: report fatal errors of

Re: [RFC PATCH v1 2/2] PCI/AER: report fatal errors of RCiEP and EP if link recoverd

2024-11-06 Thread Shuai Xue
On 2024/11/7 00:39, Keith Busch wrote: On Wed, Nov 06, 2024 at 05:03:39PM +0800, Shuai Xue wrote: +int aer_get_device_fatal_error_info(struct pci_dev *dev, struct aer_err_info *info) +{ + int type = pci_pcie_type(dev); + int aer = dev->aer_cap; + u32 aercc; + + pci_i

Re: [RFC PATCH v1 2/2] PCI/AER: report fatal errors of RCiEP and EP if link recoverd

2024-11-06 Thread Shuai Xue
On 2024/11/7 00:02, Bjorn Helgaas wrote: On Wed, Nov 06, 2024 at 05:03:39PM +0800, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. Consequently, when a

Re: [PATCH v2 2/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-11-24 Thread Shuai Xue
On 2024/11/17 21:36, Shuai Xue wrote: On 2024/11/16 20:44, Shuai Xue wrote: On 2024/11/16 04:20, Bowman, Terry wrote: Hi Shuai, On 11/12/2024 7:54 AM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error

Re: [PATCH v2 2/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-11-15 Thread Shuai Xue
On 2024/11/15 17:06, Lukas Wunner wrote: On Tue, Nov 12, 2024 at 09:54:19PM +0800, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. It would be good if

Re: [PATCH v2 2/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-11-17 Thread Shuai Xue
On 2024/11/16 20:44, Shuai Xue wrote: On 2024/11/16 04:20, Bowman, Terry wrote: Hi Shuai, On 11/12/2024 7:54 AM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device

Re: [PATCH v2 2/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-11-16 Thread Shuai Xue
On 2024/11/16 04:20, Bowman, Terry wrote: Hi Shuai, On 11/12/2024 7:54 AM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. Consequently, when a fatal

Re: [PATCH v2 0/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2024-12-24 Thread Shuai Xue
On 2024/11/12 21:54, Shuai Xue wrote: changes since v1: - rewrite commit log per Bjorn - refactor aer_get_device_error_info to reduce duplication per Keith - fix to avoid reporting fatal errors twice for root and downstream ports per Keith The AER driver has historically avoided reading the

[PATCH v3 2/4] PCI/EDR: Rename edev to err_port for edr_handle_event

2025-02-07 Thread Shuai Xue
acpi_dpc_port_get() locates the port that experienced the containment event. Rename edev to err_port for clarity so that a later patch will avoid misusing err_port in pcie_do_recovery(). No functional changes. Suggested-by: Sathyanarayanan Kuppuswamy Signed-off-by: Shuai Xue --- drivers/pci/pcie

[PATCH v3 3/4] PCI/DPC: Run recovery on device that detected the error

2025-02-07 Thread Shuai Xue
device that detected the error. By passing this error device to pcie_do_recovery(), subsequent patches will be able to accurately access AER status of the error device. Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 2 +- drivers/pci/pcie/dpc.c | 25 + drivers/pci

[PATCH v3 4/4] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-02-07 Thread Shuai Xue
:34:00.0: device [144d:a804] error status/mask=0010/00504000 nvme :34:00.0:[ 4] DLP(First) pcieport :30:03.0: AER: broadcast slot_reset message Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 3 ++- drivers/pci/pcie/aer.c | 11

[PATCH v3 0/4] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-02-07 Thread Shuai Xue
:03.0: AER: broadcast slot_reset message Shuai Xue (4): PCI/DPC: Rename pdev to err_port for dpc_handler PCI/EDR: Rename edev to err_port for edr_handle_event PCI/DPC: Run recovery on device that detected the error PCI/AER: Report fatal errors of RCiEP and EP if link recoverd drivers/pci/pci.h

[PATCH v3 1/4] PCI/DPC: Rename pdev to err_port for dpc_handler

2025-02-07 Thread Shuai Xue
The irq handler is registered for the error port which receives the DPC interrupt. Rename pdev to err_port. No functional changes. Signed-off-by: Shuai Xue --- drivers/pci/pcie/dpc.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie

Re: [PATCH v2 0/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-01-22 Thread Shuai Xue
Hi, all, Gentle ping. Best Regards, Shuai On 2024/12/24 19:03, Shuai Xue wrote: On 2024/11/12 21:54, Shuai Xue wrote: changes since v1: - rewrite commit log per Bjorn - refactor aer_get_device_error_info to reduce duplication per Keith - fix to avoid reporting fatal errors twice for root and

Re: [PATCH v2 1/2] PCI/DPC: Run recovery on device that detected the error

2025-01-22 Thread Shuai Xue
On 2025/1/23 12:53, Sathyanarayanan Kuppuswamy wrote: On 11/12/24 5:54 AM, Shuai Xue wrote: The current implementation of pcie_do_recovery() assumes that the recovery process is executed on the device that detected the error. However, the DPC driver currently passes the error port that

Re: [PATCH v2 2/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-01-23 Thread Shuai Xue
On 2025/1/24 04:10, Sathyanarayanan Kuppuswamy wrote: Hi, On 11/12/24 5:54 AM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. Consequently, when a fatal

[PATCH v4 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-02-16 Thread Shuai Xue
:34:00.0: device [144d:a804] error status/mask=0010/00504000 nvme :34:00.0:[ 4] DLP(First) pcieport :30:03.0: AER: broadcast slot_reset message Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 3 ++- drivers/pci/pcie/aer.c | 11

[PATCH v4 1/3] PCI/DPC: Clarify naming for error port in DPC Handling

2025-02-16 Thread Shuai Xue
. Signed-off-by: Shuai Xue Reviewed-by: Kuppuswamy Sathyanarayanan --- drivers/pci/pcie/dpc.c | 10 +- drivers/pci/pcie/edr.c | 34 +- 2 files changed, 22 insertions(+), 22 deletions(-) diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index

[PATCH v4 2/3] PCI/DPC: Run recovery on device that detected the error

2025-02-16 Thread Shuai Xue
-off-by: Shuai Xue --- drivers/pci/pci.h | 2 +- drivers/pci/pcie/dpc.c | 28 drivers/pci/pcie/edr.c | 7 --- 3 files changed, 29 insertions(+), 8 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 01e51db8d285..870d2fbd6ff2 100644 --- a

[PATCH v4 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-02-16 Thread Shuai Xue
Link Layer, (Receiver ID) nvme :34:00.0: device [144d:a804] error status/mask=0010/00504000 nvme :34:00.0:[ 4] DLP(First) pcieport :30:03.0: AER: broadcast slot_reset message Shuai Xue (3): PCI/DPC: Clarify naming for error port in DPC Handling PC

Re: [PATCH v3 1/4] PCI/DPC: Rename pdev to err_port for dpc_handler

2025-02-11 Thread Shuai Xue
On 2025/2/12 05:17, Sathyanarayanan Kuppuswamy wrote: On 2/7/25 1:34 AM, Shuai Xue wrote: The irq handler is registered for the error port which receives the DPC interrupt. Rename pdev to err_port. No functional changes. Signed-off-by: Shuai Xue --- I think you can combine patch 1 & 2 into a si

Re: [PATCH v3 3/4] PCI/DPC: Run recovery on device that detected the error

2025-02-11 Thread Shuai Xue
On 2025/2/12 05:23, Sathyanarayanan Kuppuswamy wrote: On 2/7/25 1:34 AM, Shuai Xue wrote: The current implementation of pcie_do_recovery() assumes that the recovery process is executed on the device that detected the error. However, the DPC driver currently passes the error port that

Re: [PATCH v4 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-03-16 Thread Shuai Xue
On 2025/3/3 12:33, Shuai Xue wrote: On 2025/3/3 11:43, Sathyanarayanan Kuppuswamy wrote: On 2/16/25 6:42 PM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device

Re: [PATCH v4 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-04-24 Thread Shuai Xue
On 2025/3/17 14:02, Shuai Xue wrote: On 2025/3/3 12:33, Shuai Xue wrote: On 2025/3/3 11:43, Sathyanarayanan Kuppuswamy wrote: On 2/16/25 6:42 PM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error

Re: [PATCH v4 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-03-02 Thread Shuai Xue
On 2025/3/3 11:43, Sathyanarayanan Kuppuswamy wrote: On 2/16/25 6:42 PM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. Consequently, when a fatal error

Re: [PATCH v4 2/3] PCI/DPC: Run recovery on device that detected the error

2025-03-02 Thread Shuai Xue
On 2025/3/3 11:36, Sathyanarayanan Kuppuswamy wrote: On 2/16/25 6:42 PM, Shuai Xue wrote: The current implementation of pcie_do_recovery() assumes that the recovery process is executed on the device that detected the error. However, the DPC driver currently passes the error port that

Re: [PATCH v4 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-03-02 Thread Shuai Xue
On 2025/2/17 10:42, Shuai Xue wrote: changes since v3: - squash patch 1 and 2 into one patch per Sathyanarayanan - add comments note for dpc_process_error per Sathyanarayanan - pick up Reviewed-by tag from Sathyanarayanan changes since v2: - moving the "err_port" rename to a separate

Re: [PATCH v4 2/3] PCI/DPC: Run recovery on device that detected the error

2025-06-18 Thread Shuai Xue
On 2025/6/12 18:31, Manivannan Sadhasivam wrote: On Mon, Feb 17, 2025 at 10:42:17AM +0800, Shuai Xue wrote: The current implementation of pcie_do_recovery() assumes that the recovery process is executed on the device that detected the error. s/on/for However, the DPC driver currently passes

Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors

2025-07-24 Thread Shuai Xue
Hi, Breno, On 2025/7/23 00:56, Breno Leitao wrote: Introduce a generic infrastructure for tracking recoverable hardware errors (HW errors that did not cause a panic) and record them for vmcore consumption. This aids post-mortem crash analysis tools by preserving a count and timestamp for the last oc

Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors

2025-07-25 Thread Shuai Xue
On 2025/7/24 21:34, Breno Leitao wrote: Hello Shuai, On Thu, Jul 24, 2025 at 04:00:09PM +0800, Shuai Xue wrote: On 2025/7/23 00:56, Breno Leitao wrote: Introduce a generic infrastructure for tracking recoverable hardware errors (HW errors that did not cause a panic) and record them for vmcore

Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors

2025-07-30 Thread Shuai Xue
On 2025/7/30 21:11, Breno Leitao wrote: Hello Shuai, On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote: In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and CPER_SEV_RECOVERABLE errors: Thanks. I was reading this code a bit more, and I want to make sure my understandi

Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors

2025-07-27 Thread Shuai Xue
On 2025/7/26 00:16, Breno Leitao wrote: Hello Shuai, On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote: APEI does not define an error type named GHES. GHES is just a kernel driver name. Many hardware error types can be handled in GHES (see ghes_do_proc), for example, AER is routed by

Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors

2025-07-29 Thread Shuai Xue
On 2025/7/29 21:48, Breno Leitao wrote: On Mon, Jul 28, 2025 at 09:08:25AM +0800, Shuai Xue wrote: On 2025/7/26 00:16, Breno Leitao wrote: On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote: enum hwerr_error_type { HWERR_RECOV_MCE, // maps to errors in