Hi, all folks,
Error reporting and recovery are among the important features of PCIe, and
the kernel has supported them since version 2.6, 17 years ago.
I am very curious about the expected behavior of the software.
I first recap the error classification and then list my questions below it.
On 2023/9/21 07:02, Bjorn Helgaas wrote:
> On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote:
>> Hi, all folks,
>>
>> Error reporting and recovery are one of the important features of PCIe, and
>> the kernel has been supporting them since version 2.6, 17 years ago.
+ @Rafael for the APEI/GHES part.
On 2023/9/22 05:52, Bjorn Helgaas wrote:
> On Thu, Sep 21, 2023 at 08:10:19PM +0800, Shuai Xue wrote:
>> On 2023/9/21 07:02, Bjorn Helgaas wrote:
>>> On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote:
>> ...
>
>>
On 2023/9/21 21:20, David Laight wrote:
> ...
> I've got a target to generate AER errors by generating read cycles
> that are inside the address range that the bridge forwards but
> outside of any BAR because there are 2 different sized BARs.
> (Pretty easy to setup.)
> On the system I was using
On 2023/9/27 07:02, Bjorn Helgaas wrote:
> On Fri, Sep 22, 2023 at 10:46:36AM +0800, Shuai Xue wrote:
>> ...
>
>> Actually, this is a question from my colleague from firmware team.
>> The original question is that:
>>
>> "Should I set CPER_SEV_FA
On 2022/10/20 1:08 AM, Tony Luck wrote:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
>
> It is easy to set up a test c
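Tony's scenario above — a plain memcpy() in the COW path crashes when it consumes poison — can be made concrete with a small userspace sketch of the machine-check-safe copy pattern (the kernel's copy_mc_to_kernel() behaves similarly in that it returns the number of bytes not copied; everything below, including the simulated poison offset, is hypothetical illustration, not the actual kernel code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the "machine-check-safe copy" pattern: instead of a plain
 * memcpy() that crashes when it consumes poison, the copy returns the
 * number of bytes NOT copied, so the caller can fail the fault
 * gracefully. poison_at simulates the offset at which an uncorrectable
 * error would be consumed (-1 = no poison).
 */
static size_t copy_mc_sketch(void *dst, const void *src, size_t len,
                             long poison_at)
{
    size_t safe = (poison_at < 0 || (size_t)poison_at >= len)
                      ? len : (size_t)poison_at;
    memcpy(dst, src, safe);        /* copy up to the poisoned byte */
    return len - safe;             /* bytes left uncopied */
}

/* Hypothetical COW handler: 0 on success, -1 (think EHWPOISON) when
 * the source page could not be fully copied. */
static int cow_copy_page_sketch(void *dst, const void *src, size_t len,
                                long poison_at)
{
    return copy_mc_sketch(dst, src, len, poison_at) ? -1 : 0;
}
```

The caller would then return a hwpoison-style fault result to user space instead of crashing the kernel.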
On 2022/10/21 4:05 AM, Tony Luck wrote:
> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>
>>
>> On 2022/10/20 1:08 AM, Tony Luck wrote:
>>> If the kernel is copying a page as the result of a copy-on-write
>>> fault and runs into an uncorrectable error
On 2022/10/21 12:08 PM, Tony Luck wrote:
> On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote:
>>
>>
>> On 2022/10/21 4:05 AM, Tony Luck wrote:
>>> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>>>
>>>>
>>>> On 20
On 2022/10/21 12:41 PM, Luck, Tony wrote:
>>> When we do return to user mode the task is going to be busy servicing
>>> a SIGBUS ... so shouldn't try to touch the poison page before the
>>> memory_failure() called by the worker thread cleans things up.
>>
>> What about an RT process on a busy system?
On 2022/10/22 12:30 AM, Luck, Tony wrote:
>>> But maybe it is some RMW instruction ... then, if all the above options
>>> didn't happen ... we
>>> could get another machine check from the same address. But then we just
>>> follow the usual
>>> recovery path.
>
>
>> Let's assume the instruction that
th the scope.
>
> Part 2 sets up to asynchronously take the page with the uncorrected
> error offline to prevent additional machine check faults. H/t to
> Miaohe Lin and Shuai Xue
> for pointing me to the existing function to queue a call to
> memory_failure().
>
> On x
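The "queue a call to memory_failure()" step above can be sketched as a tiny ring buffer drained later by a worker, loosely modeled on the kernel's memory_failure_queue(); all names and the fixed queue size are illustrative, not the kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal sketch of deferring memory_failure() to a worker: the machine
 * check handler only enqueues the failing pfn; a worker drains the queue
 * later, in a context where taking the page offline is safe. */
#define MF_QUEUE_LEN 8

static uint64_t mf_queue[MF_QUEUE_LEN];
static int mf_head, mf_tail;

static int mf_queue_sketch(uint64_t pfn)    /* called from "MCE" context */
{
    int next = (mf_head + 1) % MF_QUEUE_LEN;
    if (next == mf_tail)
        return -1;                          /* queue full, drop */
    mf_queue[mf_head] = pfn;
    mf_head = next;
    return 0;
}

static int mf_drain_sketch(uint64_t *pfn)   /* called from the worker */
{
    if (mf_tail == mf_head)
        return 0;                           /* nothing pending */
    *pfn = mf_queue[mf_tail];
    mf_tail = (mf_tail + 1) % MF_QUEUE_LEN;
    return 1;                               /* one pfn handed out */
}
```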
On 2022/10/23 11:52 PM, Shuai Xue wrote:
>
>
> On 2022/10/22 4:01 AM, Tony Luck wrote:
>> Part 1 deals with the process that triggered the copy on write
>> fault with a store to a shared read-only page. That process is
>> sent a SIGBUS with the usual machine check decoration
device that detected the
error. By passing this error device to pcie_do_recovery(), subsequent
patches will be able to accurately access the AER status of the error device.
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 2 +-
drivers/pci/pcie/dpc.c | 30 --
drivers
:34:00.0: device [144d:a804] error status/mask=0010/00504000
nvme :34:00.0:[ 4] DLP(First)
pcieport :30:03.0: AER: broadcast slot_reset message
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 3 ++-
drivers/pci/pcie/aer.c | 11
, (Receiver ID)
nvme :34:00.0: device [144d:a804] error status/mask=0010/00504000
nvme :34:00.0:[ 4] DLP(First)
pcieport :30:03.0: AER: broadcast slot_reset message
Shuai Xue (2):
PCI/DPC: Run recovery on device that detected the error
PCI/AER
device that detected the
error. By passing this error-detecting device to pcie_do_recovery(), subsequent
patches will be able to accurately access the AER error status.
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 2 +-
drivers/pci/pcie/dpc.c | 30 --
drivers
)
[ 414.821768] pcieport :30:03.0: AER: broadcast slot_reset message
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 1 +
drivers/pci/pcie/aer.c | 50 ++
drivers/pci/pcie/err.c | 6 +
3 files changed, 57 insertions(+)
diff --git a/drivers/pci
[144d:a804] error
status/mask=0010/00504000
[ 414.815305] nvme :34:00.0:[ 4] DLP(First)
[ 414.821768] pcieport :30:03.0: AER: broadcast slot_reset message
Shuai Xue (2):
PCI/AER: Run recovery on device that detected the error
PCI/AER: report fatal errors of
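The "broadcast slot_reset message" lines in the logs above correspond to pcie_do_recovery() invoking each affected driver's error callback on the devices below the port. A simplified sketch of that broadcast (the types are stand-ins for the kernel's struct pci_error_handlers, not its real API):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of broadcasting a recovery callback to every device under a
 * port, in the spirit of pcie_do_recovery()'s "broadcast slot_reset":
 * if any driver cannot recover, the whole slot recovery fails. */
enum recov_result { RECOV_RECOVERED, RECOV_DISCONNECT };

struct dev_sketch {
    enum recov_result (*slot_reset)(struct dev_sketch *dev);
};

static enum recov_result
broadcast_slot_reset(struct dev_sketch **devs, size_t n)
{
    enum recov_result status = RECOV_RECOVERED;

    for (size_t i = 0; i < n; i++) {
        if (!devs[i]->slot_reset)
            continue;                     /* driver has no handler */
        if (devs[i]->slot_reset(devs[i]) == RECOV_DISCONNECT)
            status = RECOV_DISCONNECT;    /* any failure fails recovery */
    }
    return status;
}

/* Example handlers, purely illustrative. */
static enum recov_result cb_ok(struct dev_sketch *d)   { (void)d; return RECOV_RECOVERED; }
static enum recov_result cb_fail(struct dev_sketch *d) { (void)d; return RECOV_DISCONNECT; }
```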
On 2024/11/7 00:39, Keith Busch wrote:
On Wed, Nov 06, 2024 at 05:03:39PM +0800, Shuai Xue wrote:
+int aer_get_device_fatal_error_info(struct pci_dev *dev, struct aer_err_info *info)
+{
+ int type = pci_pcie_type(dev);
+ int aer = dev->aer_cap;
+ u32 aercc;
+
+ pci_i
On 2024/11/7 00:02, Bjorn Helgaas wrote:
On Wed, Nov 06, 2024 at 05:03:39PM +0800, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of an
endpoint or RCiEP that reported a fatal error, considering the link to that
device unreliable. Consequently, when a
On 2024/11/17 21:36, Shuai Xue wrote:
On 2024/11/16 20:44, Shuai Xue wrote:
On 2024/11/16 04:20, Bowman, Terry wrote:
Hi Shuai,
On 11/12/2024 7:54 AM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error
On 2024/11/15 17:06, Lukas Wunner wrote:
On Tue, Nov 12, 2024 at 09:54:19PM +0800, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device unreliable.
It would be good if
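The behavior being proposed in this thread — read the endpoint's AER status only once the link has come back after reset, and treat an all-ones value as a failed read — can be sketched like this (the config-space read is mocked and all names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: only read the AER status of an endpoint that reported a
 * fatal error after the link has recovered, and treat an all-ones
 * read as "device unreachable". */
struct ep_sketch {
    int link_recovered;        /* did reset bring the link back up?  */
    uint32_t aer_uncor_status; /* value a config read would return   */
};

static int read_fatal_aer_status(struct ep_sketch *ep, uint32_t *status)
{
    if (!ep->link_recovered)
        return -1;                    /* link dead: nothing to read  */
    *status = ep->aer_uncor_status;
    if (*status == 0xffffffffu)
        return -1;                    /* all-ones: config read failed */
    return 0;
}
```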
On 2024/11/16 20:44, Shuai Xue wrote:
On 2024/11/16 04:20, Bowman, Terry wrote:
Hi Shuai,
On 11/12/2024 7:54 AM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device
On 2024/11/16 04:20, Bowman, Terry wrote:
Hi Shuai,
On 11/12/2024 7:54 AM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device unreliable. Consequently, when a fatal
On 2024/11/12 21:54, Shuai Xue wrote:
changes since v1:
- rewrite commit log per Bjorn
- refactor aer_get_device_error_info to reduce duplication per Keith
- fix to avoid reporting fatal errors twice for root and downstream ports per
Keith
The AER driver has historically avoided reading the
acpi_dpc_port_get() locates the port that experienced the containment
event. Rename edev to err_port for clarity so that a later patch will avoid
misusing err_port in pcie_do_recovery().
No functional changes.
Suggested-by: Sathyanarayanan Kuppuswamy
Signed-off-by: Shuai Xue
---
drivers/pci/pcie
device that detected the
error. By passing this error device to pcie_do_recovery(), subsequent
patches will be able to accurately access AER status of the error device.
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 2 +-
drivers/pci/pcie/dpc.c | 25 +
drivers/pci
:34:00.0: device [144d:a804] error status/mask=0010/00504000
nvme :34:00.0:[ 4] DLP(First)
pcieport :30:03.0: AER: broadcast slot_reset message
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 3 ++-
drivers/pci/pcie/aer.c | 11
:03.0: AER: broadcast slot_reset message
Shuai Xue (4):
PCI/DPC: Rename pdev to err_port for dpc_handler
PCI/EDR: Rename edev to err_port for edr_handle_event
PCI/DPC: Run recovery on device that detected the error
PCI/AER: Report fatal errors of RCiEP and EP if link recovered
drivers/pci/pci.h
The IRQ handler is registered for the error port which receives the DPC
interrupt. Rename pdev to err_port.
No functional changes.
Signed-off-by: Shuai Xue
---
drivers/pci/pcie/dpc.c | 10 +-
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie
Hi, all,
Gentle ping.
Best Regards,
Shuai
On 2024/12/24 19:03, Shuai Xue wrote:
On 2024/11/12 21:54, Shuai Xue wrote:
changes since v1:
- rewrite commit log per Bjorn
- refactor aer_get_device_error_info to reduce duplication per Keith
- fix to avoid reporting fatal errors twice for root and
On 2025/1/23 12:53, Sathyanarayanan Kuppuswamy wrote:
On 11/12/24 5:54 AM, Shuai Xue wrote:
The current implementation of pcie_do_recovery() assumes that the
recovery process is executed on the device that detected the error.
However, the DPC driver currently passes the error port that
On 2025/1/24 04:10, Sathyanarayanan Kuppuswamy wrote:
Hi,
On 11/12/24 5:54 AM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device unreliable. Consequently, when a fatal
:34:00.0: device [144d:a804] error status/mask=0010/00504000
nvme :34:00.0:[ 4] DLP(First)
pcieport :30:03.0: AER: broadcast slot_reset message
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 3 ++-
drivers/pci/pcie/aer.c | 11
.
Signed-off-by: Shuai Xue
Reviewed-by: Kuppuswamy Sathyanarayanan
---
drivers/pci/pcie/dpc.c | 10 +-
drivers/pci/pcie/edr.c | 34 +-
2 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index
-off-by: Shuai Xue
---
drivers/pci/pci.h | 2 +-
drivers/pci/pcie/dpc.c | 28
drivers/pci/pcie/edr.c | 7 ---
3 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 01e51db8d285..870d2fbd6ff2 100644
--- a
Link Layer, (Receiver ID)
nvme :34:00.0: device [144d:a804] error status/mask=0010/00504000
nvme :34:00.0:[ 4] DLP(First)
pcieport :30:03.0: AER: broadcast slot_reset message
Shuai Xue (3):
PCI/DPC: Clarify naming for error port in DPC Handling
PC
On 2025/2/12 05:17, Sathyanarayanan Kuppuswamy wrote:
On 2/7/25 1:34 AM, Shuai Xue wrote:
The IRQ handler is registered for the error port which receives the DPC
interrupt. Rename pdev to err_port.
No functional changes.
Signed-off-by: Shuai Xue
---
I think you can combine patch 1 & 2 into a single patch
On 2025/2/12 05:23, Sathyanarayanan Kuppuswamy wrote:
On 2/7/25 1:34 AM, Shuai Xue wrote:
The current implementation of pcie_do_recovery() assumes that the
recovery process is executed on the device that detected the error.
However, the DPC driver currently passes the error port that
On 2025/3/3 12:33, Shuai Xue wrote:
On 2025/3/3 11:43, Sathyanarayanan Kuppuswamy wrote:
On 2/16/25 6:42 PM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device
On 2025/3/17 14:02, Shuai Xue wrote:
On 2025/3/3 12:33, Shuai Xue wrote:
On 2025/3/3 11:43, Sathyanarayanan Kuppuswamy wrote:
On 2/16/25 6:42 PM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error
On 2025/3/3 11:43, Sathyanarayanan Kuppuswamy wrote:
On 2/16/25 6:42 PM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device unreliable. Consequently, when a fatal error
On 2025/3/3 11:36, Sathyanarayanan Kuppuswamy wrote:
On 2/16/25 6:42 PM, Shuai Xue wrote:
The current implementation of pcie_do_recovery() assumes that the
recovery process is executed on the device that detected the error.
However, the DPC driver currently passes the error port that
On 2025/2/17 10:42, Shuai Xue wrote:
changes since v3:
- squash patch 1 and 2 into one patch per Sathyanarayanan
- add comments note for dpc_process_error per Sathyanarayanan
- pick up Reviewed-by tag from Sathyanarayanan
changes since v2:
- moving the "err_port" rename to a separate
On 2025/6/12 18:31, Manivannan Sadhasivam wrote:
On Mon, Feb 17, 2025 at 10:42:17AM +0800, Shuai Xue wrote:
The current implementation of pcie_do_recovery() assumes that the
recovery process is executed on the device that detected the error.
s/on/for
However, the DPC driver currently passes
Hi, Breno,
On 2025/7/23 00:56, Breno Leitao wrote:
Introduce a generic infrastructure for tracking recoverable hardware
errors (HW errors that did not cause a panic) and record them for vmcore
consumption. This aids post-mortem crash analysis tools by preserving
a count and timestamp for the last occurrence
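The bookkeeping Breno describes — a per-type count plus last-occurrence timestamp preserved for vmcore tools — might look roughly like this (enum values and all names are illustrative, not the series' actual API):

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Sketch of per-type recoverable-hardware-error bookkeeping: a count
 * and the timestamp of the last occurrence, as the cover letter
 * describes preserving for vmcore consumers. */
enum hwerr_type_sketch { HWERR_MCE_SK, HWERR_AER_SK, HWERR_TYPE_MAX_SK };

struct hwerr_stat_sketch {
    uint64_t count;     /* how many times this type fired        */
    time_t   last_seen; /* when it last fired, for crash triage  */
};

static struct hwerr_stat_sketch hwerr_stats[HWERR_TYPE_MAX_SK];

static void hwerr_log_sketch(enum hwerr_type_sketch t, time_t now)
{
    hwerr_stats[t].count++;
    hwerr_stats[t].last_seen = now;
}
```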
On 2025/7/24 21:34, Breno Leitao wrote:
Hello Shuai,
On Thu, Jul 24, 2025 at 04:00:09PM +0800, Shuai Xue wrote:
On 2025/7/23 00:56, Breno Leitao wrote:
Introduce a generic infrastructure for tracking recoverable hardware
errors (HW errors that did not cause a panic) and record them for vmcore
On 2025/7/30 21:11, Breno Leitao wrote:
Hello Shuai,
On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote:
In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
CPER_SEV_RECOVERABLE errors:
Thanks. I was reading this code a bit more, and I want to make sure my
understanding
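For what it's worth, the severity filter under discussion — count corrected and recoverable CPER severities, since fatal ones panic before reaching these counters — can be sketched as below (the constants mirror the UEFI CPER severity encoding, but are renamed here to avoid implying the kernel's headers):

```c
#include <assert.h>

/* UEFI CPER severity encoding: 0 = recoverable, 1 = fatal,
 * 2 = corrected, 3 = informational. Renamed stand-ins. */
enum {
    CPER_SEV_RECOVERABLE_SK = 0,
    CPER_SEV_FATAL_SK       = 1,
    CPER_SEV_CORRECTED_SK   = 2,
    CPER_SEV_INFO_SK        = 3,
};

/* Count only severities a running kernel can survive and record:
 * fatal errors panic and never reach the vmcore counters. */
static int sev_is_counted_sketch(int sev)
{
    return sev == CPER_SEV_RECOVERABLE_SK || sev == CPER_SEV_CORRECTED_SK;
}
```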
On 2025/7/26 00:16, Breno Leitao wrote:
Hello Shuai,
On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote:
APEI does not define an error type named GHES. GHES is just a kernel
driver name. Many hardware error types can be handled in GHES (see
ghes_do_proc), for example, AER is routed by
On 2025/7/29 21:48, Breno Leitao wrote:
On Mon, Jul 28, 2025 at 09:08:25AM +0800, Shuai Xue wrote:
On 2025/7/26 00:16, Breno Leitao wrote:
On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote:
enum hwerr_error_type {
HWERR_RECOV_MCE, // maps to errors in