On 2025/3/17 14:02, Shuai Xue wrote:
On 2025/3/3 12:33, Shuai Xue wrote:
On 2025/3/3 11:43, Sathyanarayanan Kuppuswamy wrote:
On 2/16/25 6:42 PM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error
On 2025/3/3 12:33, Shuai Xue wrote:
On 2025/3/3 11:43, Sathyanarayanan Kuppuswamy wrote:
On 2/16/25 6:42 PM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device
On 2025/3/3 11:43, Sathyanarayanan Kuppuswamy wrote:
On 2/16/25 6:42 PM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device unreliable. Consequently, when a fatal error
On 2025/3/3 11:36, Sathyanarayanan Kuppuswamy wrote:
On 2/16/25 6:42 PM, Shuai Xue wrote:
The current implementation of pcie_do_recovery() assumes that the
recovery process is executed on the device that detected the error.
However, the DPC driver currently passes the error port that
On 2025/2/17 10:42, Shuai Xue wrote:
changes since v3:
- squash patch 1 and 2 into one patch per Sathyanarayanan
- add a comment for dpc_process_error per Sathyanarayanan
- pick up Reviewed-by tag from Sathyanarayanan
changes since v2:
- moving the "err_port" rename to a separate
-off-by: Shuai Xue
---
drivers/pci/pci.h | 2 +-
drivers/pci/pcie/dpc.c | 28
drivers/pci/pcie/edr.c | 7 ---
3 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 01e51db8d285..870d2fbd6ff2 100644
--- a
Link Layer, (Receiver ID)
nvme :34:00.0: device [144d:a804] error status/mask=0010/00504000
nvme :34:00.0:[ 4] DLP(First)
pcieport :30:03.0: AER: broadcast slot_reset message
Shuai Xue (3):
PCI/DPC: Clarify naming for error port in DPC Handling
PC
:34:00.0: device [144d:a804] error status/mask=0010/00504000
nvme :34:00.0:[ 4] DLP(First)
pcieport :30:03.0: AER: broadcast slot_reset message
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 3 ++-
drivers/pci/pcie/aer.c | 11
.
Signed-off-by: Shuai Xue
Reviewed-by: Kuppuswamy Sathyanarayanan
---
drivers/pci/pcie/dpc.c | 10 +-
drivers/pci/pcie/edr.c | 34 +-
2 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index
On 2025/2/12 05:23, Sathyanarayanan Kuppuswamy wrote:
On 2/7/25 1:34 AM, Shuai Xue wrote:
The current implementation of pcie_do_recovery() assumes that the
recovery process is executed on the device that detected the error.
However, the DPC driver currently passes the error port that
On 2025/2/12 05:17, Sathyanarayanan Kuppuswamy wrote:
On 2/7/25 1:34 AM, Shuai Xue wrote:
The IRQ handler is registered for the error port, which receives the DPC
interrupt. Rename pdev to err_port.
No functional changes.
Signed-off-by: Shuai Xue
---
I think you can combine patch 1 & 2 into a si
The IRQ handler is registered for the error port, which receives the DPC
interrupt. Rename pdev to err_port.
No functional changes.
Signed-off-by: Shuai Xue
---
drivers/pci/pcie/dpc.c | 10 +-
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie
:03.0: AER: broadcast slot_reset message
Shuai Xue (4):
PCI/DPC: Rename pdev to err_port for dpc_handler
PCI/EDR: Rename edev to err_port for edr_handle_event
PCI/DPC: Run recovery on device that detected the error
PCI/AER: Report fatal errors of RCiEP and EP if link recovered
drivers/pci/pci.h
:34:00.0: device [144d:a804] error status/mask=0010/00504000
nvme :34:00.0:[ 4] DLP(First)
pcieport :30:03.0: AER: broadcast slot_reset message
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 3 ++-
drivers/pci/pcie/aer.c | 11
device that detected the
error. By passing this error device to pcie_do_recovery(), subsequent
patches will be able to accurately access AER status of the error device.
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 2 +-
drivers/pci/pcie/dpc.c | 25 +
drivers/pci
acpi_dpc_port_get() locates the port that experienced the containment
event. Rename edev to err_port for clarity so that a later patch can avoid
misusing err_port in pcie_do_recovery().
No functional changes.
Suggested-by: Sathyanarayanan Kuppuswamy
Signed-off-by: Shuai Xue
---
drivers/pci/pcie
On 2025/1/24 04:10, Sathyanarayanan Kuppuswamy wrote:
Hi,
On 11/12/24 5:54 AM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device unreliable. Consequently, when a fatal
On 2025/1/23 12:53, Sathyanarayanan Kuppuswamy wrote:
On 11/12/24 5:54 AM, Shuai Xue wrote:
The current implementation of pcie_do_recovery() assumes that the
recovery process is executed on the device that detected the error.
However, the DPC driver currently passes the error port that
Hi, all,
Gentle ping.
Best Regards,
Shuai
On 2024/12/24 19:03, Shuai Xue wrote:
On 2024/11/12 21:54, Shuai Xue wrote:
changes since v1:
- rewrite commit log per Bjorn
- refactor aer_get_device_error_info to reduce duplication per Keith
- fix to avoid reporting fatal errors twice for root and
On 2024/11/12 21:54, Shuai Xue wrote:
changes since v1:
- rewrite commit log per Bjorn
- refactor aer_get_device_error_info to reduce duplication per Keith
- fix to avoid reporting fatal errors twice for root and downstream ports per
Keith
The AER driver has historically avoided reading the
On 2024/11/17 21:36, Shuai Xue wrote:
On 2024/11/16 20:44, Shuai Xue wrote:
On 2024/11/16 04:20, Bowman, Terry wrote:
Hi Shuai,
On 11/12/2024 7:54 AM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error
On 2024/11/16 20:44, Shuai Xue wrote:
On 2024/11/16 04:20, Bowman, Terry wrote:
Hi Shuai,
On 11/12/2024 7:54 AM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device
On 2024/11/16 04:20, Bowman, Terry wrote:
Hi Shuai,
On 11/12/2024 7:54 AM, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device unreliable. Consequently, when a fatal
On 2024/11/15 17:06, Lukas Wunner wrote:
On Tue, Nov 12, 2024 at 09:54:19PM +0800, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device unreliable.
It would be good if
device that detected the
error. By passing this error device to pcie_do_recovery(), subsequent
patches will be able to accurately access AER status of the error device.
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 2 +-
drivers/pci/pcie/dpc.c | 30 --
drivers
:34:00.0: device [144d:a804] error status/mask=0010/00504000
nvme :34:00.0:[ 4] DLP(First)
pcieport :30:03.0: AER: broadcast slot_reset message
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 3 ++-
drivers/pci/pcie/aer.c | 11
, (Receiver ID)
nvme :34:00.0: device [144d:a804] error status/mask=0010/00504000
nvme :34:00.0:[ 4] DLP(First)
pcieport :30:03.0: AER: broadcast slot_reset message
Shuai Xue (2):
PCI/DPC: Run recovery on device that detected the error
PCI/AER
On 2024/11/7 00:39, Keith Busch wrote:
On Wed, Nov 06, 2024 at 05:03:39PM +0800, Shuai Xue wrote:
+int aer_get_device_fatal_error_info(struct pci_dev *dev, struct aer_err_info *info)
+{
+ int type = pci_pcie_type(dev);
+ int aer = dev->aer_cap;
+ u32 aercc;
+
+ pci_i
On 2024/11/7 00:02, Bjorn Helgaas wrote:
On Wed, Nov 06, 2024 at 05:03:39PM +0800, Shuai Xue wrote:
The AER driver has historically avoided reading the configuration space of an
endpoint or RCiEP that reported a fatal error, considering the link to that
device unreliable. Consequently, when a
[144d:a804] error status/mask=0010/00504000
[ 414.815305] nvme :34:00.0:[ 4] DLP(First)
[ 414.821768] pcieport :30:03.0: AER: broadcast slot_reset message
Shuai Xue (2):
PCI/AER: run recovery on device that detected the error
PCI/AER: report fatal errors of
)
[ 414.821768] pcieport :30:03.0: AER: broadcast slot_reset message
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 1 +
drivers/pci/pcie/aer.c | 50 ++
drivers/pci/pcie/err.c | 6 +
3 files changed, 57 insertions(+)
diff --git a/drivers/pci
device that detected the
error. By passing this error-detecting device to pcie_do_recovery(), subsequent
patches will be able to accurately access the AER error status.
Signed-off-by: Shuai Xue
---
drivers/pci/pci.h | 2 +-
drivers/pci/pcie/dpc.c | 30 --
drivers
On 2023/9/27 07:02, Bjorn Helgaas wrote:
> On Fri, Sep 22, 2023 at 10:46:36AM +0800, Shuai Xue wrote:
>> ...
>
>> Actually, this is a question from a colleague on the firmware team.
>> The original question is that:
>>
>> "Should I set CPER_SEV_FA
On 2023/9/21 21:20, David Laight wrote:
> ...
> I've got a target to generate AER errors by generating read cycles
> that are inside the address range that the bridge forwards but
> outside of any BAR because there are 2 different sized BARs.
> (Pretty easy to set up.)
> On the system I was using
+ @Rafael for the APEI/GHES part.
On 2023/9/22 05:52, Bjorn Helgaas wrote:
> On Thu, Sep 21, 2023 at 08:10:19PM +0800, Shuai Xue wrote:
>> On 2023/9/21 07:02, Bjorn Helgaas wrote:
>>> On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote:
>> ...
>
>>
On 2023/9/21 07:02, Bjorn Helgaas wrote:
> On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote:
>> Hi, all folks,
>>
>> Error reporting and recovery are among the important features of PCIe, and
>> the kernel has been supporting them since version 2.6, 17 year
Hi, all folks,
Error reporting and recovery are among the important features of PCIe, and
the kernel has been supporting them since version 2.6, 17 years ago.
I am very curious about the expected behavior of the software.
I first recap the error classification and then list my questions below it
On 2022/10/23 11:52 PM, Shuai Xue wrote:
>
>
> On 2022/10/22 4:01 AM, Tony Luck wrote:
>> Part 1 deals with the process that triggered the copy on write
>> fault with a store to a shared read-only page. That process is
>> sent a SIGBUS with the usual machine check decoratio
th the scope.
>
> Part 2 sets up to asynchronously take the page with the uncorrected
> error offline to prevent additional machine check faults. H/t to
> Miaohe Lin and Shuai Xue
> for pointing me to the existing function to queue a call to
> memory_failure().
>
> On x
On 2022/10/22 12:30 AM, Luck, Tony wrote:
>>> But maybe it is some RMW instruction ... then, if all the above options
>>> didn't happen ... we could get another machine check from the same
>>> address. But then we just follow the usual recovery path.
>
>
>> Let's assume the instruction that
On 2022/10/21 12:41 PM, Luck, Tony wrote:
>>> When we do return to user mode the task is going to be busy servicing
>>> a SIGBUS ... so shouldn't try to touch the poison page before the
>>> memory_failure() called by the worker thread cleans things up.
>>
>> What about an RT process on a busy system?
On 2022/10/21 12:08 PM, Tony Luck wrote:
> On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote:
>>
>>
>> On 2022/10/21 4:05 AM, Tony Luck wrote:
>>> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>>>
>>>>
>>>> On 20
On 2022/10/21 4:05 AM, Tony Luck wrote:
> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>
>>
>> On 2022/10/20 1:08 AM, Tony Luck wrote:
>>> If the kernel is copying a page as the result of a copy-on-write
>>> fault and runs into an uncorrectable erro
On 2022/10/20 1:08 AM, Tony Luck wrote:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
>
> It is easy to set up a test c