Re: [PATCH v4] PCI: hotplug: Add a generic RAS tracepoint for hotplug event

2025-01-08 Thread Shuai Xue
On 2025/1/9 01:59, Bjorn Helgaas wrote: On Wed, Jan 08, 2025 at 05:04:25PM +0800, Shuai Xue wrote: On 2025/1/8 07:19, Bjorn Helgaas wrote: On Sat, Nov 23, 2024 at 07:31:08PM +0800, Shuai Xue wrote: Hotplug events are critical indicators for analyzing hardware health, particularly in AI supercompute

Re: [PATCH v4] PCI: hotplug: Add a generic RAS tracepoint for hotplug event

2025-01-08 Thread Bjorn Helgaas
On Wed, Jan 08, 2025 at 05:04:25PM +0800, Shuai Xue wrote: > On 2025/1/8 07:19, Bjorn Helgaas wrote: > > On Sat, Nov 23, 2024 at 07:31:08PM +0800, Shuai Xue wrote: > > > Hotplug events are critical indicators for analyzing hardware health, > > > particularly in AI supercomputers where surprise link dow

Re: [PATCH v4] PCI: hotplug: Add a generic RAS tracepoint for hotplug event

2025-01-08 Thread Shuai Xue
On 2025/1/8 07:19, Bjorn Helgaas wrote: On Sat, Nov 23, 2024 at 07:31:08PM +0800, Shuai Xue wrote: Hotplug events are critical indicators for analyzing hardware health, particularly in AI supercomputers where surprise link downs can significantly impact system performance and reliability. The fai

Re: [PATCH v4] PCI: hotplug: Add a generic RAS tracepoint for hotplug event

2025-01-07 Thread Shuai Xue
On 2025/1/7 20:53, Lukas Wunner wrote: On Tue, Jan 07, 2025 at 07:30:28PM +0800, Shuai Xue wrote: 2024/11/23 19:31, Shuai Xue: To this end, define a new TRACING_SYSTEM named pci, add a generic RAS tracepoint for hotplug event to help healthy check, and generate tracepoints for pcie hotplug event

Re: [PATCH v4] PCI: hotplug: Add a generic RAS tracepoint for hotplug event

2025-01-07 Thread Bjorn Helgaas
On Sat, Nov 23, 2024 at 07:31:08PM +0800, Shuai Xue wrote: > Hotplug events are critical indicators for analyzing hardware health, > particularly in AI supercomputers where surprise link downs can > significantly impact system performance and reliability. The failure > characterization analysis ill

Re: [PATCH v4] PCI: hotplug: Add a generic RAS tracepoint for hotplug event

2025-01-07 Thread Lukas Wunner
On Tue, Jan 07, 2025 at 07:30:28PM +0800, Shuai Xue wrote: > 2024/11/23 19:31, Shuai Xue: > > To this end, define a new TRACING_SYSTEM named pci, add a generic RAS > > tracepoint for hotplug event to help healthy check, and generate > > tracepoints for pcie hotplug event. To monitor these tracepoin

Re: [PATCH v4] PCI: hotplug: Add a generic RAS tracepoint for hotplug event

2025-01-07 Thread Shuai Xue
On 2024/11/23 19:31, Shuai Xue wrote: Hotplug events are critical indicators for analyzing hardware health, particularly in AI supercomputers where surprise link downs can significantly impact system performance and reliability. The failure characterization analysis illustrates the significance of

[PATCH v4] PCI: hotplug: Add a generic RAS tracepoint for hotplug event

2024-11-23 Thread Shuai Xue
Hotplug events are critical indicators for analyzing hardware health, particularly in AI supercomputers where surprise link downs can significantly impact system performance and reliability. The failure characterization analysis illustrates the significance of failures caused by the Infiniband link