On 2025/1/9 01:59, Bjorn Helgaas wrote:
On Wed, Jan 08, 2025 at 05:04:25PM +0800, Shuai Xue wrote:
On 2025/1/8 07:19, Bjorn Helgaas wrote:
On Sat, Nov 23, 2024 at 07:31:08PM +0800, Shuai Xue wrote:
Hotplug events are critical indicators for analyzing hardware health,
particularly in AI supercompute
On Wed, Jan 08, 2025 at 05:04:25PM +0800, Shuai Xue wrote:
> On 2025/1/8 07:19, Bjorn Helgaas wrote:
> > On Sat, Nov 23, 2024 at 07:31:08PM +0800, Shuai Xue wrote:
> > > Hotplug events are critical indicators for analyzing hardware health,
> > > particularly in AI supercomputers where surprise link dow
On 2025/1/8 07:19, Bjorn Helgaas wrote:
On Sat, Nov 23, 2024 at 07:31:08PM +0800, Shuai Xue wrote:
Hotplug events are critical indicators for analyzing hardware health,
particularly in AI supercomputers where surprise link downs can
significantly impact system performance and reliability. The fai
On 2025/1/7 20:53, Lukas Wunner wrote:
On Tue, Jan 07, 2025 at 07:30:28PM +0800, Shuai Xue wrote:
2024/11/23 19:31, Shuai Xue:
To this end, define a new TRACING_SYSTEM named pci, add a generic RAS
tracepoint for hotplug event to help healthy check, and generate
tracepoints for pcie hotplug event
On Sat, Nov 23, 2024 at 07:31:08PM +0800, Shuai Xue wrote:
> Hotplug events are critical indicators for analyzing hardware health,
> particularly in AI supercomputers where surprise link downs can
> significantly impact system performance and reliability. The failure
> characterization analysis ill
On Tue, Jan 07, 2025 at 07:30:28PM +0800, Shuai Xue wrote:
> 2024/11/23 19:31, Shuai Xue:
> > To this end, define a new TRACING_SYSTEM named pci, add a generic RAS
> > tracepoint for hotplug event to help healthy check, and generate
> > tracepoints for pcie hotplug event. To monitor these tracepoin
On 2024/11/23 19:31, Shuai Xue wrote:
Hotplug events are critical indicators for analyzing hardware health,
particularly in AI supercomputers where surprise link downs can
significantly impact system performance and reliability. The failure
characterization analysis illustrates the significance of
Hotplug events are critical indicators for analyzing hardware health,
particularly in AI supercomputers where surprise link downs can
significantly impact system performance and reliability. The failure
characterization analysis illustrates the significance of failures
caused by the Infiniband link