On 11.02.2025 09:50, Roger Pau Monné wrote:
> On Tue, Feb 11, 2025 at 07:39:12AM +0100, Jan Beulich wrote:
>> On 06.02.2025 16:06, Roger Pau Monne wrote:
>>> The following series aims to prevent local APIC errors from stalling the
>>> shutdown process. On XenServer testing we have seen reports of AMD
>>> boxes sporadically getting stuck in a spam of:
>>>
>>> APIC error on CPU0: 00(08), Receive accept error
>>>
>>> Messages during shutdown, as a result of device interrupts targeting
>>> CPUs that are offline (and have the local APIC disabled).
>>
>> One more thought here: Have you/we perhaps discovered the reason why there
>> was that 1ms delay at the end of fixup_irqs() that was badly commented,
>> and that you removed in e2bb28d62158 ("x86/irq: forward pending interrupts
>> to new destination in fixup_irqs()")? May be worth mentioning that by way
>> of a Fixes: tag.
>
> Hm, so you think the delay was added there as a way to ensure any
> pending interrupts would get drained (ie: serviced) on the old target?

So far I didn't have the slightest idea why that call had been there. This
at least gives a possible reason.

> I'm maybe a bit confused, but I don't think the delay would help much
> with preventing the local APIC errors? Regardless of the wait, if the
> interrupts target offline CPUs there's a chance receive accept errors
> will be triggered on AMD.

But fixup_irqs() right now runs ahead of actually offlining the APs.

Jan