On 11.02.2025 09:50, Roger Pau Monné wrote:
> On Tue, Feb 11, 2025 at 07:39:12AM +0100, Jan Beulich wrote:
>> On 06.02.2025 16:06, Roger Pau Monne wrote:
>>> The following series aims to prevent local APIC errors from stalling the
>>> shutdown process.  On XenServer testing we have seen reports of AMD
>>> boxes sporadically getting stuck in a spam of:
>>>
>>> APIC error on CPU0: 00(08), Receive accept error
>>>
>>> Messages during shutdown, as a result of device interrupts targeting
>>> CPUs that are offline (and have the local APIC disabled).
>>
>> One more thought here: Have you/we perhaps discovered the reason why there
>> was that 1ms delay at the end of fixup_irqs() that was badly commented,
>> and that you removed in e2bb28d62158 ("x86/irq: forward pending interrupts
>> to new destination in fixup_irqs()")? May be worth mentioning that by way
>> of a Fixes: tag.
> 
> Hm, so you think the delay was added there as a way to ensure any
> pending interrupts would get drained (ie: serviced) on the old target?

So far I didn't have the slightest idea why that call had been there. This
at least gives a possible reason.

> I'm maybe a bit confused, but I don't think the delay would help much
> with preventing the local APIC errors?  Regardless of the wait, if the
> interrupts target offline CPUs there's a chance receive accept errors
> will be triggered on AMD.

But fixup_irqs() right now runs ahead of actually offlining the APs.

Jan
