On 19.06.2024 09:05, Roger Pau Monné wrote:
> On Tue, Jun 18, 2024 at 06:30:22PM +0200, Jan Beulich wrote:
>> On 18.06.2024 16:50, Roger Pau Monné wrote:
>>> On Tue, Jun 18, 2024 at 04:34:50PM +0200, Jan Beulich wrote:
>>>> On 18.06.2024 13:30, Roger Pau Monné wrote:
>>>>> On Mon, Jun 17, 2024 at 03:41:12PM +0200, Jan Beulich wrote:
>>>>>> On 13.06.2024 18:56, Roger Pau Monne wrote:
>>>>>>> @@ -2686,11 +2705,27 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
>>>>>>>          if ( desc->handler->disable )
>>>>>>>              desc->handler->disable(desc);
>>>>>>>  
>>>>>>> +        /*
>>>>>>> +         * If the current CPU is going offline and is (one of) the target(s) of
>>>>>>> +         * the interrupt, signal to check whether there are any pending vectors
>>>>>>> +         * to be handled in the local APIC after the interrupt has been moved.
>>>>>>> +         */
>>>>>>> +        if ( !cpu_online(cpu) && cpumask_test_cpu(cpu, desc->arch.cpu_mask) )
>>>>>>> +            check_irr = true;
>>>>>>> +
>>>>>>>          if ( desc->handler->set_affinity )
>>>>>>>              desc->handler->set_affinity(desc, affinity);
>>>>>>>          else if ( !(warned++) )
>>>>>>>              set_affinity = false;
>>>>>>>  
>>>>>>> +        if ( check_irr && apic_irr_read(vector) )
>>>>>>> +            /*
>>>>>>> +             * Forward pending interrupt to the new destination, this CPU is
>>>>>>> +             * going offline and otherwise the interrupt would be lost.
>>>>>>> +             */
>>>>>>> +            send_IPI_mask(cpumask_of(cpumask_any(desc->arch.cpu_mask)),
>>>>>>> +                          desc->arch.vector);
>>>>>>
>>>>>> Hmm, IRR may become set right after the IRR read (unlike in the
>>>>>> other cases, where new IRQs ought to be surfacing only at the new
>>>>>> destination). Doesn't this want moving ...
>>>>>>
>>>>>>>          if ( desc->handler->enable )
>>>>>>>              desc->handler->enable(desc);
>>>>>>
>>>>>> ... past the actual affinity change?
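[ To illustrate the suggested move - a minimal, untested sketch reusing
the names from the quoted hunk, with "vector" still holding the pre-move
vector as in the patch:

        if ( desc->handler->enable )
            desc->handler->enable(desc);

        /*
         * Read IRR only now: the vector/destination change has been
         * committed, so the old vector's IRR bit can no longer become
         * newly set on this CPU.
         */
        if ( check_irr && apic_irr_read(vector) )
            send_IPI_mask(cpumask_of(cpumask_any(desc->arch.cpu_mask)),
                          desc->arch.vector);
]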
>>>>>
>>>>> Hm, but the ->enable() hook just unmasks the interrupt; the actual
>>>>> affinity change is done in ->set_affinity(). Hence after the call to
>>>>> ->set_affinity() no further interrupts should be delivered to the CPU,
>>>>> regardless of whether the source is masked?
>>>>>
>>>>> Or is it possible for the device/interrupt controller not to switch to
>>>>> the new destination until the interrupt is unmasked, and hence to still
>>>>> have pending masked interrupts using the old destination? IIRC, for
>>>>> MSI-X the device is required to update the destination once the entry
>>>>> is unmasked.
>>>>
>>>> That's all not relevant here, I think. IRR gets set when an interrupt is
>>>> signaled, no matter whether it's masked.
>>>
>>> I'm kind of lost here: what does "signaled" mean in this context?
>>>
>>> I would expect the interrupt vector not to get set in IRR if the MSI-X
>>> entry is masked, as at that point the state of the address/data fields
>>> might not be consistent (that's the whole point of masking it, right?).
>>>
>>>> It's only the handling of the
>>>> interrupt which the masking would prevent, i.e. the "moving" of the set
>>>> bit from IRR to ISR.
>>>
>>> My understanding was that the masking would prevent the message write to
>>> the APIC from happening, and hence no vector should get set in IRR.
>>
>> Hmm, yes, looks like I was confused. The masking is at the source side
>> (IO-APIC RTE, MSI-X entry, or - if supported - in the MSI capability).
>> So the sole case to worry about is MSI without mask-bit support then.
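[ For reference, both source-side mask locations follow from the PCI
spec; the struct below is purely illustrative, not Xen's representation:

    /* One MSI-X table entry: 16 bytes per vector. */
    struct msix_entry_layout {
        uint32_t addr_lo;      /* Message Address */
        uint32_t addr_hi;      /* Message Upper Address */
        uint32_t data;         /* Message Data */
        uint32_t vector_ctrl;  /* bit 0: per-vector Mask */
    };

  In the MSI capability, per-vector masking is optional: the Mask Bits
  register only exists when the Per-Vector Masking Capable bit in the
  Message Control register is set. ]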
> 
> Yeah, and for MSI without masking-bit support we don't care whether the
> IRR check is done before or after the ->enable() hook, as that hook is a
> no-op in that case.  The write to the MSI address/data fields has already
> been done, and hence the issue would exclusively be with draining any
> in-flight writes to the APIC doorbell (what you mention below).

Except that both here ...

>>>> Plus we run with IRQs off here anyway if I'm not mistaken, so no
>>>> interrupt can be delivered to the local CPU. IOW whatever IRR bits it
>>>> has set (including ones becoming set between the IRR read and the actual
>>>> vector change), those would never be serviced. Hence the reading of the
>>>> bit ought to occur after the vector change: It's only then that we know
>>>> the IRR bit corresponding to the old vector can't become set anymore.
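[ Spelled out - an illustration of the ordering being argued against,
i.e. with the IRR read ahead of the vector change:

        /* IRQs are off on this (parting) CPU throughout. */
        if ( check_irr && apic_irr_read(vector) )  /* reads the bit as clear */
            ...;
        /* Device raises the IRQ here: old vector's IRR bit becomes set. */
        desc->handler->set_affinity(desc, affinity);  /* vector changes */
        /* The set bit is never serviced (IRQs off, CPU going away) and
           was never forwarded either - the interrupt is lost. */
]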
>>>
>>> Right, and the vector change happens in ->set_affinity(), not
>>> ->enable().  See for example set_msi_affinity() and the
>>> write_msi_msg(), that's where the vector gets changed.
>>>
>>>> And even then we're assuming that no interrupt signals might still be
>>>> "on their way" from the IO-APIC or a posted MSI message write by a
>>>> device to the LAPIC (I have no idea how to properly fence that, or
>>>> whether there are guarantees for this to never occur).
>>>
>>> Yeah, those I expect would be completed in the window between the
>>> write of the new vector/destination and the reading of IRR.
>>
>> Except we have no idea on the latencies.
> 
> There isn't much else we can do, is there?  Even the current case, where
> we add the 1ms window at the end of the shuffling, could still suffer
> from this issue, because we don't know the latencies.  IOW: I don't
> think this is any worse than the current approach.
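[ For reference, the 1ms window in question is the pre-existing tail of
fixup_irqs(); quoting from memory, so take the exact shape with a grain
of salt:

        /* That doesn't seem sufficient.  Give it 1ms. */
        local_irq_enable();
        mdelay(1);
        local_irq_disable();
]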

... and here, the later we read IRR, the better the chances that we don't
miss anything. Even the no-op ->enable() isn't a no-op execution-wise: in
fact it (quite pointlessly[1]) is an indirect call to irq_enable_none().
I'm actually inclined to suggest that we try to delay the IRR read even
further, certainly past the cpumask_copy(), maybe even past the
spin_unlock() (latching CPU and vector into local variables, along with
the latching of ->affinity that's already there).
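[ Roughly what such further delaying could look like - a sketch only,
with "dest" and "new_vector" being made-up locals, latched while still
holding the lock, and "vector" again the pre-move vector from the hunk:

        unsigned int dest = cpumask_any(desc->arch.cpu_mask);
        unsigned int new_vector = desc->arch.vector;

        cpumask_copy(desc->affinity, affinity);
        spin_unlock(&desc->lock);

        /* IRR read as late as possible, using only the latched values. */
        if ( check_irr && apic_irr_read(vector) )
            send_IPI_mask(cpumask_of(dest), new_vector);
]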

Jan

[1] While back when that was written the main goal probably was to avoid
conditionals on what may be deemed fast paths, I wonder whether nowadays
the main goal wouldn't be to avoid indirect calls when we (pretty) easily
can.
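[ Concretely - hypothetical, not something done today - the no-op stub
could give way to a NULL hook, which the guards already present at the
call sites (see the quoted hunk) would then skip, avoiding the indirect
call altogether:

    /* In the respective hw_irq_controller instance: */
    .enable = NULL,    /* was irq_enable_none */
]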
