On 03/02/2025 9:09 am, Jan Beulich wrote:
> On 02.02.2025 15:46, Andrew Cooper wrote:
>> On 30/01/2025 11:12 am, Jan Beulich wrote:
>>> In order for amd_iommu_detect_one_acpi()'s call to pci_ro_device() to
>>> have permanent effect, pci_segments_init() needs to be called ahead of
>>> making it there. Without this we're losing segment 0's r/o map, and thus
>>> we're losing write-protection of the PCI devices representing IOMMUs.
>>> Which in turn means that half-way recent Linux Dom0 will, as it boots,
>>> turn off MSI on these devices, thus preventing any IOMMU events (faults
>>> in particular) from being reported on pre-x2APIC hardware.
>>>
>>> As the acpi_iommu_init() invocation was moved ahead of
>>> acpi_mmcfg_init()'s by the offending commit, move the call to
>>> pci_segments_init() accordingly.
>>>
>>> Fixes: 3950f2485bbc ("x86/x2APIC: defer probe until after IOMMU ACPI table 
>>> parsing")
>>> Signed-off-by: Jan Beulich <jbeul...@suse.com>
>>> ---
>>> Of course it would have been quite a bit easier to notice this issue if
>>> radix_tree_insert() wouldn't work fine ahead of radix_tree_init() being
>>> invoked for a given radix tree, when the index inserted at is 0.
>>>
>>> While hunting down various other dead paths to actually find the root
>>> cause, it occurred to me that it's probably not a good idea to fully
>>> disallow config space writes for r/o devices: Dom0 won't be able to size
>>> their BARs (luckily the IOMMU "devices" don't have any, but e.g. serial
>>> ones generally will have at least one), for example. Without being able
>>> to size BARs it also will likely be unable to correctly account for the
>>> address space taken by these BARs. However, outside of vPCI it's not
>>> really clear to me how we could reasonably emulate such BAR sizing
>>> writes - we can't, after all, allow Dom0 to actually write to the
>>> underlying physical registers, yet we don't intercept reads (i.e. we
>>> can't mimic expected behavior then).
>>>
>>> --- a/xen/arch/x86/x86_64/mmconfig-shared.c
>>> +++ b/xen/arch/x86/x86_64/mmconfig-shared.c
>>> @@ -402,8 +402,6 @@ void __init acpi_mmcfg_init(void)
>>>  {
>>>      bool valid = true;
>>>  
>>> -    pci_segments_init();
>>> -
>>>      /* MMCONFIG disabled */
>>>      if ((pci_probe & PCI_PROBE_MMCONF) == 0)
>>>          return;
>>> --- a/xen/drivers/passthrough/x86/iommu.c
>>> +++ b/xen/drivers/passthrough/x86/iommu.c
>>> @@ -55,6 +55,8 @@ void __init acpi_iommu_init(void)
>>>  {
>>>      int ret = -ENODEV;
>>>  
>>> +    pci_segments_init();
>>> +
>>>      if ( !iommu_enable && !iommu_intremap )
>>>          return;
>>>  
>>>
>> I can't help but feel this is taking a bad problem and not making it any
>> better.
>>
>> pci_segments_init() is even less (obviously) relevant in
>> apci_iommu_init() than it is in acpi_mmcfg_init(), and given the
>> fine-grain Kconfig-ing going on, is only one small step from
>> accidentally being compiled out.
> The alternative I did consider was to move the call into __start_xen()
> itself. Anything going beyond that looks more intrusive than we'd like
> it at this point of the release cycle.

Moving into __start_xen() would be ok if we think we're getting too
close to the release.  It makes it clearer that there is explicit
ordering necessary.

>
>> ARM is in a bad state too, with this initialisation even being behind
>> the PCI Passthrough cmdline option.
>>
>> IMO there are two problems here; one as you pointed out
>> (radix_tree_insert() doesn't fail), and that PCI handling requires
>> explicit initialisation to begin with.
>>
>> Looking through radix tree, it wouldn't be hard to create a
>> RADIX_TREE_INIT macro to allow initialisation at compile time for
>> suitable objects (pci_segments and acpi_ivrs currently).
>>
>> That involves exporting rcu_node_{alloc,free}(), although the last
>> caller of radix_tree_set_alloc_callbacks() was dropped when TMEM went,
>> so we could reasonably remove that infrastructure too, at which point
>> radix_tree_init() is a simple zero of the structure.
> Yes, seeing that this was even an extension of ours (i.e. Linux doesn't
> have such), it's certainly worth getting rid of. If nothing else, then
> for the two cf_check annotations that's we'd then be able to drop. I'll
> make a patch.

Oh, even better. 

>
>> Dealing with alloc_pseg(0) is harder.  As we never free the PCI
>> segments, we could just opencode the radix_tree_root of height=1 with a
>> static pseg0 structure, and that would drop the need for
>> pci_segemnts_init() completely.
> I'm afraid this would end up being too much open-coding for my taste.

I didn't much like the suggestion either.

> I'd put this differently: Unlike the radix tree initialization, the
> setting up of segment 0 isn't a prereq to acpi_iommu_init(). We could
> keep acpi_mmcfg_init() doing that, by way of calling pci_add_segment(0)
> (and that would simply be a no-op if acpi_iommu_init() ended up
> introducing segment 0 already).

That might be ok.

~Andrew

Reply via email to