On 2/25/22 05:22, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 08:34:40PM +0000, Joao Martins wrote:
>> On 2/24/22 20:12, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:
>>>> On 2/24/22 19:54, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
>>>>>> On 2/24/22 18:30, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
>>>>>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
>>>>>>>>>> On 2/23/22 23:35, Joao Martins wrote:
>>>>>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>>>>>>>> +                                          uint64_t pci_hole64_size)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>>>>>>> +    uint32_t eax, vendor[3];
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>>>>>>>> +        return;
>>>>>>>>>>>>> +    }
>>>>>>>>>>>>
>>>>>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
>>>>>>>>>>>> It's really about what we present to the guest though, isn't it?
>>>>>>>>>>>
>>>>>>>>>>> It was the easier catch-all to use cpuid without going into
>>>>>>>>>>> Linux UAPI specifics. But it doesn't have to tie in there; it is only
>>>>>>>>>>> relevant for systems with an IOMMU present.
>>>>>>>>>>>
>>>>>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
>>>>>>>>>>>>
>>>>>>>>>>> I think so, I can add that. Something like an amd_iommu_exists() helper
>>>>>>>>>>> in util/vfio-helpers.c which checks whether any sysfs child entries
>>>>>>>>>>> under /sys/class/iommu/ start with ivhd. Given that this HT region has
>>>>>>>>>>> been hardcoded in the IOMMU reserved regions since >= 4.11 (up to the
>>>>>>>>>>> latest), I don't think it's even worth checking that the range exists in:
>>>>>>>>>>>
>>>>>>>>>>> 	/sys/kernel/iommu_groups/0/reserved_regions
>>>>>>>>>>>
>>>>>>>>>>> (Also, that sysfs ABI is >= 4.11 only.)
>>>>>>>>>>
>>>>>>>>>> Here's what I have staged in my local tree to address your comment.
>>>>>>>>>>
>>>>>>>>>> Naturally the first chunk is what's affected by this patch; the rest is a
>>>>>>>>>> predecessor patch to introduce qemu_amd_iommu_is_present(). It seems to
>>>>>>>>>> pass all the tests and whatnot.
>>>>>>>>>>
>>>>>>>>>> I am not entirely sure this is the right place to put such a helper; open
>>>>>>>>>> to suggestions. As for the naming of the helper, I tried to follow the
>>>>>>>>>> rest of the file's style.
>>>>>>>>>>
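(For illustration only: a minimal sketch of what a sysfs-based check along the
lines described above could look like. The actual staged diff is snipped just
below, so the function name qemu_amd_iommu_is_present(), the use of GLib's
g_dir_*() helpers, and where it would live are assumptions here, not the real
patch.)

    #include <glib.h>
    #include <stdbool.h>

    /*
     * Sketch: report whether the host exposes an AMD IOMMU by looking for
     * ivhd* children under /sys/class/iommu/ (AMD IOMMU units show up
     * there as ivhd<N>).
     */
    static bool qemu_amd_iommu_is_present(void)
    {
        GDir *dir = g_dir_open("/sys/class/iommu", 0, NULL);
        const char *name;
        bool found = false;

        if (!dir) {
            return false;       /* no IOMMU exposed by the host at all */
        }
        while ((name = g_dir_read_name(dir)) != NULL) {
            if (g_str_has_prefix(name, "ivhd")) {
                found = true;
                break;
            }
        }
        g_dir_close(dir);
        return found;
    }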
[snip]

>>>>>>>>>
>>>>>>>>> why are we checking whether an AMD IOMMU is present
>>>>>>>>> on the host? Isn't what we care about whether it's
>>>>>>>>> present in the VM? What we are reserving after all
>>>>>>>>> is a range of GPAs, not actual host PAs ...
>>>>>>>>>
>>>>>>>> Right.
>>>>>>>>
>>>>>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
>>>>>>>> so we need to avoid mapping that portion of the address space entirely
>>>>>>>> and use some other portion of the GPA space. This has to do with the
>>>>>>>> host IOMMU, which is where the DMA mapping of guest PAs takes place.
>>>>>>>> So if you do not have a host IOMMU, you can't service guest DMA/PCIe
>>>>>>>> via VFIO through the host IOMMU, and therefore you don't have this
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> Whether one uses a guest IOMMU or not does not affect the result;
>>>>>>>> that would be more a case of mimicking real hardware, not fixing
>>>>>>>> the issue this series addresses.
>>>>>>>
>>>>>>>
>>>>>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
>>>>>>> And ideally move the logic reporting reserved ranges there too.
>>>>>>> However, I think vdpa has the same issue too.
>>>>>>> CC Jason for an opinion.
>>>>>>
>>>>>> It does have the same problem.
>>>>>>
>>>>>> This is not specific to VFIO; it applies to anything that uses the IOMMU.
>>>>>
>>>>> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
>>>>> that we don't have a global "enable-vfio" flag since vfio devices
>>>>> can be hot-plugged. I think this is not the first time we wanted
>>>>> something like this, right Alex?
>>>>>
>>>>>> It's just that VFIO doesn't let you shoot yourself in the foot :)
>>>>>>
>>>>>> vDPA would need to validate iova ranges as well to prevent mapping on
>>>>>> reserved IOVAs, similar to commit 9b77e5c7984 ("vfio/type1: check dma
>>>>>> map request is within a valid iova range"). Now you could argue that
>>>>>> it's an IOMMU core problem, maybe.
>>>>>>
>>>>>> But I see this as a separate problem,
>>>>>> because regardless of said validation, your guest provisioning
>>>>>> is still going to fail for guests with >= 1010G of max GPA, and even if
>>>>>> it doesn't fail you shouldn't be letting it DMA map those in the first
>>>>>> place (e.g. you could see spurious INVALID_DEVICE_REQUEST fault events
>>>>>> from the IOMMU if DMA is attempted over the first megabyte of that 1T
>>>>>> hole).
>>>>>
>>>>> I wonder what's the status on a system without an IOMMU.
>>>>>
>>>> In theory you should be OK. Also, it's worth keeping in mind that AMD
>>>> machines with >= 1T have this 12G hole marked as Reserved, so even DMA,
>>>> at least when the CPU is the initiator, should be fine in the worst-case
>>>> scenario.
>>>
>>> Not sure I understand this last sentence.
>>>
>> I was trying to say that the E820 from hardware is marked as Reserved and
>> any DMA from/to endpoints from kernel-supplied CPU addresses will use those
>> reserved ranges.
>
> meaning "will *not* use" I guess?
>
Yes, correct. Sorry, I missed a word there. That has been happening to me
quite a lot these days :(

>>>>>>> Also, some concerns
>>>>>>> - I think this changes memory layout for working VMs not using VFIO.
>>>>>>>   Better preserve the old layout for old machine types?
>>>>>>>
>>>>>> Oh definitely, and I do that in this series. See the last commit; in
>>>>>> the past it was also a machine property exposed to userspace.
>>>>>> Otherwise I would be breaking (badly) compat/migration.
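(To make the thresholds in the next paragraph concrete: on AMD the
HyperTransport reserved window sits just below 1 TiB, so any layout whose
above-4G RAM plus 64-bit PCI hole would reach it has to jump over the window.
A rough, illustrative sketch of that kind of check; the constants and names
here are mine, not the series' actual code:)

    #include <stdbool.h>
    #include <stdint.h>

    /* AMD HyperTransport reserved window: [1012 GiB, 1 TiB) */
    #define AMD_HT_START   0xFD00000000ULL    /* FD_0000_0000h             */
    #define AMD_ABOVE_1TB  0x10000000000ULL   /* 1 TiB, first GPA past it  */

    /*
     * Sketch: does the guest's above-4G region (RAM plus the 64-bit PCI
     * hole) run into the HT window?  If it does, RAM above 4G has to be
     * relocated so nothing DMA-mappable lands inside the window.
     */
    static bool crosses_amd_ht_window(uint64_t above_4g_start,
                                      uint64_t above_4g_size,
                                      uint64_t pci_hole64_size)
    {
        return above_4g_start + above_4g_size + pci_hole64_size > AMD_HT_START;
    }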
>>>>>>
>>>>>> I would like to emphasize that these memory layout changes are
>>>>>> *exclusive* to, and *only* for, hosts on AMD with guest memory bigger
>>>>>> than ~950G-~1010G. It's not every guest, but a fairly limited subset
>>>>>> of big-memory guests that are not the norm.
>>>>>>
>>>>>> Albeit the phys-bits property errors might give us a bit more trouble,
>>>>>> so it might help to be more conservative.
>>>>>>
>>>>>>> - You mention the phys bits issue very briefly, and it's pretty
>>>>>>>   concerning. Do we maybe want to also disable the workaround if phys
>>>>>>>   bits is too small?
>>>>>>
>>>>>> We are doing that here (well, in v4), as suggested by Igor. Note that in
>>>>>> this series it's a warning, but in v4 I have it as an error, together
>>>>>> with the 32-bit issues that I found through qtest.
>>>>>>
>>>>>> I share the same concern as you over making this an error, because of
>>>>>> compatibility. Perhaps we could go back to the previous version, turn it
>>>>>> back into a warning, and additionally even disable the relocation
>>>>>> altogether. Or, if desired, even depend on a machine property to enable
>>>>>> it.
>>>>>
>>>>> I would be inclined to disable the relocation. And maybe block vfio? I'm
>>>>> not 100% sure about that last one, but this can be a vfio decision to
>>>>> make.
>>>>>
>>>> I don't see this as a VFIO problem (VFIO is actually being a good citizen
>>>> and doing the right thing :)). From my perspective this fix is also useful
>>>> to vDPA (which we also care about), and thus will also let us avoid DMA
>>>> mapping in these GPA ranges.
>>>>
>>>> The reason I mention VFIO in the cover letter is that it's one visible
>>>> UAPI where users will notice that suddenly their 1TB+ VMs break.
>>>
>>> What I mean is that most VMs don't use either VFIO or VDPA.
>>> If we had VFIO/VDPA as an accelerator that needs to be listed
>>> upfront when qemu is started, we could solve this with
>>> a bit less risk by not changing anything for VMs without these two.
>>>
>> That wouldn't work for the vfio/vdpa hotplug case (which we do use), and
>> it's part of the reason I chose to always avoid the 1TB hole. VFIO even
>> tells you what the allowed IOVA ranges are when you create the container.
>>
>> The machine property, though, could communicate /the intent/ to add
>> any VFIO or vDPA devices in the future and maybe cover for that. That
>> lets one avoid any 'accelerator-only' problems and restricts the issues
>> of this series to VMs with VFIO/VDPA, if the risk is a concern.
>
> Well, Alex nacked that idea (and I certainly trust him where VFIO is
> concerned), so I guess for now we'll just live with the risk.
>
My reading was that he nacked a 'VFIO-only' global property, not necessarily
the machine property for valid-iova. Hmm, I might be misreading it, as at the
end of the day the result will lead to the same thing.
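(As an aside, on "VFIO even tells you what the allowed IOVA ranges are when
you create the container": with a type1 container those ranges are reported
through VFIO_IOMMU_GET_INFO's capability chain. A hedged sketch of reading
them follows; error handling is trimmed, and dump_valid_iova_ranges() and its
container_fd parameter are illustrative names for an already-configured
container:)

    #include <linux/vfio.h>
    #include <sys/ioctl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch: print the valid IOVA ranges a type1 VFIO container reports. */
    static void dump_valid_iova_ranges(int container_fd)
    {
        struct vfio_iommu_type1_info *info;
        uint32_t argsz = sizeof(*info);
        uint32_t off;

        info = calloc(1, argsz);
        info->argsz = argsz;
        /* First call only tells us how large the info plus caps really is. */
        ioctl(container_fd, VFIO_IOMMU_GET_INFO, info);
        if (info->argsz > argsz) {
            argsz = info->argsz;
            info = realloc(info, argsz);
            memset(info, 0, argsz);
            info->argsz = argsz;
            ioctl(container_fd, VFIO_IOMMU_GET_INFO, info);
        }

        if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
            free(info);
            return;                 /* kernel too old to report capabilities */
        }

        /* Walk the capability chain looking for the IOVA range capability. */
        for (off = info->cap_offset; off; ) {
            struct vfio_info_cap_header *hdr =
                (struct vfio_info_cap_header *)((char *)info + off);

            if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
                struct vfio_iommu_type1_info_cap_iova_range *cap =
                    (struct vfio_iommu_type1_info_cap_iova_range *)hdr;
                uint32_t i;

                for (i = 0; i < cap->nr_iovas; i++) {
                    /* On AMD >= 4.11 the HT hole shows up as a gap here. */
                    printf("valid IOVA range: 0x%llx-0x%llx\n",
                           (unsigned long long)cap->iova_ranges[i].start,
                           (unsigned long long)cap->iova_ranges[i].end);
                }
            }
            off = hdr->next;
        }
        free(info);
    }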