On 2/25/22 05:22, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 08:34:40PM +0000, Joao Martins wrote:
>> On 2/24/22 20:12, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:
>>>> On 2/24/22 19:54, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
>>>>>> On 2/24/22 18:30, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
>>>>>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
>>>>>>>>>> On 2/23/22 23:35, Joao Martins wrote:
>>>>>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>>>>>>>> +                                          uint64_t pci_hole64_size)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>>>>>>> +    uint32_t eax, vendor[3];
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>>>>>>>> +        return;
>>>>>>>>>>>>> +    }
>>>>>>>>>>>>
>>>>>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
>>>>>>>>>>>> It's really about what we present to the guest though, isn't it?
>>>>>>>>>>>
>>>>>>>>>>> It was the easier catch-all to use cpuid without going into
>>>>>>>>>>> Linux UAPI specifics. But it doesn't have to tie in there; it is only
>>>>>>>>>>> relevant for systems with an IOMMU present.
>>>>>>>>>>>
>>>>>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
>>>>>>>>>>>>
>>>>>>>>>>> I think so, I can add that. Something like an amd_iommu_exists() helper
>>>>>>>>>>> in util/vfio-helpers.c which checks whether any sysfs child entries
>>>>>>>>>>> under /sys/class/iommu/ start with ivhd. Given that this HT region has
>>>>>>>>>>> been hardcoded in the IOMMU reserved regions since >= 4.11 (up to the
>>>>>>>>>>> latest), I don't think it's even worth checking that the range exists in:
>>>>>>>>>>>
>>>>>>>>>>> 	/sys/kernel/iommu_groups/0/reserved_regions
>>>>>>>>>>>
>>>>>>>>>>> (Also, that sysfs ABI is >= 4.11 only.)
>>>>>>>>>>
>>>>>>>>>> Here's what I have staged in my local tree to address your comment.
>>>>>>>>>>
>>>>>>>>>> Naturally the first chunk is what's affected by this patch; the rest is a
>>>>>>>>>> predecessor patch to introduce qemu_amd_iommu_is_present(). It seems to
>>>>>>>>>> pass all the tests and whatnot.
>>>>>>>>>>
>>>>>>>>>> I am not entirely sure this is the right place to put such a helper; open
>>>>>>>>>> to suggestions. As for the naming of the helper, I tried to follow the
>>>>>>>>>> rest of the file's style.
>>>>>>>>>>
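(For illustration only: a minimal sketch of what a sysfs-based check along the
lines described above could look like. The actual staged diff is snipped just
below, so the function name qemu_amd_iommu_is_present(), the use of GLib's
g_dir_*() helpers, and where it would live are assumptions here, not the real
patch.)

    #include <glib.h>
    #include <stdbool.h>

    /*
     * Sketch: report whether the host exposes an AMD IOMMU by looking for
     * ivhd* children under /sys/class/iommu/ (AMD IOMMU units show up
     * there as ivhd<N>).
     */
    static bool qemu_amd_iommu_is_present(void)
    {
        GDir *dir = g_dir_open("/sys/class/iommu", 0, NULL);
        const char *name;
        bool found = false;

        if (!dir) {
            return false;       /* no IOMMU exposed by the host at all */
        }
        while ((name = g_dir_read_name(dir)) != NULL) {
            if (g_str_has_prefix(name, "ivhd")) {
                found = true;
                break;
            }
        }
        g_dir_close(dir);
        return found;
    }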
[snip]

>>>>>>>>>
>>>>>>>>> why are we checking whether an AMD IOMMU is present
>>>>>>>>> on the host? Isn't what we care about whether it's
>>>>>>>>> present in the VM? What we are reserving after all
>>>>>>>>> is a range of GPAs, not actual host PAs ...
>>>>>>>>>
>>>>>>>> Right.
>>>>>>>>
>>>>>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
>>>>>>>> so we need to avoid mapping that portion of the address space entirely
>>>>>>>> and use some other portion of the GPA space. This has to do with the
>>>>>>>> host IOMMU, which is where the DMA mapping of guest PAs takes place.
>>>>>>>> So if you do not have a host IOMMU, you can't service guest DMA/PCIe
>>>>>>>> via VFIO through the host IOMMU, and therefore you don't have this
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> Whether one uses a guest IOMMU or not does not affect the result;
>>>>>>>> that would be more a case of mimicking real hardware, not fixing
>>>>>>>> the issue this series addresses.
>>>>>>>
>>>>>>>
>>>>>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
>>>>>>> And ideally move the logic reporting reserved ranges there too.
>>>>>>> However, I think vdpa has the same issue too.
>>>>>>> CC Jason for an opinion.
>>>>>>
>>>>>> It does have the same problem.
>>>>>>
>>>>>> This is not specific to VFIO; it applies to anything that uses the IOMMU.
>>>>>
>>>>> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
>>>>> that we don't have a global "enable-vfio" flag since vfio devices
>>>>> can be hot-plugged. I think this is not the first time we wanted
>>>>> something like this, right Alex?
>>>>>
>>>>>> It's just that VFIO doesn't let you shoot yourself in the foot :)
>>>>>>
>>>>>> vDPA would need to validate iova ranges as well to prevent mapping on
>>>>>> reserved IOVAs, similar to commit 9b77e5c7984 ("vfio/type1: check dma
>>>>>> map request is within a valid iova range"). Now you could argue that
>>>>>> it's an IOMMU core problem, maybe.
>>>>>>
>>>>>> But I see this as a separate problem,
>>>>>> because regardless of said validation, your guest provisioning
>>>>>> is still going to fail for guests with >= 1010G of max GPA, and even if
>>>>>> it doesn't fail you shouldn't be letting it DMA map those in the first
>>>>>> place (e.g. you could see spurious INVALID_DEVICE_REQUEST fault events
>>>>>> from the IOMMU if DMA is attempted over the first megabyte of that 1T
>>>>>> hole).
>>>>>
>>>>> I wonder what's the status on a system without an IOMMU.
>>>>>
>>>> In theory you should be OK. Also, it's worth keeping in mind that AMD
>>>> machines with >= 1T have this 12G hole marked as Reserved, so even DMA,
>>>> at least when the CPU is the initiator, should be fine in the worst-case
>>>> scenario.
>>>
>>> Not sure I understand this last sentence.
>>>
>> I was trying to say that the E820 from hardware is marked as Reserved and
>> any DMA from/to endpoints from kernel-supplied CPU addresses will use those
>> reserved ranges.
>
> meaning "will *not* use" I guess?
>
Yes, correct. Sorry, I missed a word there. That has been happening to me
quite a lot these days :(

>>>>>>> Also, some concerns
>>>>>>> - I think this changes memory layout for working VMs not using VFIO.
>>>>>>>   Better preserve the old layout for old machine types?
>>>>>>>
>>>>>> Oh definitely, and I do that in this series. See the last commit; in
>>>>>> the past it was also a machine property exposed to userspace.
>>>>>> Otherwise I would be breaking (badly) compat/migration.
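(To make the thresholds in the next paragraph concrete: on AMD the
HyperTransport reserved window sits just below 1 TiB, so any layout whose
above-4G RAM plus 64-bit PCI hole would reach it has to jump over the window.
A rough, illustrative sketch of that kind of check; the constants and names
here are mine, not the series' actual code:)

    #include <stdbool.h>
    #include <stdint.h>

    /* AMD HyperTransport reserved window: [1012 GiB, 1 TiB) */
    #define AMD_HT_START   0xFD00000000ULL    /* FD_0000_0000h             */
    #define AMD_ABOVE_1TB  0x10000000000ULL   /* 1 TiB, first GPA past it  */

    /*
     * Sketch: does the guest's above-4G region (RAM plus the 64-bit PCI
     * hole) run into the HT window?  If it does, RAM above 4G has to be
     * relocated so nothing DMA-mappable lands inside the window.
     */
    static bool crosses_amd_ht_window(uint64_t above_4g_start,
                                      uint64_t above_4g_size,
                                      uint64_t pci_hole64_size)
    {
        return above_4g_start + above_4g_size + pci_hole64_size > AMD_HT_START;
    }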
>>>>>>
>>>>>> I would like to emphasize that these memory layout changes are
>>>>>> *exclusive* to, and *only* for, hosts on AMD with guest memory bigger
>>>>>> than ~950G-~1010G. It's not every guest, but a fairly limited subset
>>>>>> of big-memory guests that are not the norm.
>>>>>>
>>>>>> Albeit the phys-bits property errors might give us a bit more trouble,
>>>>>> so it might help to be more conservative.
>>>>>>
>>>>>>> - You mention the phys bits issue very briefly, and it's pretty
>>>>>>>   concerning. Do we maybe want to also disable the workaround if phys
>>>>>>>   bits is too small?
>>>>>>
>>>>>> We are doing that here (well, in v4), as suggested by Igor. Note that in
>>>>>> this series it's a warning, but in v4 I have it as an error, together
>>>>>> with the 32-bit issues that I found through qtest.
>>>>>>
>>>>>> I share the same concern as you over making this an error, because of
>>>>>> compatibility. Perhaps we could go back to the previous version, turn it
>>>>>> back into a warning, and additionally even disable the relocation
>>>>>> altogether. Or, if desired, even depend on a machine property to enable
>>>>>> it.
>>>>>
>>>>> I would be inclined to disable the relocation. And maybe block vfio? I'm
>>>>> not 100% sure about that last one, but this can be a vfio decision to
>>>>> make.
>>>>>
>>>> I don't see this as a VFIO problem (VFIO is actually being a good citizen
>>>> and doing the right thing :)). From my perspective this fix is also useful
>>>> to vDPA (which we also care about), and thus will also let us avoid DMA
>>>> mapping in these GPA ranges.
>>>>
>>>> The reason I mention VFIO in the cover letter is that it's one visible
>>>> UAPI where users will notice that suddenly their 1TB+ VMs break.
>>>
>>> What I mean is that most VMs don't use either VFIO or VDPA.
>>> If we had VFIO/VDPA as an accelerator that needs to be listed
>>> upfront when qemu is started, we could solve this with
>>> a bit less risk by not changing anything for VMs without these two.
>>>
>> That wouldn't work for the vfio/vdpa hotplug case (which we do use), and
>> it's part of the reason I chose to always avoid the 1TB hole. VFIO even
>> tells you what the allowed IOVA ranges are when you create the container.
>>
>> The machine property, though, could communicate /the intent/ to add
>> any VFIO or vDPA devices in the future and maybe cover for that. That
>> lets one avoid any 'accelerator-only' problems and restricts the issues
>> of this series to VMs with VFIO/VDPA, if the risk is a concern.
>
> Well, Alex nacked that idea (and I certainly trust him where VFIO is
> concerned), so I guess for now we'll just live with the risk.
>
My reading was that he nacked a 'VFIO-only' global property, not necessarily
the machine property for valid-iova. Hmm, I might be misreading it, as at the
end of the day the result will lead to the same thing.
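(As an aside, on "VFIO even tells you what the allowed IOVA ranges are when
you create the container": with a type1 container those ranges are reported
through VFIO_IOMMU_GET_INFO's capability chain. A hedged sketch of reading
them follows; error handling is trimmed, and dump_valid_iova_ranges() and its
container_fd parameter are illustrative names for an already-configured
container:)

    #include <linux/vfio.h>
    #include <sys/ioctl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch: print the valid IOVA ranges a type1 VFIO container reports. */
    static void dump_valid_iova_ranges(int container_fd)
    {
        struct vfio_iommu_type1_info *info;
        uint32_t argsz = sizeof(*info);
        uint32_t off;

        info = calloc(1, argsz);
        info->argsz = argsz;
        /* First call only tells us how large the info plus caps really is. */
        ioctl(container_fd, VFIO_IOMMU_GET_INFO, info);
        if (info->argsz > argsz) {
            argsz = info->argsz;
            info = realloc(info, argsz);
            memset(info, 0, argsz);
            info->argsz = argsz;
            ioctl(container_fd, VFIO_IOMMU_GET_INFO, info);
        }

        if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
            free(info);
            return;                 /* kernel too old to report capabilities */
        }

        /* Walk the capability chain looking for the IOVA range capability. */
        for (off = info->cap_offset; off; ) {
            struct vfio_info_cap_header *hdr =
                (struct vfio_info_cap_header *)((char *)info + off);

            if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
                struct vfio_iommu_type1_info_cap_iova_range *cap =
                    (struct vfio_iommu_type1_info_cap_iova_range *)hdr;
                uint32_t i;

                for (i = 0; i < cap->nr_iovas; i++) {
                    /* On AMD >= 4.11 the HT hole shows up as a gap here. */
                    printf("valid IOVA range: 0x%llx-0x%llx\n",
                           (unsigned long long)cap->iova_ranges[i].start,
                           (unsigned long long)cap->iova_ranges[i].end);
                }
            }
            off = hdr->next;
        }
        free(info);
    }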