On Wed, Nov 24, 2021 at 4:52 PM Peter Xu <pet...@redhat.com> wrote: > > On Wed, Nov 24, 2021 at 04:28:52PM +0800, Jason Wang wrote: > > On Wed, Nov 24, 2021 at 3:57 PM Peter Xu <pet...@redhat.com> wrote: > > > > > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote: > > > > When booting with scalable mode, I hit this error: > > > > > > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero > > > > iova=0xfffff002, level=0x1slpte=0x102681803) > > > > qemu-system-x86_64: vtd_iommu_translate: detected translation failure > > > > (dev=01:00:00, iova=0xfffff002) > > > > qemu-system-x86_64: New fault is not recorded due to compression of > > > > faults > > > > > > > > This is because the SNP bit is set since Linux kernel commit > > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using > > > > FL for IOVA") where SNP bit is set if scalable mode is on though this > > > > seems to be an violation on the spec which said the SNP bit is > > > > considered to be reserved if SC is not supported. > > > > > > When I was reading that commit, I was actually confused by this change: > > > > > > ---8<--- > > > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c > > > index 956a02eb40b4..0ee5f1bd8af2 100644 > > > --- a/drivers/iommu/intel/iommu.c > > > +++ b/drivers/iommu/intel/iommu.c > > > @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct > > > intel_iommu *skip) > > > rcu_read_lock(); > > > for_each_active_iommu(iommu, drhd) { > > > if (iommu != skip) { > > > - if (!ecap_sc_support(iommu->ecap)) { > > > + /* > > > + * If the hardware is operating in the scalable > > > mode, > > > + * the snooping control is always supported since > > > we > > > + * always set PASID-table-entry.PGSNP bit if the > > > domain > > > + * is managed outside (UNMANAGED). > > > + */ > > > + if (!sm_supported(iommu) && > > > + !ecap_sc_support(iommu->ecap)) { > > > ret = 0; > > > break; > > > } > > > ---8<--- > > > > > > Does it mean that for some hardwares that has sm_supported()==true, it'll > > > have > > > SC bit cleared in ecap register? > > > > I guess not, so it's probably only the problem of vIOMMU. > > But then what does the code mean above? > > If SC is required for scalable mode,
I guess the code was tested on hardware with both SC and SM. > ecap_sc_support()==false already implies > sm_supported()==false too. Then that check seems redundant. To tell the truth, I'm not sure. And according to the spec SNP is reserved if SC==false. > > > > > > That sounds odd, and not sure why. Maybe Yi > > > Liu or Yi Sun may know? > > > > Another interesting point is that, it looks to me after that commit > > SNP is used for the domain that is not UNMANAGED even if PGSNP is not > > set. > > > > > > > > > > > > > > > To unbreak the guest, ignore the SNP bit for scalable mode first. In > > > > the future we may consider to add SC support. > > > > > > Oh yes, I remembered the last time we discussed this. Could you remind me > > > what's missing for us to support SC? > > > > Exactly what you described below. > > > > > > > > IIUC, for common device emulations we can just declare SC==1, right? > > > > Strictly speaking, only safe for the datapath that is running in the > > Qemu. For things like vhost-user, I'm not sure it can check CC when > > using VFIO. > > Hmm yeah.. though I'll just call those vhost-user backends to fall into below > "device assignment" category too. Great to know we're on the same page here. I see. > > > > > > As all > > > the DMAs (including kernel accels like vhost) will be from host > > > processors so > > > there're no coherent issues with guest vcpu threads. > > > > > > If that's correct, the only challenge is device assignment in any form (I > > > am > > > not familiar with vdpa; so perhaps that includes vfio, vpda and any other > > > kind > > > of assigning host devices to guest?), then we'll try to detect IOMMU_CACHE > > > capability from the host iommu groups that covers the assigned devices, > > > and we > > > only set SC==1 if we have cache coherency on all the devices? > > > > For VFIO yes, and we should prevent VFIO without CC to be plugged if > > SC is advertised. > > > > For vDPA, we don't need to worry about it at all, kernel vDPA forces > > IOMMU_CACHE now. > > > > vhost_vdpa_alloc_domain(): > > > > if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY)) > > return -ENOTSUPP; > > > > (For device with on-chip IOMMU, it's the parent and device that > > guarantees the CC) > > Ah right, yes you mentioned it and I forgot.. Though I'm not sure we'd simply > double-check again here (if we'll support vfio anyway, then we'll need to be > able to read those out from IOMMU groups), because we shouldn't rely on that > fact which is an implementation detail of vdpa, imho (say, when vdpa starts to > support !SC someday). Good point, but at that time we need to expose the !SC via uAPI which is currently not supported. > > PS: I have other comments below in previous reply - please have a look too! > :-D Sorry for missing that part :( > > > > > Thanks > > > > > > > > > > > > > > > Signed-off-by: Jason Wang <jasow...@redhat.com> > > > > --- > > > > hw/i386/intel_iommu.c | 18 ++++++++++++------ > > > > hw/i386/intel_iommu_internal.h | 2 ++ > > > > 2 files changed, 14 insertions(+), 6 deletions(-) > > > > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c > > > > index 294499ee20..3bcac56c3e 100644 > > > > --- a/hw/i386/intel_iommu.c > > > > +++ b/hw/i386/intel_iommu.c > > > > @@ -969,7 +969,8 @@ static dma_addr_t > > > > vtd_get_iova_pgtbl_base(IntelIOMMUState *s, > > > > static uint64_t vtd_spte_rsvd[5]; > > > > static uint64_t vtd_spte_rsvd_large[5]; > > > > > > > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level) > > > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s, > > > > + uint64_t slpte, uint32_t level) > > > > { > > > > uint64_t rsvd_mask = vtd_spte_rsvd[level]; > > > > > > > > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, > > > > uint32_t level) > > > > rsvd_mask = vtd_spte_rsvd_large[level]; > > > > } > > > > > > > > + if (s->scalable_mode) { > > > > + rsvd_mask &= ~VTD_SPTE_SNP; > > > > + } > > > > > > IMHO what we want to do is only to skip the leaves of pgtables on SNP, so > > > maybe > > > we still want to keep checking the bit 11 reserved for e.g. common > > > pgtable dir > > > entries? Maybe, but it's probably a question that can only be answered by Intel. I can change it for the next version if you stick. > > > > > > To do so, how about directly modifying the vtd_spte_rsvd* fields in > > > vtd_init()? > > > I think we only need to modify 4k/2m/1g entries to mask bit 11, they're: > > > > > > - vtd_spte_rsvd[1] (4K) > > > - vtd_spte_rsvd_large[2] (2M) > > > - vtd_spte_rsvd_large[3] (1G) > > > > > > What do you think? Then we avoid passing IntelIOMMUState* all over too. I started a version like that:), it should work, I will change that if it was agreed by everyone. The reason that I changed to pass IntelIOMMUState is that it results in a smaller changeset. The reason is that I tend to introduce new rsvd bits for SM mode since after checking vtd 3.3 it looks have different reserved bit requirement than before (at least 1.2) Thanks > > [Here] > > > > > > > > + > > > > return slpte & rsvd_mask; > > > > } > > > > > > > > @@ -1054,7 +1059,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, > > > > VTDContextEntry *ce, > > > > iova, level, slpte, is_write); > > > > return is_write ? -VTD_FR_WRITE : -VTD_FR_READ; > > > > } > > > > - if (vtd_slpte_nonzero_rsvd(slpte, level)) { > > > > + if (vtd_slpte_nonzero_rsvd(s, slpte, level)) { > > > > error_report_once("%s: detected splte reserve non-zero " > > > > "iova=0x%" PRIx64 ", level=0x%" PRIx32 > > > > "slpte=0x%" PRIx64 ")", __func__, iova, > > > > @@ -1185,7 +1190,8 @@ static int vtd_page_walk_one(IOMMUTLBEvent > > > > *event, vtd_page_walk_info *info) > > > > * @write: whether parent level has write permission > > > > * @info: constant information for the page walk > > > > */ > > > > -static int vtd_page_walk_level(dma_addr_t addr, uint64_t start, > > > > +static int vtd_page_walk_level(IntelIOMMUState *s, > > > > + dma_addr_t addr, uint64_t start, > > > > uint64_t end, uint32_t level, bool read, > > > > bool write, vtd_page_walk_info *info) > > > > { > > > > @@ -1214,7 +1220,7 @@ static int vtd_page_walk_level(dma_addr_t addr, > > > > uint64_t start, > > > > goto next; > > > > } > > > > > > > > - if (vtd_slpte_nonzero_rsvd(slpte, level)) { > > > > + if (vtd_slpte_nonzero_rsvd(s, slpte, level)) { > > > > trace_vtd_page_walk_skip_reserve(iova, iova_next); > > > > goto next; > > > > } > > > > @@ -1235,7 +1241,7 @@ static int vtd_page_walk_level(dma_addr_t addr, > > > > uint64_t start, > > > > * This is a valid PDE (or even bigger than PDE). We need > > > > * to walk one further level. > > > > */ > > > > - ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, > > > > info->aw), > > > > + ret = vtd_page_walk_level(s, vtd_get_slpte_addr(slpte, > > > > info->aw), > > > > iova, MIN(iova_next, end), level > > > > - 1, > > > > read_cur, write_cur, info); > > > > } else { > > > > @@ -1294,7 +1300,7 @@ static int vtd_page_walk(IntelIOMMUState *s, > > > > VTDContextEntry *ce, > > > > end = vtd_iova_limit(s, ce, info->aw); > > > > } > > > > > > > > - return vtd_page_walk_level(addr, start, end, level, true, true, > > > > info); > > > > + return vtd_page_walk_level(s, addr, start, end, level, true, true, > > > > info); > > > > } > > > > > > > > static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s, > > > > diff --git a/hw/i386/intel_iommu_internal.h > > > > b/hw/i386/intel_iommu_internal.h > > > > index 3d5487fe2c..a6c788049b 100644 > > > > --- a/hw/i386/intel_iommu_internal.h > > > > +++ b/hw/i386/intel_iommu_internal.h > > > > @@ -388,6 +388,8 @@ typedef union VTDInvDesc VTDInvDesc; > > > > #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffff0000ffe0fff8 > > > > > > > > /* Rsvd field masks for spte */ > > > > +#define VTD_SPTE_SNP 0x800ULL > > > > + > > > > #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, dt_supported) \ > > > > dt_supported ? \ > > > > (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) > > > > : \ > > > > -- > > > > 2.25.1 > > > > > > > > > > -- > > > Peter Xu > > > > > > > -- > Peter Xu >