> -----Original Message-----
> From: Liu, Yuan1
> Sent: Friday, March 22, 2024 10:07 AM
> To: Peter Xu <pet...@redhat.com>
> Cc: Daniel P. Berrangé <berra...@redhat.com>; faro...@suse.de;
> qemu-de...@nongnu.org; hao.xi...@bytedance.com; bryan.zh...@bytedance.com;
> Zou, Nanhai <nanhai....@intel.com>
> Subject: RE: [PATCH v5 5/7] migration/multifd: implement initialization of
> qpl compression
>
> > -----Original Message-----
> > From: Peter Xu <pet...@redhat.com>
> > Sent: Thursday, March 21, 2024 11:28 PM
> > To: Liu, Yuan1 <yuan1....@intel.com>
> > Cc: Daniel P. Berrangé <berra...@redhat.com>; faro...@suse.de;
> > qemu-de...@nongnu.org; hao.xi...@bytedance.com; bryan.zh...@bytedance.com;
> > Zou, Nanhai <nanhai....@intel.com>
> > Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of
> > qpl compression
> >
> > On Thu, Mar 21, 2024 at 01:37:36AM +0000, Liu, Yuan1 wrote:
> > > > -----Original Message-----
> > > > From: Peter Xu <pet...@redhat.com>
> > > > Sent: Thursday, March 21, 2024 4:32 AM
> > > > To: Liu, Yuan1 <yuan1....@intel.com>
> > > > Cc: Daniel P. Berrangé <berra...@redhat.com>; faro...@suse.de;
> > > > qemu-de...@nongnu.org; hao.xi...@bytedance.com;
> > > > bryan.zh...@bytedance.com; Zou, Nanhai <nanhai....@intel.com>
> > > > Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization
> > > > of qpl compression
> > > >
> > > > On Wed, Mar 20, 2024 at 04:23:01PM +0000, Liu, Yuan1 wrote:
> > > > > Let me explain here: during an IAA decompression operation, the
> > > > > decompressed data can be output directly to the virtual address of
> > > > > the guest memory by the IAA hardware. This avoids copying the
> > > > > decompressed data to guest memory with the CPU.
> > > >
> > > > I see.
> > > >
> > > > > Without -mem-prealloc, none of the guest memory is populated, so the
> > > > > IAA hardware needs to trigger an I/O page fault first and then output
> > > > > the decompressed data to the guest memory region. Besides that, CPU
> > > > > page faults will also trigger IOTLB flush operations when IAA devices
> > > > > use SVM.
> > > >
> > > > Oh so the IAA hardware already can use CPU pgtables? Nice..
> > > >
> > > > Why is an IOTLB flush needed? AFAIU we're only installing new pages;
> > > > the request can either come from a CPU access or a DMA. In all cases
> > > > there should be no tearing down of an old page. Isn't an IOTLB flush
> > > > only needed if a tear-down happens?
> > >
> > > As far as I know, IAA hardware uses SVM technology to use the CPU's page
> > > table for address translation (IOMMU scalable mode directly accesses the
> > > CPU page table). Therefore, when the CPU page table changes, the device's
> > > invalidation operation needs to be triggered to update the IOMMU and the
> > > device's cache.
> > >
> > > My current kernel version is mainline 6.2. The issue I see is as follows:
> > > --handle_mm_fault
> > >  |
> > >  -- wp_page_copy
> >
> > This is the CoW path. Not usual at all..
> >
> > I assume this issue should only be present on the destination. Then the
> > guest pages should be the destination of such DMAs, which means these
> > should be write faults, and as we see here it is, otherwise it won't
> > trigger a CoW.
> >
> > However it's not clear to me why a pre-installed zero page existed. It
> > means someone read the guest pages first.
> >
> > It might be interesting to know _why_ someone reads the guest pages, even
> > if we know they're all zeros. If we can avoid such reads then it'll be a
> > hole rather than a prefaulted read on the zero page, then invalidations
> > are not needed, and I expect that should fix the IOTLB storm issue.
>
> The received pages will be read for the zero-page check first. Although
> these pages are zero pages and the IAA hardware will not access them, the
> CoW happens and causes the following IOTLB flush operation. As far as I
> know, the IOMMU quickly detects whether an address range has been used by
> the device, and does not invalidate addresses that are not used by the
> device; this has not yet been resolved in Linux kernel 6.2. I will check
> the latest status for this.
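To illustrate the fault sequence described above, here is a minimal,
standalone userspace sketch (not code from the patch): an anonymous private
page starts out as a hole; reading it (as the zero-page check does) maps in
the shared zero page read-only, and the first subsequent write then takes a
write-protect fault through wp_page_copy(), which is the CoW that triggers
the mmu-notifier/IOTLB invalidation when an SVM device shares the page
tables.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Anonymous private mapping: initially an unpopulated hole. */
    unsigned char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /*
     * Read first (what the zero-page check does): the kernel installs
     * the shared zero page read-only for this address.
     */
    volatile unsigned char c = p[0];
    (void)c;

    /*
     * First write afterwards: write-protect fault -> wp_page_copy(),
     * i.e. the CoW seen in the trace, which in turn triggers the IOTLB
     * invalidation when a device shares the page tables via SVM.
     */
    p[0] = 1;

    munmap(p, 4096);
    return 0;
}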
I checked the Linux mainline 6.8 code; there are no big changes for this. In
6.8, if the process needs to flush the MMU TLB, an I/O TLB flush is also
triggered when the process has SVM devices. I haven't found any code that
checks whether pages have the EA (Extended-Accessed) bit set before
submitting invalidation operations; this is the same as in 6.2.

VT-d spec, section 3.6.2:
  If the Extended-Accessed-Flag-Enable (EAFE) is 1 in a scalable-mode
  PASID-table entry that references a first-stage paging-structure entry
  used by the remapping hardware, it atomically sets the EA field in that
  entry. Whenever the EA field is atomically set, the A field is also set
  in the same atomic operation. For software usages where the first-stage
  paging structures are shared across heterogeneous agents (e.g., CPUs and
  accelerator devices such as GPUs), the EA flag may be used by software to
  identify pages accessed by non-CPU agent(s) (as opposed to the A flag,
  which indicates access by any agent sharing the paging structures).
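Conceptually, the missing check would look something like the sketch below.
This is purely illustrative: range_has_ea_set() and flush_device_iotlb() are
hypothetical helpers, the EA bit position is a placeholder, and it is not the
real VT-d driver path (qi_flush_piotlb()/qi_flush_dev_iotlb_pasid()); it only
shows the idea of skipping the device-TLB flush when no entry in the range
was ever accessed by the device.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define EA_BIT (1ULL << 10)  /* placeholder bit position, illustrative only */

/* Hypothetical: scan the first-stage entries covering the flushed range. */
static bool range_has_ea_set(const uint64_t *entries, size_t nr)
{
    for (size_t i = 0; i < nr; i++) {
        if (entries[i] & EA_BIT) {
            return true;
        }
    }
    return false;
}

/* Hypothetical stand-in for the device IOTLB invalidation submission. */
static void flush_device_iotlb(void)
{
    printf("device IOTLB invalidation issued\n");
}

static void mmu_tlb_flush_range(const uint64_t *entries, size_t nr)
{
    /* The CPU TLB flush always happens here. */
    if (range_has_ea_set(entries, nr)) {
        /* Only pages the device actually touched need a device flush. */
        flush_device_iotlb();
    } else {
        printf("EA clear for the whole range, device flush skipped\n");
    }
}

int main(void)
{
    uint64_t cpu_only[4] = { 0x1, 0x1, 0x1, 0x1 };          /* A set, EA clear */
    uint64_t dev_used[4] = { 0x1, EA_BIT | 0x1, 0x1, 0x1 }; /* one EA set      */

    mmu_tlb_flush_range(cpu_only, 4);  /* device flush skipped */
    mmu_tlb_flush_range(dev_used, 4);  /* device flush issued  */
    return 0;
}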
> void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> {
>     for (int i = 0; i < p->zero_num; i++) {
>         void *page = p->host + p->zero[i];
>         if (!buffer_is_zero(page, p->page_size)) {
>             memset(page, 0, p->page_size);
>         }
>     }
> }
>
> > It'll still be good we can fix this first to not make qpl special from
> > this regard, so that the hope is the migration submodule shouldn't rely
> > on any pre-config (-mem-prealloc) on guest memory behaviors to work
> > properly.
>
> Even if the IOTLB problem can be avoided, the I/O page fault problem
> remains (normal pages are loaded by the IAA device and the normal page
> faults are resolved through the IOMMU; the performance is not good).
>
> The decompressed data from the IAA device could be output to pre-populated
> memory instead of directly to the guest address, but then each multifd
> thread needs two memory copies: one copy from the network to the IAA input
> memory (pre-populated), and another copy from the IAA output memory
> (pre-populated) to the guest address, which may become a performance
> bottleneck at the destination during the live migration process.
>
> So I think it is still necessary to use the -mem-prealloc option.
>
> > >   -- mmu_notifier_invalidate_range
> > >    |
> > >    -- intel_invalidate_range
> > >     |
> > >     -- qi_flush_piotlb
> > >     -- qi_flush_dev_iotlb_pasid
>
> > --
> > Peter Xu
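As a trailing note on the two-copy alternative quoted above (IAA output into
pre-populated bounce buffers instead of the guest address), the data flow it
describes is roughly the following. This is only a sketch: iaa_decompress()
is a hypothetical stand-in for the real QPL job submission, and the buffers
stand in for the pre-populated IAA input/output memory and the guest page.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the IAA/QPL decompression job; just copies. */
static size_t iaa_decompress(const uint8_t *in, size_t in_len,
                             uint8_t *out, size_t out_cap)
{
    size_t n = in_len < out_cap ? in_len : out_cap;
    memcpy(out, in, n);
    return n;
}

int main(void)
{
    enum { PAGE = 4096 };
    uint8_t net_buf[PAGE];                /* data as received from the socket */
    static uint8_t iaa_in[PAGE];          /* pre-populated IAA input buffer   */
    static uint8_t iaa_out[PAGE];         /* pre-populated IAA output buffer  */
    uint8_t *guest_page = malloc(PAGE);   /* stands in for the guest address  */

    if (!guest_page) {
        return 1;
    }
    memset(net_buf, 0xAB, sizeof(net_buf));

    /* Copy 1: network buffer -> pre-populated IAA input memory. */
    memcpy(iaa_in, net_buf, PAGE);

    /* Decompress into the pre-populated output buffer (no guest-page fault). */
    size_t out_len = iaa_decompress(iaa_in, PAGE, iaa_out, sizeof(iaa_out));

    /* Copy 2: pre-populated IAA output memory -> guest page, done by the CPU. */
    memcpy(guest_page, iaa_out, out_len);

    printf("wrote %zu bytes to the guest page via two copies\n", out_len);
    free(guest_page);
    return 0;
}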