> -----Original Message-----
> From: Peter Xu <pet...@redhat.com>
> Sent: Monday, January 29, 2024 6:43 PM
> To: Liu, Yuan1 <yuan1....@intel.com>
> Cc: faro...@suse.de; leob...@redhat.com; qemu-devel@nongnu.org; Zou,
> Nanhai <nanhai....@intel.com>
> Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> Compression
> 
> On Wed, Jan 03, 2024 at 07:28:47PM +0800, Yuan Liu wrote:
> > Hi,
> 
> Hi, Yuan,
> 
> I have a few comments and questions.  Many of them can be pure questions
> as I don't know enough on these new technologies.
> 
> >
> > I am writing to submit a code change aimed at enhancing live migration
> > acceleration by leveraging the compression capability of the Intel
> > In-Memory Analytics Accelerator (IAA).
> >
> > The implementation of the IAA (de)compression code is based on Intel
> > Query Processing Library (QPL), an open-source software project
> > designed for IAA high-level software programming.
> > https://github.com/intel/qpl
> >
> > In the last version, there was some discussion about whether to
> > introduce a new compression algorithm for IAA. Because the compression
> > algorithm of IAA hardware is based on deflate, and QPL already
> > supports Zlib, so in this version, I implemented IAA as an accelerator
> > for the Zlib compression method. However, due to some reasons, QPL is
> > currently not compatible with the existing Zlib method that Zlib
> > compressed data can be decompressed by QPL and vice versa.
> >
> > I have some concerns about the existing Zlib compression
> >   1. Will you consider supporting one channel to support multi-stream
> >      compression? Of course, this may lead to a reduction in compression
> >      ratio, but it will allow the hardware to process each stream
> >      concurrently. We can have each stream process multiple pages,
> >      reducing the loss of compression ratio. For example, 128 pages are
> >      divided into 16 streams for independent compression. I will provide
> >      early performance data in the next version (v4).
> 
> I think Juan used to ask similar question: how much this can help if
> multifd can already achieve some form of concurrency over the pages?
> Couldn't the user specify more multifd channels if they want to grant more
> cpu resource for comp/decomp purpose?
> 
> IOW, how many concurrent channels QPL can provide?  What is the suggested
> concurrency channels there?

On the software side, QPL places no limit on the number of concurrent
compression and decompression tasks.
On the hardware side, one physical IAA device can process two compression
tasks or eight decompression tasks concurrently. An Intel SPR server has up
to 8 IAA devices; the actual number varies with the customer's product
selection and deployment.

As for the number of concurrent channels required, I think this may not be
a bottleneck. Please allow me to explain in a little more detail.

1. If the compression design is based on the Zlib/Deflate/Gzip streaming mode,
then we do need more channels to keep the processing concurrent. Each multifd
packet (which contains 128 independent pages) has to be compressed page by
page, so the 128 pages are not processed concurrently. The only concurrency
comes from having multiple channels, each working on its own multifd packet.

2. Based on our testing, we prefer concurrent processing at the granularity
of 4K pages rather than multifd packets, which means that the 128 pages
belonging to a packet can be compressed/decompressed concurrently. Even a
single channel can then utilize all the resources of IAA. However, this is
not compatible with the existing zlib method.
The code is similar to the following:
  /* Pseudocode: submit_job()/wait_job() stand in for the real calls */
  for (int i = 0; i < num_pages; i++) {
      job[i]->input_data = pages[i];
      submit_job(job[i]);  /* non-blocking submit of a comp/decomp task */
  }
  for (int i = 0; i < num_pages; i++) {
      wait_job(job[i]);    /* busy polling; in the future, this part and
                              the data sending will be pipelined */
  }

3. Currently, the patches we have provided to the community are based on
streaming compression, in order to be compatible with the current zlib
method. However, we found that there are still many problems with this
approach, so in the next version we plan to provide the independent QPL/IAA
acceleration function described above.
Compatibility issues include the following:
    1. QPL currently does not support the Z_SYNC_FLUSH operation.
    2. The IAA comp/decomp window is fixed at 4K, while the default zlib
window size is 32K. The window size should be the same on both the comp and
decomp sides.
    3. I also looked into the QAT compression scheme. QATzip currently
supports neither zlib nor Z_SYNC_FLUSH. Its window size is 32K.
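
The window size issue (item 2 in the list above) can be reproduced with
software zlib alone. The following is only a minimal sketch in Python and
does not use QPL or IAA; deflate() and inflate_ok() are hypothetical helper
names, and window sizes 15 (32K) and 12 (4K) are chosen to mirror the zlib
default and the IAA window.

```python
import zlib


def deflate(data: bytes, wbits: int) -> bytes:
    """Compress at level 1 with a window of 2**wbits bytes (zlib format)."""
    comp = zlib.compressobj(1, zlib.DEFLATED, wbits)
    return comp.compress(data) + comp.flush()


def inflate_ok(blob: bytes, wbits: int) -> bool:
    """Return True if blob decompresses with a window of 2**wbits bytes."""
    try:
        zlib.decompress(blob, wbits)
        return True
    except zlib.error:
        return False


data = bytes(range(256)) * 64      # 16 KiB of sample data (four 4K pages)
big_window = deflate(data, 15)     # 32K window, the zlib default
small_window = deflate(data, 12)   # 4K window, the size IAA works with

print("32K stream, 32K decoder:", inflate_ok(big_window, 15))
print("32K stream,  4K decoder:", inflate_ok(big_window, 12))
print(" 4K stream,  4K decoder:", inflate_ok(small_window, 12))
print(" 4K stream, 32K decoder:", inflate_ok(small_window, 15))
```

In software zlib, a decoder with a smaller window rejects a stream that
declares a larger one, while the opposite direction works. This is why the
two sides have to agree on the window size before mixing zlib and a
4K-window device.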

In general, I think it is a good suggestion to make the accelerator compatible
with standard compression algorithms, but we should also let the accelerator
run independently, which avoids some of its compatibility and performance
problems. For example, we can add an "accel" option to the compression
method; the user must then specify the same accelerator, through the
compression accelerator parameter, on both the source and destination ends
(just like specifying the same compression algorithm).
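
To make the contrast between points 1 and 2 above more concrete, here is a
minimal software-only sketch in Python. It is not QPL code: the standard
zlib module does the compression and a thread pool stands in for the IAA
work queue. compress_streaming() and compress_per_page() are hypothetical
names, and the sample pages are generated only for illustration.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 4096


def compress_streaming(pages):
    """One deflate stream per packet: pages are fed in order, one by one."""
    stream = zlib.compressobj(1)
    out = [stream.compress(page) for page in pages]
    # Z_SYNC_FLUSH pushes out all pending data but keeps the stream open.
    out.append(stream.flush(zlib.Z_SYNC_FLUSH))
    return b"".join(out)


def compress_per_page(pages, pool):
    """One independent job per page: submit all jobs, then wait for all."""
    jobs = [pool.submit(zlib.compress, page, 1) for page in pages]
    return [job.result() for job in jobs]


# 128 sample pages of 4096 bytes each, half repetitive and half random
pages = [(bytes([i]) * 64 + os.urandom(64)) * 32 for i in range(128)]

with ThreadPoolExecutor(max_workers=8) as pool:
    per_page = compress_per_page(pages, pool)
streamed = compress_streaming(pages)

print("streaming bytes:", len(streamed))
print("per-page bytes: ", sum(len(c) for c in per_page))
```

Each per-page result can be decompressed on its own, so the receiving side
can also work on all pages of a packet at once. The streaming output has to
be decoded in order, because later pages may refer back to earlier ones.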

> >
> >   2. Will you consider using QPL/IAA as an independent compression
> >      algorithm instead of an accelerator? In this way, we can better
> >      utilize hardware performance and some features, such as IAA's
> >      canned mode, which can be dynamically generated by some statistics
> >      of data. A huffman table to improve the compression ratio.
> 
> Maybe one more knob will work?  If it's not compatible with the deflate
> algo maybe it should never be the default.  IOW, the accelerators may be
> extended into this (based on what you already proposed):
> 
>   - auto ("qpl" first, "none" second; never "qpl-optimized")
>   - none (old zlib)
>   - qpl (qpl compatible)
>   - qpl-optimized (qpl uncompatible)
> 
> Then "auto"/"none"/"qpl" will always be compatible, only the last doesn't,
> user can select it explicit, but only on both sides of QEMU.
Yes, this is what I want: I need a way for QPL to be used without being
compatible with zlib.
From my current point of view, if zlib chooses raw deflate mode, then QAT will
be compatible with the current community's zlib solution.
So my suggestion is as follows:

Compression method parameter
 - none
 - zlib
 - zstd
 - accel (both QEMU sides need to explicitly select the same accelerator from
   the "Compression accelerator parameter")

Compression accelerator parameter
 - auto
 - none
 - qpl (qpl will not support zlib/zstd; it will report an error when zlib/zstd
   is selected)
 - qat (it can provide acceleration of zlib/zstd)
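
The rules in the two lists above could be checked roughly as follows. This
is only a sketch in Python of the proposed semantics, not QEMU code:
check_compression_config() and sides_match() are hypothetical names, and
rejecting "auto"/"none" together with the "accel" method is my reading of
"select explicitly", so please treat it as an assumption.

```python
METHODS = {"none", "zlib", "zstd", "accel"}
ACCELERATORS = {"auto", "none", "qpl", "qat"}


def check_compression_config(method: str, accel: str) -> None:
    """Raise ValueError for combinations the proposal would reject."""
    if method not in METHODS:
        raise ValueError(f"unknown compression method: {method}")
    if accel not in ACCELERATORS:
        raise ValueError(f"unknown compression accelerator: {accel}")
    # qpl is not compatible with the standard zlib/zstd streams.
    if accel == "qpl" and method in ("zlib", "zstd"):
        raise ValueError("qpl does not support zlib/zstd")
    # Assumption: "accel" requires an explicitly named accelerator.
    if method == "accel" and accel in ("auto", "none"):
        raise ValueError("method 'accel' needs an explicit accelerator")


def sides_match(source: tuple, destination: tuple) -> bool:
    """Both QEMU sides must use the same (method, accelerator) pair."""
    return source == destination


check_compression_config("accel", "qpl")   # accepted
check_compression_config("zlib", "qat")    # accepted
print(sides_match(("accel", "qpl"), ("accel", "qpl")))
```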

> > Test condition:
> >   1. Host CPUs are based on Sapphire Rapids, and frequency locked to
> >      3.4G
> >   2. VM type, 16 vCPU and 64G memory
> >   3. The Idle workload means no workload is running in the VM
> >   4. The Redis workload means YCSB workloadb + Redis Server are running
> >      in the VM, about 20G or more memory will be used.
> >   5. Source side migration configuration commands
> >      a. migrate_set_capability multifd on
> >      b. migrate_set_parameter multifd-channels 2/4/8
> >      c. migrate_set_parameter downtime-limit 300
> >      d. migrate_set_parameter multifd-compression zlib
> >      e. migrate_set_parameter multifd-compression-accel none/qpl
> >      f. migrate_set_parameter max-bandwidth 100G
> >   6. Destination side migration configuration commands
> >      a. migrate_set_capability multifd on
> >      b. migrate_set_parameter multifd-channels 2/4/8
> >      c. migrate_set_parameter multifd-compression zlib
> >      d. migrate_set_parameter multifd-compression-accel none/qpl
> >      e. migrate_set_parameter max-bandwidth 100G
> 
> How is zlib-level setup?  Default (1)?
Yes, level 1, the default level, is used.

> Btw, it seems both zlib/zstd levels are not even working right now to be
> configured.. probably overlooked in migrate_params_apply().
Ok, I will check this.

> > Early migration result, each result is the average of three tests
> >  +--------+-------------+--------+--------+---------+----------+
> >  |        | The number  |total   |downtime|network  |pages per |
> >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> >  |        | and mode    |        |        |(mbps)   |          |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
> >  |        +-------------+--------+--------+---------+----------+
> >  | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
> >  |workload+-------------+--------+--------+---------+----------+
> >  |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |
> 
> The number is slightly confusing to me.  If IAA can send 3x times more
> pages per-second, shouldn't the total migration time 1/3 of the other if
> the guest is idle?  But the total times seem to be pretty close no matter
> N of channels. Maybe I missed something?

This data is read from "info migrate" after the live migration status
changes to "complete".
I think the pages-per-second value is the maximum throughput reached while
the expected downtime and available network bandwidth requirements are met,
not an average over the whole migration.
When the vCPUs are idle, live migration does not run at maximum throughput
for very long.
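
For reference, the gap between the two metrics can be quantified with plain
arithmetic on the idle-workload table quoted above. The short Python sketch
below only divides the quoted numbers; it involves no new measurements, and
ratios() is just an illustrative helper name.

```python
# (channels, zlib total ms, IAA total ms, zlib pages/s, IAA pages/s)
# Values copied from the idle-workload table quoted above.
IDLE_RESULTS = [
    (2, 20647, 17022, 137767, 460289),
    (4, 18835, 16280, 299028, 652456),
    (8, 17379, 15551, 470591, 1315784),
]


def ratios(rows):
    """Return (channels, total-time ratio, pages-per-second ratio)."""
    result = []
    for channels, zlib_ms, iaa_ms, zlib_pps, iaa_pps in rows:
        result.append((channels, zlib_ms / iaa_ms, iaa_pps / zlib_pps))
    return result


for channels, time_ratio, pps_ratio in ratios(IDLE_RESULTS):
    print(f"{channels} chl: total time {time_ratio:.2f}x shorter, "
          f"pages per second {pps_ratio:.2f}x higher")
```

The total time improves by roughly 1.1x to 1.2x, while the reported pages
per second differ by roughly 2.2x to 3.3x, which matches the observation
that the two columns do not scale together.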

> >  +--------+-------------+--------+--------+---------+----------+
> >
> >  +--------+-------------+--------+--------+---------+----------+
> >  |        | The number  |total   |downtime|network  |pages per |
> >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> >  |        | and mode    |        |        |(mbps)   |          |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
> >  |        +-------------+--------+--------+---------+----------+
> >  | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
> >  |workload+-------------+--------+--------+---------+----------+
> >  |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
> >  +--------+-------------+--------+--------+---------+----------+
> 
> The redis results look much more preferred on using IAA comparing to the
> idle tests.  Does it mean that IAA works less good with zero pages in
> general (assuming that'll be the majority in idle test)?
Neither the Idle nor the Redis data shows the best performance of IAA, since
the current implementation is based on multifd packet streaming compression.
In the idle case, most pages are indeed zero pages. Compressing a zero page
is not as efficient as simply detecting it, so the compression advantage is
not reflected.
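
The zero page point can be illustrated with a small software-only sketch in
Python. It uses the standard zlib module rather than IAA, and the
("zero", ...) / ("data", ...) tags are a hypothetical encoding for
illustration, not the multifd wire format.

```python
import zlib

PAGE_SIZE = 4096
ZERO_PAGE = bytes(PAGE_SIZE)


def is_zero_page(page: bytes) -> bool:
    """Detect a zero page with a plain comparison, no compression needed."""
    return page == ZERO_PAGE


def encode_page(page: bytes):
    """Return ("zero", b"") for a zero page, else ("data", compressed)."""
    if is_zero_page(page):
        return ("zero", b"")
    return ("data", zlib.compress(page, 1))


# Compressing a zero page still costs a compression job and produces a
# non-empty payload, while detection needs no payload at all.
compressed_zero = zlib.compress(ZERO_PAGE, 1)
print("compressed zero page bytes:", len(compressed_zero))
print("detected zero page payload:", len(encode_page(ZERO_PAGE)[1]))
```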

> From the manual, I see that IAA also supports encryption/decryption.
> Would it be able to accelerate TLS?
On Sapphire Rapids (SPR) and Emerald Rapids (EMR) Xeon servers, IAA does not
support encryption/decryption. This feature may be available in future
generations.
For TLS acceleration, QAT supports this function on SPR/EMR and has been
used successfully in some scenarios.
https://www.intel.cn/content/www/cn/zh/developer/articles/guide/nginx-https-with-qat-tuning-guide.html

> How should one consider IAA over QAT?  What is the major difference?  I
> see that IAA requires IOMMU scalable mode, why?  Is it because the IAA HW
> is something attached to the pcie bus (assume QAT the same)?

Regarding the difference between using IAA or QAT for compression:
1. IAA is more suitable for 4K compression, and QAT is suitable for large
block data compression. This is determined by the deflate window size. QAT
also supports more compression levels, while the IAA hardware supports only
1 compression level.
2. In terms of throughput, one IAA device provides 4GBps of compression
throughput and 30GBps of decompression throughput. One QAT device provides
20GBps of compression or decompression throughput.
3. The resources available for live migration will also differ depending on
the product type selected by the customer and the deployment.

Regarding the IOMMU scalable mode:
1. The current IAA software stack requires Shared Virtual Memory (SVM)
technology, and SVM depends on IOMMU scalable mode.
2. Both IAA and QAT support the PCIe PASID capability, which allows IAA to
support shared work queues.
https://docs.kernel.org/next/x86/sva.html

> Thanks,
> 
> --
> Peter Xu
