> -----Original Message-----
> From: Peter Xu <pet...@redhat.com>
> Sent: Thursday, July 4, 2024 11:36 PM
> To: Liu, Yuan1 <yuan1....@intel.com>
> Cc: Wang, Yichen <yichen.w...@bytedance.com>; Paolo Bonzini
> <pbonz...@redhat.com>; Daniel P. Berrangé <berra...@redhat.com>; Eduardo
> Habkost <edua...@habkost.net>; Marc-André Lureau
> <marcandre.lur...@redhat.com>; Thomas Huth <th...@redhat.com>; Philippe
> Mathieu-Daudé <phi...@linaro.org>; Fabiano Rosas <faro...@suse.de>; Eric
> Blake <ebl...@redhat.com>; Markus Armbruster <arm...@redhat.com>; Laurent
> Vivier <lviv...@redhat.com>; qemu-devel@nongnu.org; Hao Xiang
> <hao.xi...@linux.dev>; Zou, Nanhai <nanhai....@intel.com>; Ho-Ren (Jack)
> Chuang <horenchu...@bytedance.com>
> Subject: Re: [PATCH v3 0/4] Implement using Intel QAT to offload ZLIB
>
> On Thu, Jul 04, 2024 at 03:15:51AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Peter Xu <pet...@redhat.com>
> > > Sent: Wednesday, July 3, 2024 3:16 AM
> > > To: Wang, Yichen <yichen.w...@bytedance.com>
> > > Cc: Paolo Bonzini <pbonz...@redhat.com>; Daniel P. Berrangé
> > > <berra...@redhat.com>; Eduardo Habkost <edua...@habkost.net>;
> > > Marc-André Lureau <marcandre.lur...@redhat.com>; Thomas Huth
> > > <th...@redhat.com>; Philippe Mathieu-Daudé <phi...@linaro.org>;
> > > Fabiano Rosas <faro...@suse.de>; Eric Blake <ebl...@redhat.com>;
> > > Markus Armbruster <arm...@redhat.com>; Laurent Vivier
> > > <lviv...@redhat.com>; qemu-devel@nongnu.org; Hao Xiang
> > > <hao.xi...@linux.dev>; Liu, Yuan1 <yuan1....@intel.com>; Zou, Nanhai
> > > <nanhai....@intel.com>; Ho-Ren (Jack) Chuang <horenchu...@bytedance.com>
> > > Subject: Re: [PATCH v3 0/4] Implement using Intel QAT to offload ZLIB
> > >
> > > On Thu, Jun 27, 2024 at 03:34:41PM -0700, Yichen Wang wrote:
> > > > v3:
> > > > - Rebase changes on top of master
> > > > - Merge two patches per Fabiano Rosas's comment
> > > > - Add versions into comments and documentations
> > > >
> > > > v2:
> > > > - Rebase changes on top of recent multifd code changes.
> > > > - Use QATzip API 'qzMalloc' and 'qzFree' to allocate QAT buffers.
> > > > - Remove parameter tuning and use QATzip's defaults for better
> > > >   performance.
> > > > - Add parameter to enable QAT software fallback.
> > > >
> > > > v1:
> > > > https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg03761.html
> > > >
> > > > * Performance
> > > >
> > > > We present updated performance results. For circumstantial reasons, v1
> > > > presented performance on a low-bandwidth (1Gbps) network.
> > > >
> > > > Here, we present updated results with a similar setup as before but
> > > > with two main differences:
> > > >
> > > > 1. Our machines have a ~50Gbps connection, tested using 'iperf3'.
> > > > 2. We had a bug in our memory allocation causing us to only use ~1/2
> > > >    of the VM's RAM. Now we properly allocate and fill nearly all of
> > > >    the VM's RAM.
> > > >
> > > > Thus, the test setup is as follows:
> > > >
> > > > We perform multifd live migration over TCP using a VM with 64GB
> > > > memory. We prepare the machine's memory by powering it on, allocating
> > > > a large amount of memory (60GB) as a single buffer, and filling the
> > > > buffer with the repeated contents of the Silesia corpus[0]. This is in
> > > > lieu of a more realistic memory snapshot, which proved troublesome to
> > > > acquire.
> > > >
> > > > We analyze CPU usage by averaging the output of 'top' every second
> > > > during migration.
> > > > This is admittedly imprecise, but we feel that it accurately portrays
> > > > the different degrees of CPU usage of varying compression methods.
> > > >
> > > > We present the latency, throughput, and CPU usage results for all of
> > > > the compression methods, with varying numbers of multifd threads (4,
> > > > 8, and 16).
> > > >
> > > > [0] The Silesia corpus can be accessed here:
> > > > https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> > > >
> > > > ** Results
> > > >
> > > > 4 multifd threads:
> > > >
> > > > |--------|----------|-----------------|----------|----------|
> > > > |method  |time(sec) |throughput(mbps) |send cpu% |recv cpu% |
> > > > |--------|----------|-----------------|----------|----------|
> > > > |qatzip  |    23.13 |         8749.94 |   117.50 |   186.49 |
> > > > |zlib    |   254.35 |          771.87 |   388.20 |   144.40 |
> > > > |zstd    |    54.52 |         3442.59 |   414.59 |   149.77 |
> > > > |none    |    12.45 |        43739.60 |   159.71 |   204.96 |
> > > > |--------|----------|-----------------|----------|----------|
> > > >
> > > > 8 multifd threads:
> > > >
> > > > |--------|----------|-----------------|----------|----------|
> > > > |method  |time(sec) |throughput(mbps) |send cpu% |recv cpu% |
> > > > |--------|----------|-----------------|----------|----------|
> > > > |qatzip  |    16.91 |        12306.52 |   186.37 |   391.84 |
> > > > |zlib    |   130.11 |         1508.89 |   753.86 |   289.35 |
> > > > |zstd    |    27.57 |         6823.23 |   786.83 |   303.80 |
> > > > |none    |    11.82 |        46072.63 |   163.74 |   238.56 |
> > > > |--------|----------|-----------------|----------|----------|
> > > >
> > > > 16 multifd threads:
> > > >
> > > > |--------|----------|-----------------|----------|----------|
> > > > |method  |time(sec) |throughput(mbps) |send cpu% |recv cpu% |
> > > > |--------|----------|-----------------|----------|----------|
> > > > |qatzip  |    18.64 |        11044.52 |   573.61 |   437.65 |
> > > > |zlib    |    66.43 |         2955.79 |  1469.68 |   567.47 |
> > > > |zstd    |    14.17 |        13290.66 |  1504.08 |   615.33 |
> > > > |none    |    16.82 |        32363.26 |   180.74 |   217.17 |
> > > > |--------|----------|-----------------|----------|----------|
> > > >
> > > > ** Observations
> > > >
> > > > - In general, not using compression outperforms using compression in
> > > >   a non-network-bound environment.
> > > > - 'qatzip' outperforms other compression workers with 4 and 8 workers,
> > > >   achieving a ~91% latency reduction over 'zlib' with 4 workers, and a
> > > >   ~58% latency reduction over 'zstd' with 4 workers.
> > > > - 'qatzip' maintains comparable performance with 'zstd' at 16 workers,
> > > >   showing a ~32% increase in latency. This performance difference
> > > >   becomes more noticeable with more workers, as CPU compression is
> > > >   highly parallelizable.
> > > > - 'qatzip' compression uses considerably less CPU than other
> > > >   compression methods. At 8 workers, 'qatzip' demonstrates a ~75%
> > > >   reduction in compression CPU usage compared to 'zstd' and 'zlib'.
> > > > - 'qatzip' decompression CPU usage is less impressive, and is even
> > > >   slightly worse than 'zstd' and 'zlib' CPU usage at 4 and 16 workers.
> > >
> > > Thanks for the results update.
> > >
> > > It looks like the docs/migration/ file is still missing. It'll be great
> > > to have it in the next version or separately.
> > >
> > > So how does it compare with QPL (which got merged already)? Both look
> > > like they are supported on Intel platforms, so a user who wants to
> > > compress the RAM could start to look at both. I'm utterly confused about
> > > why Intel provides these two similar compressors. It would be great to
> > > have some answer, and perhaps put it into the doc.
>
> Yuan,
>
> > I would like to explain some of the reasons why we want to merge the
> > two QAT and IAA solutions into the community.
>
> Yes, this is very helpful information. Please consider putting it into
> the cover letter if there's a repost, and perhaps also in the docs/ files.
>
> > 1. Although Intel Xeon Sapphire Rapids supports both QAT and IAA,
> >    different SKUs support different numbers of QAT and IAA devices, so
> >    some users do not have both QAT and IAA at the same time.
> >
> > 2. QAT products include PCIe cards, which are compatible with older Xeon
> >    products and other non-Xeon products, and some users have already used
> >    QAT cards to accelerate live migration.
>
> Ah, this makes some sense to me.
>
> So a previous question always haunted me: I wondered why a user who bought
> all these fancy and expensive processors with QAT would still prefer not
> to invest in a better network of 50G or more, but stick with ancient 10G
> NICs and switches.
>
> So what you're saying is that, in some old clusters with old chips and old
> network solutions, it's possible that users buy these PCIe cards
> separately, so QAT may help that old infra migrate VMs faster. Is that the
> case?
Yes, users do not add a QAT card just for live migration. Users mainly use
QAT SR-IOV to help cloud users offload compression and encryption.

> If so, we may still want some numbers showing how this performs in a
> network-limited environment, and how that helps users to migrate. Sorry if
> there's some back-and-forth requirement asking for these numbers, but I
> think this is still important information when a user would like to decide
> whether to use these features. Again, putting that into docs/ where
> appropriate would be nice too.

Yes, I will provide some performance data at specific bandwidths
(100Mbps/1Gbps/10Gbps) and add documentation to explain the advantages of
using QAT.

> > 3. In addition to compression, QAT and IAA also support various other
> >    features to better serve different workloads. Here is an introduction
> >    to the accelerators, including usage scenarios of QAT and IAA.
> >    https://www.intel.com/content/dam/www/central-libraries/us/en/documents/2022-12/storage-engines-4th-gen-xeon-brief.pdf
>
> Thanks for the link.
>
> However this doesn't look like a reason to support it in migration? It
> needs to help migration in some form or another, no matter how many
> features it provides.. since migration may not consume them.
>
> Two major (but pure..) questions:
>
> 1) Why high cpu usage?
>
>    I raised this question below [1] too but I didn't yet get an answer.
>    For 8-chan multifd, it's ~390% (QAT) v.s. ~240% (nocomp), even if
>    46Gbps bw for the latter... so when throttled it will be even lower?
>
>    The paper you provided above has this upfront:
>
>      When a CPU can offload storage functions to built-in accelerators,
>      it frees up cores for business-critical workloads...
>
>    Isn't that a major feature to be able to "offload" things? Why isn't
>    the CPU freed even if the offload happened?

Yes, it doesn't make sense; I will check this.

> 2) TLS?
>
>    I think I asked before, I apologize if any of you've already answered
>    and I forgot.. but have any of you looked into offloading TLS (instead
>    of compression) with the QATs?

I'm sorry for not responding to the previous question about TLS. QAT has
many related success cases (for example, OpenSSL):
https://www.intel.com/content/dam/www/public/us/en/documents/solution-briefs/accelerating-openssl-brief.pdf

I will send a separate RFC or patch about this part, because the software
stacks for QAT compression and encryption are independent, so we can
discuss them separately.

> Thanks,
>
> > For users who have both QAT and IAA, we recommend the following for
> > choosing a live migration solution:
> >
> > 1. If the number of QAT devices is equal to or greater than the number
> >    of IAA devices and network bandwidth is limited, it is recommended to
> >    use the QATzip (QAT) solution.
> >
> > 2. In other scenarios, the QPL (IAA) solution can be used first.
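To make the offload path under discussion concrete (the qzMalloc()/qzFree()
buffers from the v2 changelog and the high-CPU-usage question above), here
is a minimal sketch of one compress call through the public QATzip API.
It is illustrative only, based on qatzip.h as publicly documented, and is
not the code from migration/multifd-qatzip.c:

    #include <qatzip.h>

    /*
     * Illustrative sketch only, not code from this series: compress one
     * buffer through QAT with software fallback enabled. On input,
     * *dst_len holds the capacity of dst; on success it holds the
     * compressed size. Negative QATzip status codes are treated as fatal;
     * non-fatal warning codes are ignored for brevity.
     */
    static int compress_one_buffer(unsigned char *src, unsigned int src_len,
                                   unsigned char *dst, unsigned int *dst_len)
    {
        QzSession_T sess = {0};
        QzSessionParams_T params;

        /* sw_backup = 1: fall back to software if no QAT device is usable */
        if (qzInit(&sess, 1) < 0) {
            return -1;
        }

        /* v2 changelog above: parameter tuning removed, QATzip defaults used */
        qzGetDefaults(&params);
        if (qzSetupSession(&sess, &params) < 0) {
            return -1;
        }

        /* last = 1: this buffer forms a complete stream of its own */
        if (qzCompress(&sess, src, &src_len, dst, dst_len, 1) < 0) {
            return -1;
        }

        qzTeardownSession(&sess);
        qzClose(&sess);
        return 0;
    }

In practice the session would be set up once and reused for every buffer
(for example, one session per multifd channel), since qzInit() and
qzSetupSession() are far too expensive to run per page-sized input, and
buffers allocated with qzMalloc() let the device access them directly.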
> > > I am honestly curious too on whether you are planning to use it in
> > > production. It looks like if the network resources are rich, no-comp is
> > > mostly always better than qatzip, no matter on total migration time or
> > > cpu consumption. I'm pretty surprised that it'll take that much
> > > resources even if the work should have been offloaded to the QAT chips
> > > iiuc.
>
> [1]
>
> > > I think it may not be a problem to merge this series even if it
> > > performs slower on some criteria.. but I think we may still want to
> > > know when this should be used, or the good reason this should be merged
> > > (if it's not about it outperforming others).
> > >
> > > Thanks,
> > >
> > > >
> > > > Bryan Zhang (4):
> > > >   meson: Introduce 'qatzip' feature to the build system
> > > >   migration: Add migration parameters for QATzip
> > > >   migration: Introduce 'qatzip' compression method
> > > >   tests/migration: Add integration test for 'qatzip' compression method
> > > >
> > > >  hw/core/qdev-properties-system.c |   6 +-
> > > >  meson.build                      |  10 +
> > > >  meson_options.txt                |   2 +
> > > >  migration/meson.build            |   1 +
> > > >  migration/migration-hmp-cmds.c   |   8 +
> > > >  migration/multifd-qatzip.c       | 382 +++++++++++++++++++++++++++++++
> > > >  migration/multifd.h              |   1 +
> > > >  migration/options.c              |  57 +++++
> > > >  migration/options.h              |   2 +
> > > >  qapi/migration.json              |  38 +++
> > > >  scripts/meson-buildoptions.sh    |   6 +
> > > >  tests/qtest/meson.build          |   4 +
> > > >  tests/qtest/migration-test.c     |  35 +++
> > > >  13 files changed, 551 insertions(+), 1 deletion(-)
> > > >  create mode 100644 migration/multifd-qatzip.c
> > > >
> > > > --
> > > > Yichen Wang
> > >
> > > --
> > > Peter Xu
>
> --
> Peter Xu
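As an aside for readers tracing where migration/multifd-qatzip.c from the
diffstat above plugs in: a multifd compression backend registers a
MultiFDMethods ops table, following the pattern of the existing
migration/multifd-zlib.c. The skeleton below is a sketch of that pattern
only; the hook names reflect current upstream multifd, and the
MULTIFD_COMPRESSION_QATZIP enum value is an assumption based on the
qapi/migration.json change listed above, not code copied from the series.

    /*
     * Sketch modeled on migration/multifd-zlib.c; not the actual code from
     * this series.
     */
    #include "qemu/osdep.h"
    #include "qemu/module.h"
    #include "multifd.h"

    static int qatzip_send_setup(MultiFDSendParams *p, Error **errp)
    {
        /* qzInit()/qzSetupSession(); qzMalloc() the per-channel buffers */
        return 0;
    }

    static void qatzip_send_cleanup(MultiFDSendParams *p, Error **errp)
    {
        /* qzTeardownSession()/qzClose(); qzFree() the buffers */
    }

    static int qatzip_send_prepare(MultiFDSendParams *p, Error **errp)
    {
        /* Gather the queued pages and compress them into the packet iov */
        return 0;
    }

    static int qatzip_recv_setup(MultiFDRecvParams *p, Error **errp)
    {
        /* Mirror of send_setup on the destination side */
        return 0;
    }

    static void qatzip_recv_cleanup(MultiFDRecvParams *p)
    {
        /* Release the decompression session and buffers */
    }

    static int qatzip_recv(MultiFDRecvParams *p, Error **errp)
    {
        /* Decompress the packet payload back into guest pages */
        return 0;
    }

    static MultiFDMethods multifd_qatzip_ops = {
        .send_setup = qatzip_send_setup,
        .send_cleanup = qatzip_send_cleanup,
        .send_prepare = qatzip_send_prepare,
        .recv_setup = qatzip_recv_setup,
        .recv_cleanup = qatzip_recv_cleanup,
        .recv = qatzip_recv,
    };

    static void multifd_qatzip_register(void)
    {
        multifd_register_ops(MULTIFD_COMPRESSION_QATZIP, &multifd_qatzip_ops);
    }
    migration_init(multifd_qatzip_register);

Selecting the method at runtime would then follow the existing multifd
parameters, e.g. setting multifd-compression to 'qatzip' alongside
multifd-channels, assuming that is the QAPI enum name the series adds.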