> -----Original Message-----
> From: Peter Xu <pet...@redhat.com>
> Sent: Thursday, July 4, 2024 11:36 PM
> To: Liu, Yuan1 <yuan1....@intel.com>
> Cc: Wang, Yichen <yichen.w...@bytedance.com>; Paolo Bonzini
> <pbonz...@redhat.com>; Daniel P. Berrangé <berra...@redhat.com>; Eduardo
> Habkost <edua...@habkost.net>; Marc-André Lureau
> <marcandre.lur...@redhat.com>; Thomas Huth <th...@redhat.com>; Philippe
> Mathieu-Daudé <phi...@linaro.org>; Fabiano Rosas <faro...@suse.de>; Eric
> Blake <ebl...@redhat.com>; Markus Armbruster <arm...@redhat.com>; Laurent
> Vivier <lviv...@redhat.com>; qemu-devel@nongnu.org; Hao Xiang
> <hao.xi...@linux.dev>; Zou, Nanhai <nanhai....@intel.com>; Ho-Ren (Jack)
> Chuang <horenchu...@bytedance.com>
> Subject: Re: [PATCH v3 0/4] Implement using Intel QAT to offload ZLIB
>
> On Thu, Jul 04, 2024 at 03:15:51AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Peter Xu <pet...@redhat.com>
> > > Sent: Wednesday, July 3, 2024 3:16 AM
> > > To: Wang, Yichen <yichen.w...@bytedance.com>
> > > Cc: Paolo Bonzini <pbonz...@redhat.com>; Daniel P. Berrangé
> > > <berra...@redhat.com>; Eduardo Habkost <edua...@habkost.net>;
> > > Marc-André Lureau <marcandre.lur...@redhat.com>; Thomas Huth
> > > <th...@redhat.com>; Philippe Mathieu-Daudé <phi...@linaro.org>;
> > > Fabiano Rosas <faro...@suse.de>; Eric Blake <ebl...@redhat.com>;
> > > Markus Armbruster <arm...@redhat.com>; Laurent Vivier
> > > <lviv...@redhat.com>; qemu-devel@nongnu.org; Hao Xiang
> > > <hao.xi...@linux.dev>; Liu, Yuan1 <yuan1....@intel.com>; Zou, Nanhai
> > > <nanhai....@intel.com>; Ho-Ren (Jack) Chuang <horenchu...@bytedance.com>
> > > Subject: Re: [PATCH v3 0/4] Implement using Intel QAT to offload ZLIB
> > >
> > > On Thu, Jun 27, 2024 at 03:34:41PM -0700, Yichen Wang wrote:
> > > > v3:
> > > > - Rebase changes on top of master
> > > > - Merge two patches per Fabiano Rosas's comment
> > > > - Add versions into comments and documentations
> > > >
> > > > v2:
> > > > - Rebase changes on top of recent multifd code changes.
> > > > - Use QATzip API 'qzMalloc' and 'qzFree' to allocate QAT buffers.
> > > > - Remove parameter tuning and use QATzip's defaults for better
> > > >   performance.
> > > > - Add parameter to enable QAT software fallback.
> > > >
> > > > v1:
> > > > https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg03761.html
> > > >
> > > > * Performance
> > > >
> > > > We present updated performance results. For circumstantial reasons, v1
> > > > presented performance on a low-bandwidth (1Gbps) network.
> > > >
> > > > Here, we present updated results with a similar setup as before but
> > > > with two main differences:
> > > >
> > > > 1. Our machines have a ~50Gbps connection, tested using 'iperf3'.
> > > > 2. We had a bug in our memory allocation causing us to only use ~1/2
> > > >    of the VM's RAM. Now we properly allocate and fill nearly all of
> > > >    the VM's RAM.
> > > >
> > > > Thus, the test setup is as follows:
> > > >
> > > > We perform multifd live migration over TCP using a VM with 64GB
> > > > memory. We prepare the machine's memory by powering it on, allocating
> > > > a large amount of memory (60GB) as a single buffer, and filling the
> > > > buffer with the repeated contents of the Silesia corpus[0]. This is in
> > > > lieu of a more realistic memory snapshot, which proved troublesome to
> > > > acquire.
> > > >
> > > > We analyze CPU usage by averaging the output of 'top' every second
> > > > during migration.
> > > > This is admittedly imprecise, but we feel that it accurately portrays
> > > > the different degrees of CPU usage of varying compression methods.
> > > >
> > > > We present the latency, throughput, and CPU usage results for all of
> > > > the compression methods, with varying numbers of multifd threads (4,
> > > > 8, and 16).
> > > >
> > > > [0] The Silesia corpus can be accessed here:
> > > > https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> > > >
> > > > ** Results
> > > >
> > > > 4 multifd threads:
> > > >
> > > > |--------|----------|-----------------|----------|----------|
> > > > |method  |time(sec) |throughput(mbps) |send cpu% |recv cpu% |
> > > > |--------|----------|-----------------|----------|----------|
> > > > |qatzip  |    23.13 |         8749.94 |   117.50 |   186.49 |
> > > > |zlib    |   254.35 |          771.87 |   388.20 |   144.40 |
> > > > |zstd    |    54.52 |         3442.59 |   414.59 |   149.77 |
> > > > |none    |    12.45 |        43739.60 |   159.71 |   204.96 |
> > > > |--------|----------|-----------------|----------|----------|
> > > >
> > > > 8 multifd threads:
> > > >
> > > > |--------|----------|-----------------|----------|----------|
> > > > |method  |time(sec) |throughput(mbps) |send cpu% |recv cpu% |
> > > > |--------|----------|-----------------|----------|----------|
> > > > |qatzip  |    16.91 |        12306.52 |   186.37 |   391.84 |
> > > > |zlib    |   130.11 |         1508.89 |   753.86 |   289.35 |
> > > > |zstd    |    27.57 |         6823.23 |   786.83 |   303.80 |
> > > > |none    |    11.82 |        46072.63 |   163.74 |   238.56 |
> > > > |--------|----------|-----------------|----------|----------|
> > > >
> > > > 16 multifd threads:
> > > >
> > > > |--------|----------|-----------------|----------|----------|
> > > > |method  |time(sec) |throughput(mbps) |send cpu% |recv cpu% |
> > > > |--------|----------|-----------------|----------|----------|
> > > > |qatzip  |    18.64 |        11044.52 |   573.61 |   437.65 |
> > > > |zlib    |    66.43 |         2955.79 |  1469.68 |   567.47 |
> > > > |zstd    |    14.17 |        13290.66 |  1504.08 |   615.33 |
> > > > |none    |    16.82 |        32363.26 |   180.74 |   217.17 |
> > > > |--------|----------|-----------------|----------|----------|
> > > >
> > > > ** Observations
> > > >
> > > > - In general, not using compression outperforms using compression in
> > > >   a non-network-bound environment.
> > > > - 'qatzip' outperforms other compression workers with 4 and 8 workers,
> > > >   achieving a ~91% latency reduction over 'zlib' with 4 workers, and a
> > > >   ~58% latency reduction over 'zstd' with 4 workers.
> > > > - 'qatzip' maintains comparable performance with 'zstd' at 16 workers,
> > > >   showing a ~32% increase in latency. This performance difference
> > > >   becomes more noticeable with more workers, as CPU compression is
> > > >   highly parallelizable.
> > > > - 'qatzip' compression uses considerably less CPU than other
> > > >   compression methods. At 8 workers, 'qatzip' demonstrates a ~75%
> > > >   reduction in compression CPU usage compared to 'zstd' and 'zlib'.
> > > > - 'qatzip' decompression CPU usage is less impressive, and is even
> > > >   slightly worse than 'zstd' and 'zlib' CPU usage at 4 and 16 workers.
> > >
> > > Thanks for the results update.
> > >
> > > It looks like the docs/migration/ file is still missing. It'll be great
> > > to have it in the next version or separately.
> > >
> > > So how does it compare with QPL (which got merged already)? Both look
> > > like they are supported on Intel platforms, so a user who wants to
> > > compress the RAM could start to look at both. I'm utterly confused about
> > > why Intel provides these two similar compressors. It would be great to
> > > have some answer, and perhaps put it into the doc.
>
> Yuan,
>
> > I would like to explain some of the reasons why we want to merge the
> > two QAT and IAA solutions into the community.
>
> Yes, this is very helpful information. Please consider putting it into
> the cover letter if there's a repost, and perhaps also in the docs/ files.
>
> > 1. Although Intel Xeon Sapphire Rapids supports both QAT and IAA,
> >    different SKUs support different numbers of QAT and IAA devices, so
> >    some users do not have both QAT and IAA at the same time.
> >
> > 2. QAT products include PCIe cards, which are compatible with older Xeon
> >    products and other non-Xeon products, and some users have already used
> >    QAT cards to accelerate live migration.
>
> Ah, this makes some sense to me.
>
> So a previous question always haunted me: I wondered why a user who bought
> all these fancy and expensive processors with QAT would still prefer not
> to invest in a better network of 50G or more, but stick with ancient 10G
> NICs and switches.
>
> So what you're saying is that, in some old clusters with old chips and old
> network solutions, it's possible that users buy these PCIe cards
> separately, so QAT may help that old infra migrate VMs faster. Is that the
> case?
Yes, users do not add a QAT card just for live migration. Users mainly use
QAT SR-IOV to help cloud users offload compression and encryption.

> If so, we may still want some numbers showing how this performs in a
> network-limited environment, and how that helps users to migrate. Sorry if
> there's some back-and-forth requirement asking for these numbers, but I
> think this is still important information when a user would like to decide
> whether to use these features. Again, putting that into docs/ where
> appropriate would be nice too.

Yes, I will provide some performance data at specific bandwidths
(100Mbps/1Gbps/10Gbps) and add documentation to explain the advantages of
using QAT.

> > 3. In addition to compression, QAT and IAA also support various other
> >    features to better serve different workloads. Here is an introduction
> >    to the accelerators, including usage scenarios of QAT and IAA.
> >    https://www.intel.com/content/dam/www/central-libraries/us/en/documents/2022-12/storage-engines-4th-gen-xeon-brief.pdf
>
> Thanks for the link.
>
> However this doesn't look like a reason to support it in migration? It
> needs to help migration in some form or another, no matter how many
> features it provides.. since migration may not consume them.
>
> Two major (but pure..) questions:
>
> 1) Why high cpu usage?
>
>    I raised this question below [1] too but I didn't yet get an answer.
>    For 8-chan multifd, it's ~390% (QAT) v.s. ~240% (nocomp), even if
>    46Gbps bw for the latter... so when throttled it will be even lower?
>
>    The paper you provided above has this upfront:
>
>      When a CPU can offload storage functions to built-in accelerators,
>      it frees up cores for business-critical workloads...
>
>    Isn't that a major feature to be able to "offload" things? Why isn't
>    the CPU freed even if the offload happened?

Yes, it doesn't make sense; I will check this.

> 2) TLS?
>
>    I think I asked before, I apologize if any of you've already answered
>    and I forgot.. but have any of you looked into offloading TLS (instead
>    of compression) with the QATs?

I'm sorry for not responding to the previous question about TLS. QAT has
many related success cases (for example, OpenSSL):
https://www.intel.com/content/dam/www/public/us/en/documents/solution-briefs/accelerating-openssl-brief.pdf

I will send a separate RFC or patch about this part, because the software
stacks for QAT compression and encryption are independent, so we can
discuss them separately.

> Thanks,
>
> > For users who have both QAT and IAA, we recommend the following for
> > choosing a live migration solution:
> >
> > 1. If the number of QAT devices is equal to or greater than the number
> >    of IAA devices and network bandwidth is limited, it is recommended to
> >    use the QATzip (QAT) solution.
> >
> > 2. In other scenarios, the QPL (IAA) solution can be used first.
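To make the offload path under discussion concrete (the qzMalloc()/qzFree()
buffers from the v2 changelog and the high-CPU-usage question above), here
is a minimal sketch of one compress call through the public QATzip API.
It is illustrative only, based on qatzip.h as publicly documented, and is
not the code from migration/multifd-qatzip.c:

    #include <qatzip.h>

    /*
     * Illustrative sketch only, not code from this series: compress one
     * buffer through QAT with software fallback enabled. On input,
     * *dst_len holds the capacity of dst; on success it holds the
     * compressed size. Negative QATzip status codes are treated as fatal;
     * non-fatal warning codes are ignored for brevity.
     */
    static int compress_one_buffer(unsigned char *src, unsigned int src_len,
                                   unsigned char *dst, unsigned int *dst_len)
    {
        QzSession_T sess = {0};
        QzSessionParams_T params;

        /* sw_backup = 1: fall back to software if no QAT device is usable */
        if (qzInit(&sess, 1) < 0) {
            return -1;
        }

        /* v2 changelog above: parameter tuning removed, QATzip defaults used */
        qzGetDefaults(&params);
        if (qzSetupSession(&sess, &params) < 0) {
            return -1;
        }

        /* last = 1: this buffer forms a complete stream of its own */
        if (qzCompress(&sess, src, &src_len, dst, dst_len, 1) < 0) {
            return -1;
        }

        qzTeardownSession(&sess);
        qzClose(&sess);
        return 0;
    }

In practice the session would be set up once and reused for every buffer
(for example, one session per multifd channel), since qzInit() and
qzSetupSession() are far too expensive to run per page-sized input, and
buffers allocated with qzMalloc() let the device access them directly.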
> > > I am honestly curious too on whether you are planning to use it in
> > > production. It looks like if the network resources are rich, no-comp is
> > > mostly always better than qatzip, no matter on total migration time or
> > > cpu consumption. I'm pretty surprised that it'll take that much
> > > resources even if the work should have been offloaded to the QAT chips
> > > iiuc.
>
> [1]
>
> > > I think it may not be a problem to merge this series even if it
> > > performs slower on some criteria.. but I think we may still want to
> > > know when this should be used, or the good reason this should be merged
> > > (if it's not about it outperforming others).
> > >
> > > Thanks,
> > >
> > > >
> > > > Bryan Zhang (4):
> > > >   meson: Introduce 'qatzip' feature to the build system
> > > >   migration: Add migration parameters for QATzip
> > > >   migration: Introduce 'qatzip' compression method
> > > >   tests/migration: Add integration test for 'qatzip' compression method
> > > >
> > > >  hw/core/qdev-properties-system.c |   6 +-
> > > >  meson.build                      |  10 +
> > > >  meson_options.txt                |   2 +
> > > >  migration/meson.build            |   1 +
> > > >  migration/migration-hmp-cmds.c   |   8 +
> > > >  migration/multifd-qatzip.c       | 382 +++++++++++++++++++++++++++++++
> > > >  migration/multifd.h              |   1 +
> > > >  migration/options.c              |  57 +++++
> > > >  migration/options.h              |   2 +
> > > >  qapi/migration.json              |  38 +++
> > > >  scripts/meson-buildoptions.sh    |   6 +
> > > >  tests/qtest/meson.build          |   4 +
> > > >  tests/qtest/migration-test.c     |  35 +++
> > > >  13 files changed, 551 insertions(+), 1 deletion(-)
> > > >  create mode 100644 migration/multifd-qatzip.c
> > > >
> > > > --
> > > > Yichen Wang
> > >
> > > --
> > > Peter Xu
>
> --
> Peter Xu
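As an aside for readers tracing where migration/multifd-qatzip.c from the
diffstat above plugs in: a multifd compression backend registers a
MultiFDMethods ops table, following the pattern of the existing
migration/multifd-zlib.c. The skeleton below is a sketch of that pattern
only; the hook names reflect current upstream multifd, and the
MULTIFD_COMPRESSION_QATZIP enum value is an assumption based on the
qapi/migration.json change listed above, not code copied from the series.

    /*
     * Sketch modeled on migration/multifd-zlib.c; not the actual code from
     * this series.
     */
    #include "qemu/osdep.h"
    #include "qemu/module.h"
    #include "multifd.h"

    static int qatzip_send_setup(MultiFDSendParams *p, Error **errp)
    {
        /* qzInit()/qzSetupSession(); qzMalloc() the per-channel buffers */
        return 0;
    }

    static void qatzip_send_cleanup(MultiFDSendParams *p, Error **errp)
    {
        /* qzTeardownSession()/qzClose(); qzFree() the buffers */
    }

    static int qatzip_send_prepare(MultiFDSendParams *p, Error **errp)
    {
        /* Gather the queued pages and compress them into the packet iov */
        return 0;
    }

    static int qatzip_recv_setup(MultiFDRecvParams *p, Error **errp)
    {
        /* Mirror of send_setup on the destination side */
        return 0;
    }

    static void qatzip_recv_cleanup(MultiFDRecvParams *p)
    {
        /* Release the decompression session and buffers */
    }

    static int qatzip_recv(MultiFDRecvParams *p, Error **errp)
    {
        /* Decompress the packet payload back into guest pages */
        return 0;
    }

    static MultiFDMethods multifd_qatzip_ops = {
        .send_setup = qatzip_send_setup,
        .send_cleanup = qatzip_send_cleanup,
        .send_prepare = qatzip_send_prepare,
        .recv_setup = qatzip_recv_setup,
        .recv_cleanup = qatzip_recv_cleanup,
        .recv = qatzip_recv,
    };

    static void multifd_qatzip_register(void)
    {
        multifd_register_ops(MULTIFD_COMPRESSION_QATZIP, &multifd_qatzip_ops);
    }
    migration_init(multifd_qatzip_register);

Selecting the method at runtime would then follow the existing multifd
parameters, e.g. setting multifd-compression to 'qatzip' alongside
multifd-channels, assuming that is the QAPI enum name the series adds.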