> -----Original Message-----
> From: Juan Quintela <quint...@redhat.com>
> Sent: Monday, October 23, 2023 6:39 PM
> To: Liu, Yuan1 <yuan1....@intel.com>
> Cc: Daniel P.Berrangé <berra...@redhat.com>; Peter Xu <pet...@redhat.com>;
>     faro...@suse.de; leob...@redhat.com; qemu-de...@nongnu.org;
>     Zou, Nanhai <nanhai....@intel.com>
> Subject: Re: [PATCH 0/5] Live Migration Acceleration with IAA Compression
>
> "Liu, Yuan1" <yuan1....@intel.com> wrote:
> >> -----Original Message-----
> >> From: Daniel P. Berrangé <berra...@redhat.com>
> >> Sent: Thursday, October 19, 2023 11:32 PM
> >> To: Peter Xu <pet...@redhat.com>
> >> Cc: Juan Quintela <quint...@redhat.com>; Liu, Yuan1
> >>     <yuan1....@intel.com>; faro...@suse.de; leob...@redhat.com;
> >>     qemu-de...@nongnu.org; Zou, Nanhai <nanhai....@intel.com>
> >> Subject: Re: [PATCH 0/5] Live Migration Acceleration with IAA
> >> Compression
> >>
> >> On Thu, Oct 19, 2023 at 11:23:31AM -0400, Peter Xu wrote:
> >> > On Thu, Oct 19, 2023 at 03:52:14PM +0100, Daniel P. Berrangé wrote:
> >> > > On Thu, Oct 19, 2023 at 01:40:23PM +0200, Juan Quintela wrote:
> >> > > > Yuan Liu <yuan1....@intel.com> wrote:
> >> > > > > Hi,
> >> > > > >
> >> > > > > I am writing to submit a code change aimed at enhancing live
> >> > > > > migration acceleration by leveraging the compression
> >> > > > > capability of the Intel In-Memory Analytics Accelerator (IAA).
> >> > > > >
> >> > > > > Enabling compression functionality during the live migration
> >> > > > > process can enhance performance, thereby reducing downtime
> >> > > > > and network bandwidth requirements. However, this improvement
> >> > > > > comes at the cost of additional CPU resources, posing a
> >> > > > > challenge for cloud service providers in terms of resource
> >> > > > > allocation. To address this challenge, I have focused on
> >> > > > > offloading the compression overhead to the IAA hardware,
> >> > > > > resulting in performance gains.
> >> > > > >
> >> > > > > The implementation of the IAA (de)compression code is based
> >> > > > > on Intel Query Processing Library (QPL), an open-source
> >> > > > > software project designed for IAA high-level software programming.
> >> > > > >
> >> > > > > Best regards,
> >> > > > > Yuan Liu
> >> > > >
> >> > > > After reviewing the patches:
> >> > > >
> >> > > > - why are you doing this on top of old compression code, that is
> >> > > >   obsolete, deprecated and buggy
> > Some users have not enabled the multifd feature yet, but they will
> > decide whether to enable the compression feature based on the load
> > situation. So I'm wondering if, without multifd, the compression
> > functionality will no longer be available?
>
> Next pull request will deprecate it. So in two versions it is going to
> be gone.
>
> >> > > > - why are you not doing it on top of multifd.
>
> > I plan to submit the support for multifd independently because the
> > multifd compression and legacy compression code are separate.
>
> compression code is really buggy. I think you should not even try to work on
> top of it.

Sure, I will focus on multifd compression in the future.

> > I looked at the code of multifd about compression. Currently, it uses
> > the CPU synchronous compression mode. Since it is best to use the
> > asynchronous processing method of the hardware accelerator, I would
> > like to get suggestions on the asynchronous implementation.
>
> I did that on a previous comment.
> Several questions:
>
> - you are using zlib, right? When I tested, the longer streams you
>   have, the better compression you get. right?
>   Is there a way to "continue" with the state of the previous job?
>
>   Old compression code generates a new context for every packet.
>   Multifd generates a new zlib context for each connection.

Sorry, I'm not familiar with zlib development. In most cases, the longer
the input data, the higher the compression ratio; one reason is that
longer data can be encoded more efficiently. Deflate compression has two
phases, LZ77 and Huffman coding. As far as I know, zlib can use either a
static or a dynamic Huffman table: the former gives higher throughput and
the latter a higher compression ratio, but the user cannot specify a
Huffman table. IAA supports this: it has a mode (canned mode) in which
compression uses a user-generated Huffman table to improve the
compression ratio. This table can also be created by analyzing the input
data with the QPL library.

> > 1. Dirty page scanning and compression pipeline processing, the main
> > thread of live migration submits compression tasks to the hardware,
> > and multifd threads only handle the transmission of compressed pages.
> > 2. Data sending and compression pipeline processing, the Multifd
> > threads submit compression tasks to the hardware and then transmit the
> > compressed data. (A multifd thread job may need to transmit compressed
> > data multiple times.)
> >
> >> > > > You just need to add another compression method on top of multifd.
> >> > > > See how it was done for zstd:
> > Yes, I will refer to zstd to implement multifd compression with IAA
>
> Basically you can use two approaches here (simplifying a lot)
> - for each channel
>   submit job (512KB)
>   wait for job
>   send compressed stuff
>   And you adjust the number of channels depending on how much
>   concurrency you want.
>
>
> - for each channel
>   submit job
>   while (number_of_jobs_submitted > some_threshold)
>       wait_for_job
>   send job
>   Here you need to piggy back in the MULTIFD_FLAG_SYNC to wait for the
>   rest of jobs.
>
> Each one has its advantages/disadvantages. With the 1st, it is simpler to do,
> because it is for all effects synchronous, and simpler to "contain" the
> concurrency.
>
> With the second approach you get much more concurrency, but you need to be
> careful about how much stuff you have in flight.
>
> Remember that you get queues for each multifd channel.
> How many asynchronous jobs (around 512KB each packet) can current
> hardware handle? I mean what is the optimal number, around 10, around 50,
> around 100?

Thank you very much for your detailed explanation, I will modify it
accordingly.

> >> > > I'm not sure that is ideal approach. IIUC, the IAA/QPL library
> >> > > is not defining a new compression format. Rather it is providing
> >> > > a hardware accelerator for 'deflate' format, as can be made
> >> > > compatible with zlib:
> >> > >
> >> > >
> >> > > https://intel.github.io/qpl/documentation/dev_guide_docs/c_use_cases/deflate/c_deflate_zlib_gzip.html#zlib-and-gzip-compatibility-reference-link
> >> > >
> >> > > With multifd we already have a 'zlib' compression format, and so
> >> > > this IAA/QPL logic would effectively just be providing a second
> >> > > implementation of zlib.
> >> > >
> >> > > Given the use of a standard format, I would expect to be able to
> >> > > use software zlib on the src, mixed with IAA/QPL zlib on the
> >> > > target, or vice versa.
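
For reference, the second approach Juan outlines above (keep submitting, and only wait once too many jobs are in flight) could be sketched roughly as follows. This is only an illustration in Python, not QEMU code: a thread pool running zlib.compress stands in for the asynchronous hardware jobs, and send_pipelined, MAX_IN_FLIGHT and the send callback are hypothetical names. The final drain loop plays the role of the MULTIFD_FLAG_SYNC point.

```python
import zlib
from collections import deque
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 4  # hypothetical threshold; the right value depends on the hardware


def send_pipelined(packets, send, executor, max_in_flight=MAX_IN_FLIGHT):
    """Submit compression jobs without blocking, bounding the jobs in flight.

    zlib.compress on a worker thread is an assumption standing in for an
    asynchronous accelerator job.
    """
    in_flight = deque()
    for packet in packets:
        # Submit the job and move on without waiting for it.
        in_flight.append(executor.submit(zlib.compress, packet))
        # Only block once more than max_in_flight jobs are outstanding.
        while len(in_flight) > max_in_flight:
            send(in_flight.popleft().result())
    # Sync point: wait for the remaining jobs and send them in order.
    while in_flight:
        send(in_flight.popleft().result())


if __name__ == "__main__":
    pages = [bytes([i]) * 4096 for i in range(16)]
    out = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        send_pipelined(pages, out.append, pool)
    print(len(out), "compressed packets sent")
```

The first approach is the special case where every job is waited for right after it is submitted, so concurrency comes only from the number of channels.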
> >> > >
> >> > > IOW, rather than defining a new compression format for this, I
> >> > > think we could look at a new migration parameter for
> >> > >
> >> > > "compression-accelerator": ["auto", "none", "qpl"]
> >> > >
> >> > > with 'auto' the default, such that we can automatically enable
> >> > > IAA/QPL when 'zlib' format is requested, if running on a suitable
> >> > > host.
> >> >
> >> > I was also curious about the format of compression comparing to
> >> > software ones when reading.
> >> >
> >> > Would there be a use case that one would prefer soft compression
> >> > even if hardware accelerator existed, no matter on src/dst?
> >> >
> >> > I'm wondering whether we can avoid that one more parameter but
> >> > always use hardware accelerations as long as possible.
> > I want to add a new compression format (QPL or IAA-Deflate) here. The
> > reasons are as follows:
> > 1. The QPL library already supports both software and hardware paths
> > for compression.
>
> The question is if IAA-Deflate is compatible with zlib-deflate.
> What are the advantages of QPL software implementation vs zlib?
> - Is it faster?
> - Does it use less resources?

Yes, the QPL software path is much faster than zlib. It is based on
ISA-L (https://github.com/intel/isa-l), which is fully compatible with
zlib and has several times the throughput of zlib.

> > The software path uses a fast Deflate compression algorithm, while the
> > hardware path uses IAA.
>
> Is it faster than zlib?
> And doing all of this asynchronous job dance is not going to be slower than
> just calling the functions in a software implementation?

Yes, basically using the asynchronous method will increase the latency.
I will do some tests based on the multifd solution and give a reply later.

> > 2. QPL's software and hardware paths are based on the Deflate
> > algorithm, but there is a limitation: the history buffer only supports
> > 4K.
> > The default history buffer for zlib is 32K, which means that IAA
> > cannot decompress zlib-compressed data. However, zlib can decompress
> > IAA-compressed data.
>
> Aha. Thanks, that was what we wanted to know.
>
> > 3. For zlib and zstd, Intel QuickAssist Technology can accelerate both of
> > them.
>
> Do we have any numbers that we could look at?
> We are interested in three things:
> - how much faster is it
> - how much cpu is saved using IAA
> - how much latency does it add

Sure, I will provide this data following the next version.

> >> Yeah, I did wonder about whether we could avoid a parameter, but then
> >> I'm thinking it is good to have an escape hatch if we were to find
> >> any flaws in the QPL library's impl of deflate() that caused interop
> >> problems.
> >>
> >> With regards,
> >> Daniel
> >> --
> >> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> >> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> >> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
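
To illustrate the one-way compatibility described in point 2 above, here is a small sketch that uses stock zlib's window-size parameter (wbits) as a stand-in for the 4K history limit. It does not use QPL or IAA at all, so it only demonstrates the general Deflate window behaviour; the helper names deflate and try_inflate are hypothetical.

```python
import zlib

WBITS_4K = 12   # 2**12 = 4 KiB history window (stand-in for the IAA limit)
WBITS_32K = 15  # 2**15 = 32 KiB, zlib's default window


def deflate(data, wbits):
    """Compress data as a zlib stream using the given history window size."""
    comp = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, wbits)
    return comp.compress(data) + comp.flush()


def try_inflate(blob, wbits):
    """Return the decompressed bytes, or None if this window size rejects the stream."""
    try:
        return zlib.decompressobj(wbits).decompress(blob)
    except zlib.error:
        return None


page = bytes(range(256)) * 64  # 16 KiB of sample data

small_window = deflate(page, WBITS_4K)
large_window = deflate(page, WBITS_32K)

# A decoder with a 32 KiB window accepts data produced with a 4 KiB window.
print(try_inflate(small_window, WBITS_32K) == page)

# A decoder limited to 4 KiB rejects a stream that declares a 32 KiB window.
print(try_inflate(large_window, WBITS_4K) is None)
```

In this sketch the rejection comes from the zlib stream header, which advertises the window size used by the compressor. With raw Deflate streams there is no such header, and the failure would instead show up when a back-reference reaches further than the decoder's window.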