On 2.10.2024 23:25, Peter Xu wrote:
On Wed, Oct 02, 2024 at 10:11:33PM +0200, Maciej S. Szmigiero wrote:
On 1.10.2024 23:30, Peter Xu wrote:
On Tue, Oct 01, 2024 at 10:41:14PM +0200, Maciej S. Szmigiero wrote:
On 30.09.2024 23:57, Peter Xu wrote:
On Mon, Sep 30, 2024 at 09:25:54PM +0200, Maciej S. Szmigiero wrote:
On 27.09.2024 02:53, Peter Xu wrote:
On Fri, Sep 27, 2024 at 12:34:31AM +0200, Maciej S. Szmigiero wrote:
On 20.09.2024 18:45, Peter Xu wrote:
On Fri, Sep 20, 2024 at 05:23:08PM +0200, Maciej S. Szmigiero wrote:
On 19.09.2024 23:11, Peter Xu wrote:
On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
On 9.09.2024 22:03, Peter Xu wrote:
On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com>

load_finish SaveVMHandler allows migration code to poll whether
a device-specific asynchronous device state loading operation has finished.

In order to avoid calling this handler needlessly the device is supposed
to notify the migration code of its possible readiness via a call to
qemu_loadvm_load_finish_ready_broadcast() while holding
qemu_loadvm_load_finish_ready_lock.
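
For illustration only, here is a minimal sketch of how a device's asynchronous loading code might use this mechanism; the lock/unlock helper names and the device-side state are assumptions, not necessarily the actual API:

/* Illustrative sketch: a device-side loading thread marking its
 * asynchronous load as finished and waking up the migration code. */
static void my_dev_async_load_complete(MyDevLoadState *s)
{
    /* Assumed helpers taking/releasing qemu_loadvm_load_finish_ready_lock. */
    qemu_loadvm_load_finish_ready_lock();
    s->load_done = true;    /* later reported by the load_finish handler */
    qemu_loadvm_load_finish_ready_broadcast();
    qemu_loadvm_load_finish_ready_unlock();
}

The corresponding load_finish SaveVMHandler would then simply check s->load_done (under the same lock) and report whether loading has completed.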

Signed-off-by: Maciej S. Szmigiero <maciej.szmigi...@oracle.com>
---
(..)
As I wrote above, the kernel side of things is being taken care of by
the mlx5 driver maintainers.

And these performance numbers suggest that there isn't some global lock
serializing all device accesses, as otherwise it would quickly become
the bottleneck and we would be seeing diminishing returns from
increased VF count rather than continued improvement.

Personally I am not satisfied with the scaling in these numbers..

   1VF       2VFs      4VFs
   274 ms -> 434 ms -> 1068 ms

The lock doesn't need to be as blunt as a global lock that all ioctl()s
take, and it might not be something obvious that we can easily see.  It can
hide internally; it may not be in the form of a lock at all.

1068 is almost 4x of 274 here; that's really not scalable at all, even if it
is an improvement for sure..  I still feel like something is off.  If you
think the kernel isn't the bottleneck, I am actually more curious about why,
especially if that could be relevant to the qemu design.


These are 4 VFs of a single PF NIC, so it's not only the kernel driver
that's involved here but also the whole physical device itself.

Without the userspace/QEMU side being parallelized it was hard to even
measure the driver/device-side bottlenecks.

However, even with the current state of things we still get a nice 67%
improvement in downtime.

As I wrote yesterday, AFAIK it is a WIP also on the mlx5/device side of
things.

(..)

      - How qemu_loadvm_load_finish_ready_broadcast() interacts with all
        above..

So if you really think it matters to load first whichever VFIO device's
iterable data is ready, then let's try to come up with some better
interface..  I can try to think about it too, but please answer my
questions above so I can understand what I am missing on why that's
important.  Numbers could help, even with 4 VFs, and I wonder how much diff
there can be.  Mostly, I don't know why it's slow right now, if it is; I
thought it should be pretty fast, at least not a concern in the VFIO migration
world (which can take seconds of downtime or more..).

IOW, it sounds more reasonable to me that no matter whether VFIO will
support multifd, it'll be nice if we stick with vfio_load_state() /
vfio_save_state() for config space, and hopefully it's also easier if it
always goes via the main channel for everyone.  In these two hooks, VFIO can
do whatever it wants to sync with other things (on src, sync with the
concurrent thread pool saving iterable data and dumping things to multifd
channels; on dst, sync with the concurrent multifd loads). I think it can
remove the requirement on the load_finish() interface completely.  Yes,
this can only load VFIO's pci config space one by one, but I think this is
much simpler, and I hope it's also not that slow, but I'm not sure.
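
For what it's worth, the kind of dst-side synchronization suggested here could conceptually look like the sketch below; every helper other than the existing config-load hook concept is a hypothetical placeholder, not existing VFIO code:

/* Hypothetical sketch only: a dst-side config-space load hook that first
 * waits for the device's multifd-carried iterable state to finish loading,
 * then reads the config data arriving via the main channel as usual. */
static int vfio_load_config_synced(QEMUFile *f, VFIODevice *vbasedev)
{
    /* Block until this device's multifd loading thread is done
     * (hypothetical helper; would be implemented inside VFIO). */
    vfio_wait_for_multifd_state_loaded(vbasedev);

    /* Then load the config space data sent via the main channel
     * (hypothetical helper standing in for the existing config load path). */
    return vfio_load_config_from_main_channel(f, vbasedev);
}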

To be clear, I made the following diagram describing how the patch set
is supposed to work right now, including changing the per-device
VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE into a common MIG_CMD_SWITCHOVER.

Time flows on it left to right (->).

----------- DIAGRAM START -----------
Source overall flow:
Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable
Multifd channels:                                       \ multifd device state read and queue (1) -> multifd config data read and queue (1) /

Target overall flow:
Main channel: live VM phase data -> MIG_CMD_SWITCHOVER -> iterable -> non iterable -> config data load operations
Multifd channels:                                       \ multifd device state (1) -> multifd config data read (1)

Target config data load operations flow:
multifd config data read (1) -> config data load (2)

Notes:
(1): per device threads running in parallel

I raised this question here before, but I'll ask again: do you think we can
avoid using a separate thread on dest qemu, and reuse the multifd recv
threads instead?

Src probably needs its own threads because the multifd sender threads take
requests, so it can't block on its own.

However, dest qemu isn't like that; it's packet driven, so I think maybe it's
OK for VFIO to directly load the data in the multifd threads.  We may want to
have enough multifd threads to make sure IO still doesn't block much on the
NIC, but I think tuning the number of multifd threads should work in this
case.

We need to have the receiving threads decoupled from the VFIO device state
loading threads at least because otherwise:
1) You can have a deadlock if device state for multiple devices arrives
out of order, like here:

Time flows left to right (->).
Multifd channel 1: (VFIO device 1 buffer 2) (VFIO device 2 buffer 1)
Multifd channel 2: (VFIO device 2 buffer 2) (VFIO device 1 buffer 1)

Both channel receive/load threads would be stuck forever in this case,
since they can't load buffer 2 for devices 1 and 2 until they load
buffer 1 for each of these devices.

2) If devices are loading buffers at different speeds you don't want
to block the faster device from receiving a new buffer just because
the slower one hasn't finished its loading yet.

I don't see why it can't be avoided.  Let me draw this in columns.

How I picture this is:

     multifd recv thread 1                      multifd recv thread 2
     ---------------------                      ---------------------
     recv VFIO device 1 buffer 2                recv VFIO device 2 buffer 2
      -> found that (dev1, buf1) missing,        -> found that (dev2, buf1) missing,
         skip load                                  skip load
     recv VFIO device 2 buffer 1                recv VFIO device 1 buffer 1
      -> found that (dev2, buf1+buf2) ready,     -> found that (dev1, buf1+buf2) ready,
         load buf1+2 for dev2 here                  load buf1+2 for dev1 here
Here, right after one multifd thread receives a buffer, the buffer needs to be
injected into the cache array (with proper locking), so that whichever thread
receives a full series of those buffers will do the load (again, with proper locking..).

Would this not work?
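
If I understand the idea right, a rough sketch of it could look like the following; all structures and helper names here are hypothetical, not existing QEMU code:

/* Hypothetical per-device cache of received-but-not-yet-loaded buffers.
 * Assumes "qemu/osdep.h" and "qemu/thread.h" for QemuMutex. */
#define MAX_BUFS 64                   /* arbitrary bound for the sketch */

typedef struct DevLoadCache {
    QemuMutex lock;
    void *buf[MAX_BUFS];              /* received buffers by buffer index */
    size_t len[MAX_BUFS];
    unsigned int next_to_load;        /* lowest index not yet loaded */
} DevLoadCache;

/* Called directly from a multifd recv thread for each incoming buffer. */
static void cache_and_maybe_load(DevLoadCache *c, unsigned int idx,
                                 void *data, size_t len)
{
    qemu_mutex_lock(&c->lock);
    c->buf[idx] = data;
    c->len[idx] = len;

    /* Whichever thread completes a consecutive run of buffers loads them,
     * in order.  (A real version would drop the lock around the blocking
     * device write so other recv threads aren't held up.) */
    while (c->next_to_load < MAX_BUFS && c->buf[c->next_to_load]) {
        unsigned int i = c->next_to_load++;

        load_one_buffer_into_device(c->buf[i], c->len[i]);  /* hypothetical */
        g_free(c->buf[i]);
        c->buf[i] = NULL;
    }
    qemu_mutex_unlock(&c->lock);
}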


For sure, but that's definitely more complicated logic than just having
a simple per-device loading thread that naturally loads incoming buffers
for that device in order.

I thought it was mostly your logic that was implemented, but yeah, I didn't
check too many details on the VFIO side.

That thread isn't even in the purview of the migration code since
it's a VFIO driver internal implementation detail.

And we'd still lose parallelism if it happens that the two buffers that
are to be loaded next for two devices arrive in the same
multifd channel:
Multifd channel 1: (VFIO device 1 buffer 1) (VFIO device 2 buffer 1)
Multifd channel 2: (VFIO device 2 buffer 2) (VFIO device 1 buffer 2)

Now device 2 buffer 1 has to wait until loading device 1 buffer 1
finishes, even though with the decoupled loading thread implementation
from this patch set these would be loaded in parallel.

Well, it's possible indeed, but with normally 8 or more threads being there,
the possibility of having such a dependency is low.

Cedric had a similar comment about starting simple with the thread model.
I'd still suggest that, if at all possible, we try to reuse the multifd recv
threads; I do expect the results should be similar.

I am sorry to ask for this, Fabiano already blames me for this, but..
logically it'll be best if we use no new thread in the series, then one patch
on top with your new thread solution to justify its performance benefits
and show it's worthwhile to have those threads at all.

To be clear, these loading threads are mostly blocking I/O threads, NOT
compute threads.
This means that the usual "rule of thumb" that the count of threads should
not exceed the total number of logical CPUs does NOT apply to them.

They are similar to what glibc uses under the hood to simulate POSIX AIO
(aio_read(), aio_write()) and to implement an async DNS resolver (getaddrinfo_a()),
and to what GLib's GIO uses to simulate its own async file operations.
Using helper threads to turn blocking I/O into "AIO" is a pretty common
thing.
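
In that spirit, each loading thread is conceptually just a loop of blocking waits, along the lines of the simplified sketch below (all types and helpers are hypothetical, not the actual patch code):

/* Simplified, hypothetical sketch of a per-device loading thread: it either
 * sleeps on a condition variable waiting for the next in-order buffer, or
 * sleeps in a blocking write to the device, so it uses almost no CPU. */
static void *dev_load_thread_fn(void *opaque)
{
    DevLoadQueue *q = opaque;                     /* hypothetical queue type */

    for (;;) {
        DevStateBuffer *b;

        qemu_mutex_lock(&q->lock);
        while (!(b = dequeue_next_in_order(q)) && !q->end_of_stream) {
            qemu_cond_wait(&q->cond, &q->lock);   /* waiting for the recv side */
        }
        qemu_mutex_unlock(&q->lock);

        if (!b) {
            break;                                /* no more buffers coming */
        }

        /* Blocking write of the buffer into the VFIO device (hypothetical
         * helper; in practice a write() loop on the device data fd). */
        dev_write_all(q->device_fd, b->data, b->len);
        dev_state_buffer_free(b);
    }
    return NULL;
}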

To show that these loading threads spend most of their time sleeping (waiting
for I/O), I made a quick patch at [1] tracing how much time they spend waiting
for incoming buffers and how much time they spend waiting for these buffers
to be loaded into the device.

The results (without patch [2] described later) are like this:
5919@1727974993.403280:vfio_load_state_device_buffer_start  (0000:af:00.2)
5921@1727974993.407932:vfio_load_state_device_buffer_start  (0000:af:00.4)
5922@1727974993.407964:vfio_load_state_device_buffer_start  (0000:af:00.5)
5920@1727974993.408480:vfio_load_state_device_buffer_start  (0000:af:00.3)
5920@1727974993.666843:vfio_load_state_device_buffer_end  (0000:af:00.3) wait 43 ms load 217 ms
5921@1727974993.686005:vfio_load_state_device_buffer_end  (0000:af:00.4) wait 75 ms load 206 ms
5919@1727974993.686054:vfio_load_state_device_buffer_end  (0000:af:00.2) wait 69 ms load 210 ms
5922@1727974993.689919:vfio_load_state_device_buffer_end  (0000:af:00.5) wait 79 ms load 204 ms

Summing up:
0000:af:00.2 total loading time 283 ms, wait 69 ms load 210 ms
0000:af:00.3 total loading time 258 ms, wait 43 ms load 217 ms
0000:af:00.4 total loading time 278 ms, wait 75 ms load 206 ms
0000:af:00.5 total loading time 282 ms, wait 79 ms load 204 ms

In other words, these threads spend ~100% of their total runtime waiting
for I/O, 70%-75% of that time waiting for buffers to get loaded into their
target device.

So having more threads here won't negatively affect the host CPU
consumption since these threads barely use the host CPU at all.
Also, their count is capped at the number of VFIO devices in the VM.

I also did a quick test with the same config as usual (4 VFs, 6 multifd
channels), but with the patch at [2] simulating forced coupling of the
loading threads to the multifd receive channel threads.

With this patch, the load_state_buffer() handler returns to the multifd
channel thread only when the loading thread has finished loading the available
buffers and is about to wait for the next buffers to arrive - just as
loading buffers directly from these channel threads would do.
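
In other words, with that patch the hand-off becomes roughly equivalent to the following (illustrative sketch only; names hypothetical, not the actual patch code):

/* Illustrative only: load_state_buffer() queues the buffer for the loading
 * thread, then blocks until that thread has drained everything it can load
 * and is about to sleep again - mimicking a load done directly from the
 * multifd channel thread. */
static int load_state_buffer_coupled(DevLoadQueue *q, DevStateBuffer *b)
{
    qemu_mutex_lock(&q->lock);
    enqueue_buffer(q, b);                  /* hypothetical */
    qemu_cond_signal(&q->cond);            /* wake the loading thread */
    while (!loading_thread_idle(q)) {      /* hypothetical predicate */
        qemu_cond_wait(&q->drained_cond, &q->lock);
    }
    qemu_mutex_unlock(&q->lock);
    return 0;
}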

The resulting lowest downtime from 115 live migration runs was 1295 ms -
that's 21% worse than the 1068 ms of downtime with these loading threads
running on their own.

I expect this performance penalty to get even worse with more than 4 VFs.

So no, we can't load buffers directly from multifd channel receive threads.

PS: I'd suggest that if you really need those threads, they should still be
managed by the migration framework, like the src thread pool.  Sorry, I'm
pretty stubborn on this, especially after noticing that we gained the
query-migrationthreads API just recently.. even if now I'm not sure whether
we should remove that API.  I assume that shouldn't need much change, even
if necessary.

I can certainly make these loading threads managed in a thread pool if that's
easier for you.

Thanks,
Maciej

[1]: https://github.com/maciejsszmigiero/qemu/commit/b0833053359715c604070f64fc058f90ec61d180
[2]: https://github.com/maciejsszmigiero/qemu/commit/0c9b4072eaebf8e7bd9560dd27a14cd048097565

