A while ago there was a conversation on the #vfio-users irc channel about how to use cpuset/pinning to get the best latency and performance. I said I would run some tests and eventually did. Writing up the results took a lot of time and there are some more tests I want to run to verify them, but I don't have time for that now, so I'll just post what I've concluded instead. First some theory.
Latency in a virtual environment has many different causes:

* There is latency in the hardware/bios, like system management interrupts.
* The host operating system introduces some latency. This is often because the host won't schedule the VM when it wants to run.
* The emulator adds some latency because of things like nested page tables and the handling of virtual hardware.
* The guest OS introduces its own latency when the workload wants to run but the guest scheduler won't schedule it.

Points 1 and 4 are latencies you get even on bare metal, but points 2 and 3 are extra latency caused by the virtualisation. This post is mostly about reducing the latency of point 2.

I assume you are already familiar with how this is usually done. By using cpuset you can reserve some cores for exclusive use by the VM and put all system processes on a separate housekeeping core. This allows the VM to run whenever it wants, which is good for latency, but the downside is that the VM can't use the housekeeping core, so performance is reduced.

By running pstree -p while the VM is running you get output like this:

...
─qemu-system-x86(4995)─┬─{CPU 0/KVM}(5004)
                       ├─{CPU 1/KVM}(5005)
                       ├─{CPU 2/KVM}(5006)
                       ├─{CPU 3/KVM}(5007)
                       ├─{CPU 4/KVM}(5008)
                       ├─{CPU 5/KVM}(5009)
                       ├─{qemu-system-x86}(4996)
                       ├─{qemu-system-x86}(5012)
                       ├─{qemu-system-x86}(5013)
                       ├─{worker}(5765)
                       └─{worker}(5766)

Qemu spawns a bunch of threads for different things. The "CPU #/KVM" threads run the actual guest code and there is one for each virtual cpu; I call them "VM threads" from here on. The qemu-system-x86 threads are used to emulate virtual hardware and are called the emulator in libvirt terminology; I call them "emulator threads". The worker threads are probably what libvirt calls iothreads, but I treat them the same as the emulator threads and refer to both as "emulator threads".

My cpu is an i7-4790K with 4 hyper-threaded cores for a total of 8 logical cores. A lot of people here probably have something similar. Take a look in /proc/cpuinfo to see how it's laid out. I number my cores like cpuinfo: physical cores 0-3 and logical cores 0-7, where pcore 0 corresponds to lcore 0,4, pcore 1 to lcore 1,5 and so on.

The goal is to partition the system processes, VM threads and emulator threads over these 8 lcores to get good latency and acceptable performance, but to do that I need a way to measure latency. Mainline kernel 4.9 got a new latency tracer called hwlat. It's designed to measure hardware latencies like SMI, but if you run it in a VM you get all latencies below the guest (points 1-3 above). Hwlat bypasses the normal cpu scheduler so it won't measure any latency from the guest scheduler (point 4). It basically makes it possible to focus on just the VM related latencies.
https://lwn.net/Articles/703129/

We should perhaps also discuss how much latency is too much. That's up for debate, but the windows DPC latency checker lists 500us as green, 1000us as yellow and 2000us as red. If a game runs at 60fps it has a deadline of 16.7ms to render a frame. I'll just decide that 1ms (1000us) is the upper limit for what I can tolerate.
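For reference, hwlat is driven through the ftrace interface. A minimal sketch of how a run inside the guest could look, assuming debugfs is mounted in the usual place and using the default width/window values (the post doesn't show the exact settings used):

  # run inside the guest, kernel >= 4.9
  cd /sys/kernel/debug/tracing
  echo 0 > tracing_on
  echo hwlat > current_tracer          # select the hwlat tracer
  echo 10 > tracing_thresh            # report gaps larger than 10us
  echo 500000 > hwlat_detector/width   # spin for 0.5s ...
  echo 1000000 > hwlat_detector/window # ... out of every 1s window
  echo 1 > tracing_on
  sleep 1800                           # 30 min, like the runs below
  cat trace                            # one entry per detected latency

The mean/stdev/max numbers in the tables below can then be computed from the entries in the trace buffer.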
One of the consequences of how hwlat works is that it also fails to notice a lot of the point 3 type latencies. Most of the latency in point 3 is caused by vm-exits: the guest does something the hardware virtualisation can't handle and has to rely on kvm or qemu to emulate the behaviour. This is a lot slower than real hardware, but it mostly only happens when the guest tries to access hardware resources, so I'll call it io-latency. The hwlat tracer only sits and spins in kernel space and never touches any hardware by itself. Since hwlat doesn't trigger vm-exits it can't measure the latencies they cause either, so it would be good to have something else that could.

The way I rigged things up is to set the virtual disk controller to ahci, which I know has to be emulated by qemu. I then added a ram block device from /dev/ram* to the VM as a virtual disk. I can then run the fio disk benchmark in the VM on that disk to trigger vm-exits and get a report on the latency from fio. It's not a good solution but it's the best I could come up with.
http://freecode.com/projects/fio

=== Low latency setup ===

Let's finally get down to business. The first setup I tried is configured for minimum latency at the expense of performance. The virtual cpu in this setup has 3 cores and no HT. The VM threads are pinned to lcore 1,2,3 and the emulator threads to lcore 5,6,7. That leaves pcore 0, which is dedicated to the host using cpuset. Here is the layout in libvirt xml:

<vcpupin vcpu='0' cpuset='1'/>
<vcpupin vcpu='1' cpuset='2'/>
<vcpupin vcpu='2' cpuset='3'/>
<emulatorpin cpuset='5-7'/>
<topology sockets='1' cores='3' threads='1'/>

And here are the results of hwlat (all hwlat tests run for 30 min each). I used a synthetic load to test how the latencies changed under load; the program stress served as synthetic load on both guest and host (stress --vm 1 --io 1 --cpu 8 --hdd 1).

                      mean     stdev     max(us)
host idle, VM idle:   17.2778  15.6788    70
host load, VM idle:   21.4856  20.1409    72
host idle, VM load:   19.7144  18.9321   103
host load, VM load:   21.8189  21.2839   139

As you can see, load on the host makes little difference to the latency; the cpuset isolation works well. The slight increase of the mean might be caused by reduced memory bandwidth. Putting the VM under load increases the latency a bit. This might seem odd, since the idea of using hwlat was to bypass the guest scheduler and thereby make the latency independent of what is running in the guest. What is probably happening is that the "--hdd" part of stress accesses the disk, which makes the emulator threads run. They are pinned to the HT siblings of the VM threads and thereby slightly impact their latency. Overall the latency is very good in this setup.

Here is the result of the io-latency test with fio:

fio (us): min=40, max=1306, avg=52.81, stdev=12.60, iops=18454

Since the emulator threads run mostly isolated on their own siblings, this result must be considered good.
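For completeness, the io-latency numbers come from fio runs in the guest against the emulated ahci disk backed by /dev/ram*. The post doesn't give the exact job, but an invocation along these lines would produce the min/max/avg/stdev and iops figures quoted (the device name, block size and runtime are my assumptions; on a windows guest the filename would be the raw physical drive instead):

  fio --name=ahci-latency --filename=/dev/sdb --direct=1 \
      --ioengine=sync --rw=randread --bs=4k \
      --runtime=600 --time_based

fio then reports the per-operation latency statistics (min/max/avg/stdev in us) and the iops for the job.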
By using realtime pri on the VM threads I should be able to out-preempt the kernel threads and get lower latency. In this test I used the same setup as above but used schedtool to set round-robin pri 1 on all VM related threads.

                      mean     stdev     max(us)
host idle, VM idle:   17.6511  15.3028    61
host load, VM idle:   20.2400  19.6558    57
host idle, VM load:   18.9244  18.8119   108
host load, VM load:   20.4228  21.0749   122

The result is mostly the same. The few remaining kthreads that I can't disable or migrate apparently don't make much difference to the latency.

=== Balanced setup, emulator with VM threads ===

3 cores isn't a lot these days and some games, like Mad Max and Rise of the Tomb Raider, max out the cpu in the low latency setup, which results in big frame drops when it happens. The setup below with a virtual 2 core HT cpu would probably give ok latency, but the addition of hyper-threading usually only gives 25-50% extra performance for real world workloads, so this setup would generally be slower than the low latency setup. I didn't bother to test it.

<vcpupin vcpu='0' cpuset='2'/>
<vcpupin vcpu='1' cpuset='6'/>
<vcpupin vcpu='2' cpuset='3'/>
<vcpupin vcpu='3' cpuset='7'/>
<emulatorpin cpuset='1,5'/>
<topology sockets='1' cores='2' threads='2'/>

To get better performance I need at least a virtual 3 core HT cpu, but if the host uses pcore 0 and the VM threads use pcore 1-3, where will the emulator threads run? I could overallocate the system by having the emulator threads compete with the VM threads, or I could overallocate it by having the emulator threads compete with the host processes. Let's try running the emulator with the VM threads first.

<vcpupin vcpu='0' cpuset='1'/>
<vcpupin vcpu='1' cpuset='5'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='6'/>
<vcpupin vcpu='4' cpuset='3'/>
<vcpupin vcpu='5' cpuset='7'/>
<emulatorpin cpuset='1-3,5-7'/>
<topology sockets='1' cores='3' threads='2'/>

The odd ordering of vcpupin is needed because Intel cpus lay out HT siblings as lcore[01234567] = pcore[01230123], but qemu lays out the virtual cpu as lcore[012345] = pcore[001122]. To get a 1:1 mapping I have to order them like that.

                      mean       stdev      max(us)
host idle, VM idle:    17.4906    15.1180      89
host load, VM idle:    22.7317    19.5327      95
host idle, VM load:    82.3694   329.6875    9458
host load, VM load:   141.2461  1170.5207   20757

The result is really bad. It works ok as long as the VM is idle, but as soon as it's under load I get bad latencies. The reason is likely that the stressor accesses the disk, which activates the emulator, and in this setup the emulator can preempt the VM threads. We can check if this is the case by running stress without "--hdd".

                                  mean     stdev     max(us)
host load, VM load (no --hdd):    57.4728  138.8211   1345

The latency is reduced quite a bit but it's still high. It's likely still the emulator threads preempting the VM threads; accessing the disk is just one of many things the VM can do to activate the emulator.

fio (us): min=41, max=7348, avg=62.17, stdev=14.99, iops=15715

The io latency is also a lot worse than in the low latency setup. The reason is that the VM threads can preempt the emulator threads while they are emulating the disk drive.

=== Balanced setup, emulator with host ===

Pairing up the emulator threads and VM threads was a bad idea, so let's try running the emulator on the core reserved for the host. Since the VM threads run by themselves in this setup we would expect good hwlat latency, but the emulator threads can be preempted by host processes, so io latency might suffer.
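The post doesn't show the xml for this variant, but based on the description it would presumably keep the vcpu pinning from the previous setup and only move the emulator onto pcore 0 with the host; the emulatorpin value below is my assumption:

<vcpupin vcpu='0' cpuset='1'/>
<vcpupin vcpu='1' cpuset='5'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='6'/>
<vcpupin vcpu='4' cpuset='3'/>
<vcpupin vcpu='5' cpuset='7'/>
<emulatorpin cpuset='0,4'/>
<topology sockets='1' cores='3' threads='2'/>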
Let's start by looking at the io latency.

fio (us): min=40, max=46852, avg=61.55, stdev=250.90, iops=15893

Yup, massive io latency. Here is a situation where realtime pri could help: if the emulator threads get realtime pri they can out-preempt the host processes. Let's try that.

fio (us): min=38, max=2640, avg=53.72, stdev=13.61, iops=18140

That's better, but not as good as the low latency setup where the emulator threads got their own lcores. To reduce the latency even more we could try to split pcore 0 in two and run the host processes on lcore 0 and the emulator threads on lcore 4, although this doesn't leave much cpu for the emulator (or the host).

fio (us): min=44, max=1192, avg=56.07, stdev=8.52, iops=17377

The max io latency now decreased to the same level as the low latency setup. Unfortunately the number of iops also decreased a bit (down 5.8% compared to the low latency setup). I'm guessing this is because the emulator threads don't get as much cpu time in this setup.

                      mean     stdev     max(us)
host idle, VM idle:   18.3933  15.5901   106
host load, VM idle:   20.2006  18.8932    77
host idle, VM load:   23.1694  22.4301   110
host load, VM load:   23.2572  23.7288   120

The hwlat latency is comparable to the low latency setup, so this setup gives a good latency/performance trade-off.

=== Max performance setup ===

If 3 cores with HT isn't enough I suggest you give up, but for comparison let's see what happens if we mirror the host cpu in the VM. Now there is no room at all for the emulator or the host processes, so I let them schedule freely.

<vcpupin vcpu='0' cpuset='0'/>
<vcpupin vcpu='1' cpuset='4'/>
<vcpupin vcpu='2' cpuset='1'/>
<vcpupin vcpu='3' cpuset='5'/>
<vcpupin vcpu='4' cpuset='2'/>
<vcpupin vcpu='5' cpuset='6'/>
<vcpupin vcpu='6' cpuset='3'/>
<vcpupin vcpu='7' cpuset='7'/>
<emulatorpin cpuset='0-7'/>
<topology sockets='1' cores='4' threads='2'/>

                      mean       stdev      max(us)
host idle, VM idle:    185.4200   839.7908    6311
host load, VM idle:   3835.9333  7836.5902   97234
host idle, VM load:   1891.4300  3873.9165   31015
host load, VM load:   8459.2550  6437.6621   51665

fio (us): min=48, max=112484, avg=90.41, stdev=355.10, iops=10845

I only ran these tests for 10 min each; that's all that was needed. As you can see it's terrible. I'm afraid that many people probably run a setup similar to this. I ran like this myself for a while until I switched to libvirt and started looking into pinning. Realtime pri would probably help a lot here, but realtime in this configuration is potentially dangerous: workloads in the guest could starve the host and, depending on how the guest gets its input, a reset using the hardware reset button could be needed to get the system back.

=== Testing with games ===

I want low latency for gaming, so it would make sense to test the setups with games. This turns out to be kind of tricky: games are complicated and interpreting the results can be hard.

https://i.imgur.com/NIrXnkt.png

As an example, here is a percentile plot of the frametimes in the built-in benchmark of Rise of the Tomb Raider, taken with fraps. The performance and balanced setups look about the same at lower percentiles, but the low latency setup is a lot lower. This means that the low latency setup, which is the weakest in terms of cpu power, got a higher frame rate for some parts of the benchmark. This doesn't make sense at first. It only starts to make sense if I pay attention to the benchmark while it's running: Rise of the Tomb Raider loads in a lot of geometry dynamically and the low latency setup can't keep up.
It has bad pop-in of textures and objects, so the scene the gpu renders is less complicated than in the other setups. A less complicated scene results in a higher frame rate. An odd, counter-intuitive result.

Overall, the performance and balanced setups have the same percentile curve at lower percentiles in every game I tested. This tells me that the balanced setup has enough cpu power for all the games I've tried. They only differ at higher percentiles due to latency induced frame drops. The performance setup always has the worst max frametime in every game, so there is no reason to use it over the balanced setup. The performance setup also has crackling sound in several games over hdmi audio, even with MSI enabled. Which setup gets the lowest max frametime depends on the workload: if the game maxes out the cpu of the low latency setup its max frametime will be worse than the balanced setup; if not, the low latency setup has the best latency.

=== Conclusion ===

The balanced setup (emulator with host) doesn't have the best latency in every workload, but I haven't found any workload where it performs poorly in regards to max latency, io latency or available cpu power. Even in those workloads where another setup performed better, the balanced setup was always close. If you are too lazy to switch setups depending on the workload, use the balanced setup as the default configuration. If your cpu isn't a 4 core with HT, finding the best setup for your cpu is left as an exercise for the reader.

=== Future work ===

https://vfio.blogspot.se/2016/10/how-to-improve-performance-in-windows-7.html

This was a nice trick for forcing win7 to use the TSC. Just one problem: it turns out it doesn't work if hyper-threading is enabled. Any time I use a virtual cpu with threads='2', win7 reverts to using acpi_pm. I've spent a lot of time trying to work around the problem but failed. I don't even know why hyper-threading would make a difference for the TSC; Microsoft's documentation is amazingly unhelpful. Even when the guest is hammering the acpi_pm timer the balanced setup gives better performance than the low latency setup, but I'm afraid the reduced resolution and extra indeterminism of the acpi_pm timer might cause other problems. This is only a problem in win7 because modern versions of windows should use hypervclock. I've read somewhere that it might be possible to modify OVMF to work around the bug in win7 that prevents hyperv from working. With that modification it might be possible to use hypervclock in win7. Perhaps I'll look into that in the future. In the meantime I'll stick with the balanced setup despite the use of acpi_pm.

_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users