On Wed, Jun 21, 2017 at 12:47:27PM +0200, Henning Schild wrote:
> Am Tue, 20 Jun 2017 10:04:30 -0400
> schrieb Luiz Capitulino <lcapitul...@redhat.com>:
> 
> > On Tue, 20 Jun 2017 09:48:23 +0200
> > Henning Schild <henning.sch...@siemens.com> wrote:
> > 
> > > Hi,
> > > 
> > > We are using OpenStack for managing realtime guests. We modified
> > > it and contributed to discussions on how to model the realtime
> > > feature. More recent versions of OpenStack have support for
> > > realtime, and there are a few proposals on how to improve that
> > > further.
> > > 
> > > But there is still no full answer on how to distribute threads
> > > across host-cores. The vcpus are easy, but for the emulation and
> > > io-threads there are multiple options. I would like to collect the
> > > constraints from a qemu/kvm perspective first, and then possibly
> > > influence the OpenStack development.
> > > 
> > > I will put the summary/questions first; the text below provides
> > > more context on where the questions come from.
> > > 
> > > - How do you distribute your threads when reaching the really low
> > >   cyclictest results in the guests? In [3] Rik talked about
> > >   problems like lock holder preemption, starvation etc., but not
> > >   where/how to schedule emulators and io.
> > 
> > We put emulator threads and io-threads on housekeeping cores in
> > the host. I think housekeeping cores is what you're calling
> > best-effort cores; those are non-isolated cores that will run host
> > load.
> 
> As expected, any best-effort/housekeeping core will do, but overlap
> with the vcpu-cores is a bad idea.
> 
> > > - Is it ok to put a vcpu and emulator thread on the same core as
> > >   long as the guest knows about it? Any funny behaving guest, not
> > >   just Linux.
> > 
> > We can't do this for KVM-RT because we run all vcpu threads with
> > FIFO priority.
> 
> Same point as above, meaning the "hw:cpu_realtime_mask" approach is
> wrong for realtime.
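[Editorial note: for concreteness, the placement Luiz describes above (vcpu threads on isolated cores, emulator and io-threads on housekeeping cores) maps to a libvirt domain XML <cputune> block along these lines. This is only a sketch; the cpuset numbers are made up for a host where cores 0-1 are housekeeping and 2-3 are isolated.]

```xml
<cputune>
  <!-- each vcpu pinned to its own isolated host core -->
  <vcpupin vcpu='0' cpuset='2'/>
  <vcpupin vcpu='1' cpuset='3'/>
  <!-- emulator thread(s) confined to the housekeeping cores -->
  <emulatorpin cpuset='0-1'/>
  <!-- io-threads likewise kept off the isolated cores -->
  <iothreadpin iothread='1' cpuset='0-1'/>
</cputune>
```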
> > However, we have another project with DPDK whose goal is to achieve
> > zero-loss networking. The configuration required by this project is
> > very similar to the one required by KVM-RT. One difference though is
> > that we don't use RT and hence don't use FIFO priority.
> > 
> > In this project we've been running with the emulator thread and a
> > vcpu sharing the same core. As long as the guest housekeeping CPUs
> > are idle, we don't get any packet drops (most of the time, what
> > causes packet drops in this test-case would cause spikes in
> > cyclictest). However, we're seeing some packet drops for certain
> > guest workloads which we are still debugging.
> 
> Ok, but that seems to be a different scenario where hw:cpu_policy
> dedicated should be sufficient. However, if the placement of the io
> and emulator threads has to be on a subset of the dedicated cpus,
> something like hw:cpu_realtime_mask would be required.
> 
> > > - Is it ok to make the emulators potentially slow by running them
> > >   on busy best-effort cores, or will they quickly be on the
> > >   critical path if you do more than just cyclictest? Our
> > >   experience says we don't need them reactive, even with
> > >   rt-networking involved.
> > 
> > I believe it is ok.
> 
> Ok.
> 
> > > Our goal is to reach a high packing density of realtime VMs. Our
> > > pragmatic first choice was to run all non-vcpu-threads on a shared
> > > set of pcpus where we also run best-effort VMs and host load.
> > > Now the OpenStack guys are not too happy with that because that is
> > > load outside the assigned resources, which leads to quota and
> > > accounting problems.
> > > 
> > > So the current OpenStack model is to run those threads next to one
> > > or more vcpu-threads. [1] You will need to remember that the vcpus
> > > in question should not be your rt-cpus in the guest. I.e. if vcpu0
> > > shares its pcpu with the hypervisor noise, your preemptrt-guest
> > > would use isolcpus=1.
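[Editorial note: the mitaka-era model referred to here ([1]) is driven by flavor extra specs; a sketch of what that configuration looks like is below. The exact values are illustrative, not taken from the thread: hw:cpu_realtime_mask=^0 excludes vCPU0 from the realtime set, making it the core that absorbs the emulator/io noise.]

```ini
# Illustrative flavor extra specs for the mitaka model [1]:
# all vcpus get dedicated pcpus, vCPU0 is excluded from the
# realtime set so emulator/io load can land next to it.
hw:cpu_policy=dedicated
hw:cpu_realtime=yes
hw:cpu_realtime_mask=^0
```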
> > > 
> > > Is that kind of sharing a pcpu really a good idea? I could imagine
> > > things like smp housekeeping (cache invalidation etc.) eventually
> > > causing vcpu1 to have to wait for the emulator stuck in IO.
> > 
> > Agreed. IIRC, in the beginning of KVM-RT we saw a problem where
> > running vcpu0 on a non-isolated core and without FIFO priority
> > caused spikes in vcpu1. I guess we debugged this down to vcpu1
> > waiting a few dozen microseconds for vcpu0 for some reason. Running
> > vcpu0 on an isolated core with FIFO priority fixed this (again, this
> > was years ago, I don't remember all the details).
> 
> > > Or maybe a busy polling vcpu0 starving its own emulator, causing
> > > high latency or even deadlocks.
> > 
> > This will probably happen if you run vcpu0 with FIFO priority.
> 
> Two more points that indicate that hw:cpu_realtime_mask (putting
> emulators/io next to any vcpu) does not work for general rt.
> 
> > > Even if it happens to work for Linux guests, it seems like a
> > > strong assumption that an rt-guest that has noise cores can deal
> > > with even more noise one scheduling level below.
> > > 
> > > More recent proposals [2] suggest a scheme where the emulator and
> > > io threads are on a separate core. That sounds more reasonable /
> > > conservative but dramatically increases the per-VM cost. And the
> > > pcpus hosting the hypervisor threads will probably be idle most of
> > > the time.
> > 
> > I don't know how to solve this problem. Maybe dedicating only one
> > core to all emulator threads and io-threads of a VM would mitigate
> > this? Of course we'd have to test it to see if this doesn't give
> > spikes.
> 
> [2] suggests exactly that, but it is a waste of pcpus. Say a vcpu
> needs 1.0 cores and all other threads need 0.05 cores. The real need
> of a 1-core rt-vm would be 1.05; for two it would be 2.05.
> With [1] we pack 2.05 onto 2 pcpus, that does not work. With [2] we
> need 3 and waste 0.95.
> 
> > > I guess in this context the most important question is whether
> > > qemu is ever involved in "regular operation" if you avoid the
> > > obvious IO problems on your critical path.
> > > 
> > > My guess is that just [1] has serious hidden latency problems and
> > > [2] is taking it a step too far by wasting whole cores for idle
> > > emulators. We would like to suggest some other way in between,
> > > that is a little easier on the core count. Our current solution
> > > seems to work fine but has the mentioned quota problems.
> > 
> > What is your solution?
> 
> We have a kilo-based prototype that introduced emulator_pin_set in
> nova.conf. All vcpu threads will be scheduled on vcpu_pin_set, and
> emulators and IO of all VMs will share emulator_pin_set.
> vcpu_pin_set contains isolcpus from the host and emulator_pin_set
> contains best-effort cores from the host.
> That basically means you put all emulators and io of all VMs onto a
> set of cores that the host potentially also uses for other stuff.
> Sticking with the made-up numbers from above, all the 0.05s can share
> pcpus.
> 
> With the current implementation in mitaka (hw:cpu_realtime_mask) you
> cannot have a single-core rt-vm, because you cannot put 1.05 into 1
> without overcommitting. You can put 2.05 into 2, but as you confirmed,
> the overcommitted core could still slow down the truly exclusive one.
> On a 4-core host you get a maximum of 1 rt-VM (2-3 cores).
> 
> With [2], which is not implemented yet, the overcommitting is avoided.
> But now you waste a lot of pcpus: 1.05 = 2, 2.05 = 3.
> On a 4-core host you get a maximum of 1 rt-VM (1-2 cores).
> 
> With our approach it might be hard to account for emulator and
> io-threads because they share pcpus. But you do not run into
> overcommitting and don't waste pcpus at the same time.
> On a 4-core host you get a maximum of 3 rt-VMs (1 core each), or 1
> rt-VM (2-3 cores).
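[Editorial note: the accounting argument above can be sketched numerically. This is only a back-of-the-envelope model using the made-up numbers from the mail (1.0 cores per pinned vcpu, 0.05 cores for all emulator/io threads of one VM together); the function names are mine.]

```python
VCPU_NEED = 1.0   # cores one pinned vcpu really needs
EMU_NEED = 0.05   # cores all emulator/io threads of one VM need together

def real_need(vcpus):
    """True resource need of one rt-VM."""
    return vcpus * VCPU_NEED + EMU_NEED

def pcpus_mitaka(vcpus):
    """[1]: emulator/io share a vcpu's pcpu, so granted pcpus < real need."""
    return vcpus

def pcpus_ocata(vcpus):
    """[2]: one extra dedicated pcpu per VM for emulator/io -> mostly idle."""
    return vcpus + 1

def max_vms_shared_pool(host_cores, pool_cores, vcpus_per_vm):
    """Prototype: emulator/io of all VMs share `pool_cores` best-effort
    cores; only the vcpus consume dedicated host cores."""
    return (host_cores - pool_cores) // vcpus_per_vm
```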
I think your solution is good. In the Linux RT context, and as you
mentioned, a non-RT vCPU can acquire a guest kernel lock and then be
preempted by the emulator thread while holding this lock. This
situation blocks the RT vCPUs from doing their work. That is why we
have implemented [2].

For DPDK I don't think we have such problems, because it's running in
userland. So for the DPDK context I think we could have a mask like we
have for RT, basically considering vCPU0 to handle best-effort work
(emulator threads, SSH...). I think that is the current pattern used by
DPDK users.

For RT we have to isolate the emulator threads to an additional pCPU
per guest or, as you are suggesting, to a set of pCPUs shared by all
the running guests. I think we should introduce a new option:

  - hw:cpu_emulator_threads_mask=^1

If set in 'nova.conf', that mask will be applied to the set of all host
CPUs (vcpu_pin_set) to basically pack the emulator threads of all VMs
running there (useful for the RT context).

If set in flavor extra-specs, it will be applied to the vCPUs dedicated
to the guest (useful for the DPDK context).

s.

> Henning
> 
> > > With this mail I am hoping to collect some constraints to derive
> > > a suggestion from. Or maybe collect some information that could be
> > > added to the current blueprints as reasoning/documentation.
> > > 
> > > Sorry if you receive this mail a second time; I was not subscribed
> > > to openstack-dev the first time.
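[Editorial note: the two interpretations of the option proposed above could look as follows. hw:cpu_emulator_threads_mask is only a suggestion made in this thread, not an implemented nova option; the CPU numbers are illustrative.]

```ini
# (a) Set in nova.conf: the mask is applied to vcpu_pin_set, so the
# emulator threads of all VMs are packed onto the non-masked host
# CPUs of that set (RT context).
vcpu_pin_set = 0-3
hw:cpu_emulator_threads_mask = ^1

# (b) Set in flavor extra specs: the mask is applied to the guest's
# dedicated vCPUs, keeping emulator threads off vCPU 1 so it stays
# clean for the polling workload (DPDK context).
hw:cpu_policy = dedicated
hw:cpu_emulator_threads_mask = ^1
```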
> > > 
> > > best regards,
> > > Henning
> > > 
> > > [1]
> > > https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/libvirt-real-time.html
> > > [2]
> > > https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/libvirt-emulator-threads-policy.html
> > > [3]
> > > http://events.linuxfoundation.org/sites/events/files/slides/kvmforum2015-realtimekvm.pdf
> > 
> 
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev