On Fri, 23 Aug 2019, 0:27 Laine Stump, <la...@redhat.com> wrote:

> (Adding Alex Williamson to Cc so he can correct any mistakes)
>
> On 8/22/19 4:39 PM, Ihar Hrachyshka wrote:
> > On Thu, Aug 22, 2019 at 12:01 PM Laine Stump <la...@redhat.com> wrote:
> >>
> >> On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:
> >>> On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berra...@redhat.com> wrote:
> >>>>
> >>>> On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes
> >>>>> API resources. In this case, libvirtd is running inside an
> >>>>> unprivileged pod, with some host mounts / capabilities added to the
> >>>>> pod, needed by libvirtd and other services.
> >>>>>
> >>>>> One of the capabilities libvirtd requires for successful startup
> >>>>> inside a pod is SYS_RESOURCE. This capability is used to adjust the
> >>>>> RLIMIT_MEMLOCK ulimit value depending on the devices attached to the
> >>>>> managed guest, both on startup and during hotplug. AFAIU the need to
> >>>>> lock the memory is to avoid pages being pushed out from RAM into swap.
> >>
> >> I recall successfully testing GPU assignment from an unprivileged
> >> libvirtd several years ago by setting a high enough ulimit for the uid
> >> used to run libvirtd in advance. (I think we check if the current
> >> setting is high enough, and don't try to set it unless we think we need to.)
> >
> > The PR I linked to in the original email does just that: it starts
> > libvirtd; then, if the domain is going to use VFIO, sets the ulimit of
> > the libvirtd process to the VM memory size + 1Gb (mimicking libvirt code)
> > + 256Mb (to stay conservative) using the prlimit() syscall; then defines
> > the domain.
>
> So you're making an educated guess, which is essentially what libvirt is
> doing (based on advice from other people with better information than
> us, but still a guess).
>
> >> If I understand you correctly, you're saying that in your case it's okay
> >> for the memlock limit to be lower than we try to set it to, because swap
> >> is disabled anyway, is that correct?
> >
> > I'm honestly not exactly sure about the reason why we need to set the
> > limit, but I assume it's because of swap. I can be totally confused on
> > that part though.
>
> What I understand from an IRC conversation with Alex just now is that
> increasing RLIMIT_MEMLOCK isn't done just to prevent any of the pages
> being swapped out. It's done because "all GPAs (Guest Physical
> Addresses) that could potentially be DMA targets need to have fixed
> mappings through the iommu, therefore all need to be allocated and
> mappings fixed [...] setting rlimit allows us to perform all the
> necessary pins within the user's locked memory limit".
>
> So even if swap is disabled, it still needs to be done (either by
> libvirt, or by someone else who has the necessary privileges and control
> over the libvirtd process).
>
> >>>> Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's
> >>>> something in the XML that requires it - one of
> >>>
> >>> You are right, sorry. We add SYS_RESOURCE only for particular domains.
> >>>
> >>>> - hard limit memory value is present
> >>>> - host PCI device passthrough is requested
> >>>
> >>> We are using passthrough
> >>
> >> (If you want to make Alex happy, use the term "VFIO device assignment"
> >> rather than passthrough :-).)
> >
> > Not sure who Alex is but I'll try to make everyone happy!

:)
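(As an aside, to make the prlimit() approach mentioned above concrete for
the archives: below is a rough Go sketch of the shape of that call. It is
not the actual KubeVirt code; the function name and the example pid/guest
size are made up, and only the padding mirrors the guess described above.)

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

// bumpMemlock raises RLIMIT_MEMLOCK of the libvirtd process so that qemu
// can pin all guest RAM for the IOMMU mappings. vmMemoryBytes is the
// configured guest memory size.
func bumpMemlock(libvirtdPid int, vmMemoryBytes uint64) error {
    // VM memory + 1Gb (mimicking libvirt) + 256Mb (to stay conservative).
    limit := vmMemoryBytes + 1<<30 + 256<<20

    var cur unix.Rlimit
    if err := unix.Prlimit(libvirtdPid, unix.RLIMIT_MEMLOCK, nil, &cur); err != nil {
        return fmt.Errorf("reading RLIMIT_MEMLOCK: %v", err)
    }
    if cur.Cur >= limit {
        return nil // already high enough, nothing to do
    }

    // Needs CAP_SYS_RESOURCE in the calling process; that's what lets the
    // libvirtd process itself stay unprivileged.
    newLimit := unix.Rlimit{Cur: limit, Max: limit}
    return unix.Prlimit(libvirtdPid, unix.RLIMIT_MEMLOCK, &newLimit, nil)
}

func main() {
    // Hypothetical values: libvirtd pid 1234, a 4Gb guest.
    if err := bumpMemlock(1234, 4<<30); err != nil {
        fmt.Println("prlimit failed:", err)
    }
}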
> The Alex I'm referring to is the Alex I just Cc'ed. He is the VFIO
> maintainer.
>
> >>> to pass SR-IOV NIC VFs into guests. We also
> >>> plan to do the same for GPUs in the near future.
> >>
> >>>>> I believe we would benefit from one of the following features on
> >>>>> the libvirt side (or both):
> >>>>>
> >>>>> a) expose the memory lock value calculated by libvirtd through the
> >>>>> libvirt ABI so that we can use it when calling prlimit() on the
> >>>>> libvirtd process;
> >>>>> b) allow disabling setrlimit() calls via a libvirtd config file knob
> >>>>> or the domain definition.
> >>
> >> (b) sounds much more reasonable, as long as qemu doesn't complain (I
> >> don't know whether or not it checks)
> >>
> >> Slightly related to this - I'm currently working on patches to avoid
> >> making any ioctl calls that would fail in an unprivileged libvirtd when
> >> using tap/macvtap devices.
This is music to my ears, great to hear.

> >> ATM, I'm doing this by adding an attribute
> >> "unmanaged='yes'" to the interface <target> element. The idea is that if
> >> someone sets unmanaged='yes', they're stating that the caller (i.e.
> >> kubevirt) is responsible for all device setup, and that libvirt should
> >> just use it without further setup. A similar approach could be applied
> >> to hostdev devices - if unmanaged is set, we assume that the caller has
> >> done everything to make the associated device usable.
> >>
> >> (Of course this all makes me realize the inanity of adding a <target
> >> dev='blah' unmanaged='yes'/> for interfaces when hostdevs already have
> >> <hostdev managed='yes'> and <interface type='hostdev' managed='yes'>. So
> >> to prevent setting the locklimit for hostdev, would we make a new
> >> setting like <hostdev managed='no-never-not-even-a-tiny-bit'>? Sigh. I
> >> *hate* trying to make config consistent :-/)

Sounds tough indeed. I'd try to avoid negatively-named knobs: managed=no is
simpler to perceive than unmanaged=yes. It may be just me, but I'd even
assume managed=no if the target dev name is specified. If libvirt manages
the tap device, it should create a fresh one, too. But all of this is a big
digression.

> >> (alternately, we could just automatically fail the attempt to set the
> >> lock limit in a graceful manner and allow the guest to continue)
> >
> > If that's something maintainers feel good about, I am all for it since
> > it simplifies the implementation.
>
> Well, after talking to Alex, I think that since a) libvirt only attempts
> to increase the limit after determining that it isn't already high
> enough, and b) if it isn't high enough and we can't increase it, then
> qemu is going to fail anyway, that c) we can't just fail gracefully and
> continue.
>
> So *somebody* needs to increase the limit, and if you want libvirt to be
> unprivileged, that means it needs to be you doing the increase. And
> since the amount that libvirt increases it is just some number based on
> oral folklore (and not on a specific value we learn by querying
> somewhere), I don't think it's worthwhile figuring out some way for
> libvirt to report it via an official API - that would end up just being
> this:
>
> "Hey, you know that number that you guys are just making a guess about
> based on some advice someone gave you once? Yeah, send me *that* number
> so I can claim to be basing my actions on real science instead of
> slightly educated voodoo! K THX BYE!" :-)

Well, it's more like: "you know that voodoo you do to guess the number? If
you ever educate yourself about it, e.g. by querying qemu, send me *that*
number. I'd rather not think about it ever again, BYE."

> >> BTW, I'm guessing that you use <hostdev> to assign the SRIOV VFs rather
> >> than <interface type='hostdev'>, correct? The latter would require that
> >> you have enough capabilities to set MAC addresses on the VFs (that's the
> >> entire point of using <interface type='hostdev'> instead of plain
> >> <hostdev>)
> >
> > Yes, we use <hostdev> exactly because <interface> sets the MAC address: in
> > the kubevirt scenario, the container that is running libvirtd has its own
> > network namespace and doesn't have access to the PF to set the VF MAC
> > address on. Instead, we rely on the CNI plugin that is running in the root
> > namespace context to configure the VF interface as needed. (I've
> > contributed custom MAC support to the SR-IOV CNI plugin very recently.)
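(To spell out the difference for anyone following along: plain <hostdev>
just hands the VF to the guest, while <interface type='hostdev'>
additionally asks libvirt to program the MAC on the VF through its PF,
which is exactly the access our container doesn't have. A rough domain XML
sketch, with made-up PCI and MAC addresses:)

  <!-- plain <hostdev>: the VF is given to the guest as-is -->
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x3b' slot='0x02' function='0x1'/>
    </source>
  </hostdev>

  <!-- <interface type='hostdev'>: libvirt also sets the MAC on the VF via its PF -->
  <interface type='hostdev' managed='yes'>
    <mac address='52:54:00:6d:90:02'/>
    <source>
      <address type='pci' domain='0x0000' bus='0x3b' slot='0x02' function='0x1'/>
    </source>
  </interface>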
> >
> > Ihar
>
> _______________________________________________
> libvirt-users mailing list
> libvirt-users@redhat.com
> https://www.redhat.com/mailman/listinfo/libvirt-users