On Thu, Dec 19, 2019 at 12:33:15PM +0000, Felipe Franciosi wrote:
> Hello,
>
> (I've added Jim and Ben from the SPDK team to the thread.)
>
> > On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
> >
> > On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:
> >>> On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <stefa...@redhat.com> wrote:
> >>> On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:
> >>>>> On 16 Dec 2019, at 20:47, Elena Ufimtseva <elena.ufimts...@oracle.com>
> >>>>> wrote:
> >>>>> On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:
> >>> Questions I've seen when discussing muser with people have been:
> >>>
> >>> 1. Can unprivileged containers create muser devices? If not, this is a
> >>>    blocker for use cases that want to avoid root privileges entirely.
> >>
> >> Yes you can. Muser device creation follows the same process as general
> >> mdev device creation (ie. you write to a sysfs path). That creates an
> >> entry in /dev/vfio and the control plane can further drop privileges
> >> there (set selinux contexts, &c.)
> >
> > In this case there is still a privileged step during setup. What about
> > completely unprivileged scenarios like a regular user without root or a
> > rootless container?
>
> Oh, I see what you are saying. I suppose we need to investigate
> adjusting the privileges of the sysfs path correctly beforehand to
> allow devices to be created by non-root users. The credentials used on
> creation should be reflected on the vfio endpoint (ie. /dev/vfio/<group>).
>
> I need to look into that and get back to you.
>
> >
> >>> 2. Does muser need to be in the kernel (e.g. slower to develop/ship,
> >>>    security reasons)? A similar library could be implemented in
> >>>    userspace along the lines of the vhost-user protocol. Although VMMs
> >>>    would then need to use a new libmuser-client library instead of
> >>>    reusing their VFIO code to access the device.
> >>
> >> Doing it in userspace was the flow we proposed back in last year's KVM
> >> Forum (Edinburgh), but it got turned down. That's why we pursued the
> >> kernel approach, which turned out to have some advantages:
> >> - No changes needed to Qemu
> >> - No Qemu needed at all for userspace drivers
> >> - Device emulation process restart is trivial
> >>   (it therefore makes device code upgrades much easier)
> >>
> >> Having said that, nothing stops us from enhancing libmuser to talk
> >> directly to Qemu (for the Qemu case). I envision at least two ways of
> >> doing that:
> >> - Hooking up libmuser with Qemu directly (eg. over a unix socket)
> >> - Hooking Qemu with CUSE and implementing the muser.ko interface
> >>
> >> For the latter, libmuser would talk to a character device just like it
> >> talks to the vfio character device. We "just" need to implement that
> >> backend in Qemu. :)
> >
> > What about:
> > * libmuser's API stays mostly unchanged but the library speaks a
> >   VFIO-over-UNIX domain sockets protocol instead of talking to
> >   mdev/vfio in the host kernel.
>
> As I said above, there are advantages to the kernel model. The key one
> is transparent device emulation restarts. Today, muser.ko keeps the
> "device memory" internally in a prefix tree. Upon restart, a new
> device emulator can recover state (eg. from a state file in /dev/shm
> or similar) and remap the same memory that is already configured to
> the guest via Qemu. We have a pending work item for muser.ko to also
> keep the eventfds so we can recover those, too. Another advantage is
> working with any userspace driver and not requiring a VMM at all.
>
> If done entirely in userspace, the device emulator needs to allocate
> the device memory somewhere that remains accessible (eg. tmpfs), with
> the difference that now we may be talking about non-trivial amounts of
> memory.
> Also, that may not be the kind of content you want lingering
> around the filesystem (for the same reasons Qemu unlinks memory files
> from /dev/hugepages after mmap'ing it).
>
> That's why I'd prefer to rephrase what you said to "in addition"
> instead of "instead".
>
> > * VMMs can implement this protocol directly for POSIX-portable and
> >   unprivileged operation.
> > * A CUSE VFIO adapter simulates /dev/vfio so that VFIO-only VMMs can
> >   still take advantage of libmuser devices.
>
> I'm happy with that.
> We need to think the credential aspect through to ensure nodes can
> be created in the right places with the right privileges.
>
> >
> > Assuming this is feasible, would you lose any important
> > features/advantages of the muser.ko approach? I don't know enough about
> > VFIO to identify any blocker or obvious performance problems.
>
> That's what I elaborated above: muser.ko can keep
> various metadata (and other resources) about the device in the kernel
> and grant it back to userspace as needed. There are ways around it,
> but it requires some orchestration with tmpfs and the VMM (only so
> much can be kept in tmpfs; the eventfds need to be retransmitted from
> the machine emulator on request).
>
> Restarting is a critical aspect of this. One key use case for the
> project is to be able to emulate various devices from one process (for
> polling). That must be able to restart for upgrades or recovery.
>
> >
> > Regarding recovery, it seems straightforward to keep state in a tmpfs
> > file that can be reopened when the device is restarted. I don't think
> > kernel code is necessary?
>
> It adds a dependency, but isn't a show stopper. If we can work through
> permission issues, making sure the VMM can reconnect and retransmit
> eventfds and other state, then it should be ok.
>
> To be clear: I'm very happy to have a userspace-only option for this,
> I just don't want to ditch the kernel module (yet, anyway). :)
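
To make the tmpfs-based recovery idea quoted above concrete, here is a
minimal Python sketch. It is not muser or Qemu code. The state layout
(a 4-byte magic plus one 64-bit register), the helper names and the
temporary directory standing in for /dev/shm are all assumptions made
for illustration, and retransmitting eventfds from the VMM is not
covered.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout for illustration: 4-byte magic plus one 64-bit register.
STATE_FORMAT = "<4sQ"
STATE_SIZE = struct.calcsize(STATE_FORMAT)
MAGIC = b"DEVS"


def open_state(path):
    """Map the state file, creating and sizing it on first use."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        if os.fstat(fd).st_size < STATE_SIZE:
            os.ftruncate(fd, STATE_SIZE)
        return mmap.mmap(fd, STATE_SIZE)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed


def save_register(state, value):
    """Store the register value in the shared mapping."""
    state[:STATE_SIZE] = struct.pack(STATE_FORMAT, MAGIC, value)
    state.flush()


def load_register(state):
    """Return the saved register, or None if the file holds no valid state."""
    magic, value = struct.unpack(STATE_FORMAT, state[:STATE_SIZE])
    return value if magic == MAGIC else None


def demo():
    # Assumption: a real emulator would keep this under /dev/shm or another tmpfs.
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "device-state")

    first = open_state(path)       # first emulator instance
    save_register(first, 0x1234)
    first.close()                  # emulator exits (upgrade or crash)

    second = open_state(path)      # restarted emulator reopens the same file
    recovered = load_register(second)
    second.close()

    os.unlink(path)
    os.rmdir(directory)
    return recovered
```

Calling demo() simulates one restart cycle. Note that the file has to
stay linked for the restarted process to find it, which is exactly the
"content lingering around the filesystem" concern raised in the quoted
text.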

If it doesn't create too large of a burden to support both, then I think
it is very desirable. IIUC, this is saying the kernel-based solution
would be the optimized/optimal solution, and the userspace UNIX socket
based option the generic "works everywhere" fallback solution.

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|