> On Oct 25, 2019, at 9:13 AM, Stephen Hemminger <step...@networkplumber.org> wrote:
>
> On Thu, 24 Oct 2019 21:45:56 -0700
> Andy Lutomirski <l...@kernel.org> wrote:
>
>> Hi all-
>>
>> Supporting iopl() in the Linux kernel is becoming a maintainability
>> problem. As far as I know, DPDK is the only major modern user of
>> iopl().
>>
>> After doing some research, DPDK uses direct io port access for only a
>> single purpose: accessing legacy virtio configuration structures.
>> These structures are mapped in IO space in BAR 0 on legacy virtio
>> devices.
>
> Yes. Legacy virtio seems to have been designed without consideration
> of how to use it in userspace. Xen, Vmware and Hyper-V all use memory
> as a doorbell mechanism which is easier to use from userspace.
>
>
>> There are at least three ways you could avoid using iopl(). Here they
>> are in rough order of quality in my opinion:
>>
>> 1. Change pci_uio_ioport_read() and pci_uio_ioport_write() to use
>> read() and write() on resource0 in sysfs.
>
> The cost of entering the kernel for a doorbell mechanism is too
> expensive and would kill performance.
>
>
>> 2. Use the alternative access mechanism in the virtio legacy spec:
>> there is a way to access all of these structures via configuration
>> space.
>
> There is no way to use memory doorbell on older versions of virtio.
> Users want to run DPDK on old stuff like RHEL6 and even older
> kernel forks. There are even use cases where virtio is used for
> a non-Linux host; such as GCP.
>
>
>> 3. Use ioperm() instead of iopl().
>
> Ioperm has the wrong thread semantics. All DPDK applications have
> multiple threads and the initialization logic needs to work even
> if the thread is started later; threads can also be started by
> the user application.
>
> Iopl applies to whole process so this is not an issue.
This is not true. ioperm() and iopl() have identical thread semantics. I think what you're seeing is that you can set iopl(3) early without knowing which port range to request. You could alternatively set ioperm() early and ask for a very wide range.

In principle, we could make ioperm() apply per process, but I'm not sure we should add that kind of complexity to support a mostly obsolete use case like this. There's actually an argument to be made that per-mm ioperm would be easier to handle in the kernel than per-task, due to the vagaries of KPTI.

All this being said, what are the actual performance implications of write() to /sys/.../resource0? Off the top of my head, I would guess that the OUTB or OUTL instruction itself is incredibly slow because it is trapped and emulated, and that virtio-legacy hypervisors aren't particularly fast to begin with. As a result, the extra cost of write() might not actually matter that much.

>
>>
>>
>> We are considering changes to the kernel that will potentially harm
>> the performance of any program that uses iopl(3) -- in particular,
>> context switches will become more expensive, and the scheduler might
>> need to explicitly penalize such programs to ensure fairness. Using
>> ioperm() already hurts performance, and the proposed changes to iopl()
>> will make it even worse. Alternatively, the kernel could drop iopl()
>> support entirely. I will certainly make a change to allow
>> distributions to remove iopl() support entirely from their kernels,
>> and I expect that distributions will do this.
>>
>> Please fix DPDK.
>
> Please fix virtio.

Done, with the new version of virtio :)
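For illustration only, option 1 (replacing port I/O with read()/write() on the BAR's sysfs resource file) might look roughly like the minimal C sketch below. It assumes the kernel exposes the legacy device's I/O-port BAR 0 as `/sys/bus/pci/devices/<BDF>/resource0` and accepts only 1-, 2- or 4-byte pread()/pwrite() accesses on it; the register offsets are those of the legacy virtio PCI header (queue notify at 16, device status at 18). The helper names (`open_legacy_bar0`, `legacy_virtio_notify`, and so on) are hypothetical and are not DPDK's actual `pci_uio_ioport_read()`/`pci_uio_ioport_write()` API.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Register offsets in the legacy virtio PCI header (I/O BAR 0). */
#define VIRTIO_PCI_QUEUE_NOTIFY 16 /* 16-bit queue index */
#define VIRTIO_PCI_STATUS       18 /* 8-bit device status */

/* Hypothetical helper: open the I/O BAR of a device such as "0000:00:04.0". */
static int open_legacy_bar0(const char *bdf)
{
    char path[128];
    int n = snprintf(path, sizeof(path),
                     "/sys/bus/pci/devices/%s/resource0", bdf);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return open(path, O_RDWR | O_CLOEXEC);
}

/* One port write = one pwrite() syscall; width must be 1, 2 or 4 bytes. */
static int bar_write(int fd, off_t off, const void *buf, size_t width)
{
    assert(width == 1 || width == 2 || width == 4);
    ssize_t r = pwrite(fd, buf, width, off);
    if (r < 0)
        return -1;
    if ((size_t)r != width) {
        errno = EIO; /* short access: treat as a failed port write */
        return -1;
    }
    return 0;
}

/* One port read = one pread() syscall; width must be 1, 2 or 4 bytes. */
static int bar_read(int fd, off_t off, void *buf, size_t width)
{
    assert(width == 1 || width == 2 || width == 4);
    ssize_t r = pread(fd, buf, width, off);
    if (r < 0)
        return -1;
    if ((size_t)r != width) {
        errno = EIO;
        return -1;
    }
    return 0;
}

/* Stands in for outw(queue_index, base + VIRTIO_PCI_QUEUE_NOTIFY).
 * Host byte order matches the little-endian legacy layout on x86. */
static int legacy_virtio_notify(int fd, uint16_t queue_index)
{
    return bar_write(fd, VIRTIO_PCI_QUEUE_NOTIFY,
                     &queue_index, sizeof(queue_index));
}

static int legacy_virtio_set_status(int fd, uint8_t status)
{
    return bar_write(fd, VIRTIO_PCI_STATUS, &status, sizeof(status));
}

static int legacy_virtio_get_status(int fd, uint8_t *status)
{
    return bar_read(fd, VIRTIO_PCI_STATUS, status, sizeof(*status));
}
```

Every doorbell write here is a full syscall, which is exactly the overhead Stephen objects to. Whether that matters next to an OUTW that the hypervisor already traps and emulates is the open question raised above, and it would need measuring.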