Hi Laszlo,

Looks like CONFIG_PREEMPT_VOLUNTARY isn't set.
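For reference, I grabbed the output below with something along these
lines (a hypothetical invocation -- it assumes the running kernel
exposes its config at /proc/config.gz, which IIRC the stock Arch kernel
does):

  # hypothetical: needs CONFIG_IKCONFIG_PROC for /proc/config.gz to exist
  zgrep PREEMPT /proc/config.gz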
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_DEBUG_PREEMPT=y
# CONFIG_PREEMPTIRQ_EVENTS is not set
# CONFIG_PREEMPT_TRACER is not set

If I get time tonight, I'll recompile the kernel with that flag set and
see if it makes a difference.

- Adam

On Mon, Nov 19, 2018 at 10:35 AM Laszlo Ersek <ler...@redhat.com> wrote:
>
> On 11/16/18 22:36, A Z wrote:
> > And just a quick discovery:
> >
> > With 24G assigned (-m 24576)...
> >
> > Reducing the number of CPU cores to 1, 2 or 3 with
> > ovmf-x64-0-20181105.757.g114a75ad86 results in a really fast boot to
> > the TianoCore splash.
> > As soon as the number of CPU cores jumps to 4, the VM takes a hit in
> > terms of boot time.
> >
> > 1 core, 1 thread: ~16 seconds
> > 2 cores, 1 thread: ~16 seconds
> > 3 cores, 1 thread: ~19 seconds
> > 4 cores, 1 thread: ~111 seconds
> > 8 cores, 1 thread: ~150 seconds
> > 8 cores, 2 threads: ~334 seconds
> >
> > (Timing is based on grepping debugcon.log for "[Bds]OsIndication:",
> > which seems to occur right after the splash screen is displayed.)
> >
> > Significant pauses can be observed at points in the boot process,
> > based on the debugcon output.
> >
> > The first pause is here:
> >
> >> Loading PEIM at 0x0007FEC1000 EntryPoint=0x0007FEC6985 CpuMpPei.efi
> >> Register PPI Notify: F894643D-C449-42D1-8EA8-85BDD8C65BDE
> >> Notify: PPI Guid: F894643D-C449-42D1-8EA8-85BDD8C65BDE, Peim notify entry point: 7FEC58C4
> >> AP Loop Mode is 1
> >> WakeupBufferStart = 9F000, WakeupBufferSize = 1000
> >
> > Then here:
> >
> >> Loading driver at 0x0007F96C000 EntryPoint=0x0007F9743EE CpuDxe.efi
> >> InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7F4AC618
> >> ProtectUefiImageCommon - 0x7F4AE040
> >>   - 0x000000007F96C000 - 0x000000000000E3C0
> >> Paging: added 512 pages to page table pool
> >> CurrentPagingContext:
> >>   MachineType   - 0x8664
> >>   PageTableBase - 0x7FA01000
> >>   Attributes    - 0xC0000006
> >> InstallProtocolInterface: 26BACCB1-6F42-11D4-BCE7-0080C73C8881 7F979E20
> >> MemoryProtectionCpuArchProtocolNotify:
> <snip>
> >> ConvertPages: failed to find range A0000 - FFFFF
> >> ConvertPages: failed to find range 80000000 - 8FFFFFFF
> >> ConvertPages: failed to find range 90000000 - FBFFFFFF
> >> ConvertPages: failed to find range FEC00000 - FEC00FFF
> >> Failed to update capability: [8] 00000000FED00000 - 00000000FED003FF (C700000000000001 -> C700000000026001)
> >> ConvertPages: failed to find range FED1C000 - FED1FFFF
> >> ConvertPages: failed to find range FEE00000 - FEEFFFFF
> >> ConvertPages: failed to find range FFE00000 - FFFFFFFF
> >> AP Loop Mode is 1
> >> WakeupBufferStart = 9F000, WakeupBufferSize = 229
> >
> > And the last one:
> >
> >> Loading driver at 0x0007F918000 EntryPoint=0x0007F91D876 PciHostBridgeDxe.efi
> >> InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7F01D798
> >> ProtectUefiImageCommon - 0x7F01D1C0
> >>   - 0x000000007F918000 - 0x0000000000009F00
> >> Select Item: 0x0
> >> FW CFG Signature: 0x554D4551
> >> Select Item: 0x1
> >> FW CFG Revision: 0x3
> >> QemuFwCfg interface (DMA) is supported.
> >> Select Item: 0x19
> >> InitRootBridge: populated root bus 0, with room for 255 subordinate bus(es)
> >> RootBridge: PciRoot(0x0)
> >>   Support/Attr: 70069 / 70069
> >>   DmaAbove4G: No
> >>   NoExtConfSpace: No
> >>   AllocAttr: 3 (CombineMemPMem Mem64Decode)
> >>   Bus: 0 - FF Translation=0
> >>   Io: 6000 - FFFF Translation=0
> >>   Mem: 90000000 - FBFFFFFF Translation=0
> >>   MemAbove4G: 800000000 - FFFFFFFFF Translation=0
> >>   PMem: FFFFFFFFFFFFFFFF - 0 Translation=0
> >>   PMemAbove4G: FFFFFFFFFFFFFFFF - 0 Translation=0
> >
> >
> > Regards,
> > Adam
> >
> >
> > On Thu, Nov 15, 2018 at 10:28 PM A Z <a...@zegelin.com> wrote:
> >>
> >> On Thu, Nov 15, 2018 at 2:47 AM Laszlo Ersek <ler...@redhat.com> wrote:
> >>>
> >>> If nothing at all had changed in your configuration, then the behavior
> >>> wouldn't have changed either. You must certainly have upgraded some
> >>> components since then, between host kernel, QEMU, and OVMF (likely all
> >>> of them). Please state the version numbers that work and don't work,
> >>> and try to isolate the issue to a version change that affects a
> >>> *single* component.
> >>
> >> I wasn't clear on this, sorry. I meant that the VM configuration --
> >> the qemu command-line parameters, *except* for the OVMF firmware image
> >> and the amount of memory assigned -- hasn't changed since I got
> >> everything up and running (i.e., CPU caps, disk, network and all the
> >> PCI-passthrough assignments, etc. haven't changed).
> >> I added more memory to the host a few months ago, and upped the VM
> >> allocation to match at the same time. This made the boot times even
> >> worse.
> >> I've also occasionally downloaded a newer version of OVMF to see if it
> >> fixes this boot-time issue.
> >> The host OS kernel, qemu and other package versions have changed over
> >> this period as I've installed OS updates. So has the build of Windows
> >> 10.
> >>
> >>> I vaguely recall similar reports from the past. I've been under the
> >>> impression that this kind of issue has been fixed. Thus, I'd guess you
> >>> are running old(er) versions of the affected component (whichever it
> >>> is), or else you are running bleeding edge, and the bug has crept back
> >>> in (a new regression).
> >>
> >> It's possibly the latter.
> >>
> >> Host kernel: 4.18.10-arch1-1-ARCH #1 SMP PREEMPT Wed Sep 26 09:48:22 UTC 2018 x86_64 GNU/Linux
> >> qemu version: QEMU emulator version 3.0.0
> >> OVMF build: ovmf-x64-0-20181105.757.g114a75ad86 (downloaded from
> >> www.kraxel.org)
> >>
> >>
> >> On Thu, Nov 15, 2018 at 6:14 AM Hristo Iliev <hri...@hiliev.eu> wrote:
> >>
> >>> This sounds quite like a problem I used to have in the past with
> >>> certain versions of OVMF and the Linux kernel. If I remember
> >>> correctly, some memory ranges would get marked as non-cacheable,
> >>> which resulted in the terrible slowdown you describe. The resolution
> >>> back then was to stick to older versions of both the firmware and the
> >>> kernel.
> >>>
> >>> I still keep around an older version of OVMF that I used until
> >>> recently -- edk2.git-ovmf-x64-0-20160714.b1992.gfc3005.noarch.rpm.
> >>> You could download the RPM here and try if it works for you:
> >>>
> >>> https://exa.flopser.de/s/z3cjqFUlZ2gBPCy
> >>
> >> I will try this firmware build. I have read similar things about
> >> versions from 2016 working correctly.
> >> But TBH I'd also prefer to run something newer -- bug fixes (and
> >> regressions!), security patches, etc.
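(Side note: since the host is Arch, the RPM can't be installed directly,
but the firmware images can be pulled out with plain archive tools.
Untested here, but something like the following should work -- bsdtar is
part of libarchive, which can read RPM payloads:

  # hypothetical: unpack the RPM into the current directory
  bsdtar -xf edk2.git-ovmf-x64-0-20160714.b1992.gfc3005.noarch.rpm

The OVMF *.fd images should then land somewhere under usr/share/.)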
> >>
> >>> Recent QEMU versions started complaining about firmware
> >>> incompatibilities, so I tried the latest RPM from Kraxel
> >>> (https://www.kraxel.org/repos/jenkins/edk2/) and it works just fine.
> >>> The host system is Arch Linux with the latest Arch kernel and QEMU.
> >>
> >> While I'm not running today's build, I am running one downloaded from
> >> Kraxel.org within the past week.
> >>
> >>
> >> On Thu, Nov 15, 2018 at 7:28 AM Alex Williamson
> >> <alex.william...@redhat.com> wrote:
> >>
> >>> If you can confirm that your system uses a NUMA memory configuration
> >>> and that 24G exceeds the node-local memory, then this is a known
> >>> regression in the Linux kernel mm. AIUI it's being worked upstream;
> >>> workarounds are to either use hugepages or locked memory for the VM.
> >>
> >> I'm currently using hugepages. Not sure how to lock the memory, sorry.
> >>
> >> It is a dual-CPU system. The topology:
> >>
> >> $ sudo numactl -H
> >> available: 2 nodes (0-1)
> >> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
> >> node 0 size: 32143 MB
> >> node 0 free: 236 MB
> >> node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
> >> node 1 size: 32232 MB
> >> node 1 free: 95 MB
> >> node distances:
> >> node   0   1
> >>   0:  10  21
> >>   1:  21  10
> >>
> >>
> >> For reference, the qemu command line I'm exec'ing from a script is:
> >>
> >> /usr/sbin/qemu-system-x86_64 \
> >>     -name win10,debug-threads=on \
> >>     -monitor unix:win10.sock,server,nowait \
> >>     -debugcon file:debugcon.log -global isa-debugcon.iobase=0x402 \
> >>     -nodefconfig -no-user-config -nodefaults -nographic \
> >>     -machine q35,accel=kvm \
> >>     -enable-kvm \
> >>     -cpu host,kvm=off,hv_spinlocks=0x1fff,hv_relaxed,hv_time,hv_vendor_id=Nvidia43FIX \
> >>     -rtc base=localtime,clock=host,driftfix=slew \
> >>     \
> >>     -drive file=/vmstorage/firmware/ovmf-x64-0-20181105.757.g114a75ad86/OVMF-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on \
> >>     -drive file=/home/adam/win10-vars.fd,if=pflash,format=raw,readonly=off,unit=1 \
> >>     \
> >>     -object iothread,id=io1 \
> >>     \
> >>     -mem-path /dev/hugepages \
> >>     -m 24576 \
> >>     -smp cores=8,threads=2,sockets=1 \
> >>     \
> >>     -netdev bridge,id=netdev0,br=br0 \
> >>     -device virtio-net-pci,netdev=netdev0,mac=52:54:00:19:34:57 \
> >>     -device virtio-scsi-pci,id=scsi0,ioeventfd=on,iothread=io1,num_queues=4 \
> >>     \
> >>     -device vfio-pci,host=7:00.0,addr=0x6,multifunction=on \
> >>     -device vfio-pci,host=7:00.1,addr=0x6.0x1 \
> >>     -device vfio-pci,host=05:00.0 \
> >>     -device vfio-pci,host=06:00.0
>
> Can you grep your kernel config for "PREEMPT"? I remember that for the
> Lime Tech guys, CONFIG_PREEMPT_VOLUNTARY made a huge difference, in the
> same spots. It might help you too.
>
> I do notice that your "uname" output already includes "PREEMPT". I don't
> remember (and don't have time now to look up) what knob does exactly
> what. Please try massaging these knobs in your host kernel config, and
> see if they make a difference.
>
> When CONFIG_PREEMPT_VOLUNTARY was identified as the one bit making the
> difference in the Lime Tech case, I requested that they please document
> their findings on the web somewhere, perhaps in a forum or in a blog
> post. I'm not sure if that happened after all.
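(For the record, the knob-flipping I have in mind is roughly the
following -- a hypothetical sketch using the scripts/config helper that
ships in the kernel source tree; on Arch the cleaner route is a modified
PKGBUILD, but the config change itself is the same:

  # hypothetical sketch: swap full preemption for voluntary preemption
  cd linux-4.18.10
  scripts/config --disable PREEMPT --enable PREEMPT_VOLUNTARY
  make olddefconfig   # let kconfig resolve any dependent options
  make -j$(nproc)
)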
>
> For some hand-wavy explanation, I can offer this: the two spots in the
> OVMF debug log are related to the guest firmware starting up all VCPUs
> at once, and those VCPUs fighting over some spinlocks (by design). PLE
> (pause loop exiting) should really help here, but apparently, it doesn't
> always do that. So, with CONFIG_PREEMPT (or without it, I don't
> remember), the host kernel threads get scheduled in a way that is
> exactly "adversarial" to those VCPUs making good progress. E.g. if a
> VCPU owns a spinlock but the corresponding host kernel thread is
> de-scheduled for a long time, the other VCPUs / host kernel threads
> won't make any progress, no matter how hard they work. Yada yada. As I
> said, hand-wavy, sorry.
>
> Thanks
> Laszlo

_______________________________________________
vfio-users mailing list
vfio-users@redhat.com
https://www.redhat.com/mailman/listinfo/vfio-users