[Xen-devel] Patches for Nvidia GPU passthrough
This makes /proc/cpuinfo almost identical between KVM and Xen VMs running Linux. The only exceptions are the flags "rep_good" (which is missing under Xen) and "eager_fpu" and "xsaveopt" (not seen under KVM), but as these are Linux-specific flags rather than flags set explicitly via CPUID, they shouldn't (?) matter for Windows VMs.

Anyway, even applying all of these patches would not alleviate Code 43. To be more specific, all Nvidia drivers up to 364.72 would BSOD on boot (SYSTEM_SERVICE_EXCEPTION), and newer drivers (368.22+) would cause Code 43. This happens on both Windows 7 Pro and Windows 8.1 VMs. The result is identical with qemu-xen and qemu-traditional.

Dom0 is Qubes 3.1 (Linux 4.1.24), Xen 4.6.1. Hardware: Intel i7-5820K, Asrock X99 WS motherboard, 32GB Corsair memory, EVGA GTX 980.

I would love it if some of you could try these patches with both newer and older Nvidia cards. Any suggestions, ideas and further patches would also be greatly appreciated! :)

Thanks!

Best regards,
Marcus

diff -ur -x .cproject -x .project -x '*.swp' xen-4.6.1/tools/libxl/libxl_cpuid.c xen-4.6.1-new/tools/libxl/libxl_cpuid.c
--- xen-4.6.1/tools/libxl/libxl_cpuid.c 2016-02-09 16:44:19.0 +0200
+++ xen-4.6.1-new/tools/libxl/libxl_cpuid.c 2016-07-10 12:09:36.09200 +0300
@@ -318,12 +318,31 @@
         if (endptr == NULL) {
             endptr = strchr(str, 0);
         }
-        if (endptr - str != 32) {
-            return 5;
-        }
+
         entry->policy[value] = calloc(32 + 1, 1);
-        strncpy(entry->policy[value], str, 32);
+        switch (endptr - str) {
+        case 32: {
+            strncpy(entry->policy[value], str, 32);
+            }
+            break;
+        case 8: {
+            uint32_t cpuid_hex = strtoul(str, &endptr, 16);
+            if ( str + 8 != endptr )
+                return 6;
+            for (int i = 0; i < 32; i++) {
+                if ( cpuid_hex & (1<<i) )
+                    entry->policy[value][31-i] = '1';
+                else
+                    entry->policy[value][31-i] = '0';
+            }
+            entry->policy[value][32] = 0;
+            }
+            break;
+        default:
+            return 5;
+        }
         entry->policy[value][32] = 0;
+
         if (*endptr == 0) {
             break;
         }
diff -ur -x .cproject -x .project -x '*.swp' xen-4.6.1/tools/libxl/xl_cmdimpl.c xen-4.6.1-new/tools/libxl/xl_cmdimpl.c
--- xen-4.6.1/tools/libxl/xl_cmdimpl.c 2016-07-11 23:45:45.04600 +0300
+++ xen-4.6.1-new/tools/libxl/xl_cmdimpl.c 2016-07-10 12:07:55.56400 +0300
@@ -2095,7 +2095,10 @@
         errstr = "invalid register name (must be e[abcd]x)";
         break;
     case 5:
-        errstr = "policy string must be exactly 32 characters long";
+        errstr = "policy string must be exactly 32 (binary) or 8 (hex) characters long";
+        break;
+    case 6:
+        errstr = "error decoding policy string";
         break;
     default:
         errstr = "unknown error";
diff -ur -x .cproject -x .project -x '*.swp' xen-4.6.1/tools/firmware/hvmloader/hvmloader.c xen-4.6.1-new/tools/firmware/hvmloader/hvmloader.c
--- xen-4.6.1/tools/firmware/hvmloader/hvmloader.c 2016-02-09 16:44:19.0 +0200
+++ xen-4.6.1-new/tools/firmware/hvmloader/hvmloader.c 2016-07-04 23:31:32.81500 +0300
@@ -127,9 +127,11 @@
 
         if ( !strcmp("XenVMMXenVMM", signature) )
             break;
+        if ( !strcmp("ZenZenZenZen", signature) )
+            break;
     }
 
-    BUG_ON(strcmp("XenVMMXenVMM", signature) || ((eax - base) < 2));
+    BUG_ON( (strcmp("XenVMMXenVMM", signature) && strcmp("ZenZenZenZen", signature) ) || ((eax - base) < 2));
 
     /* Fill in hypercall transfer pages. */
     cpuid(base + 2, &eax, &ebx, &ecx, &edx);
diff -ur -x .cproject -x .project -x '*.swp' xen-4.6.1/tools/libxl/libxl_create.c xen-4.6.1-new/tools/libxl/libxl_create.c
--- xen-4.6.1/tools/libxl/libxl_create.c 2016-07-09 16:47:05.18100 +0300
+++ xen-4.6.1-new/tools/libxl/libxl_create.c 2016-07-04 23:49:54.80200 +0300
@@ -284,6 +284,8 @@
         libxl_defbool_setdefault(&b_info->u.hvm.acpi_s4,            true);
         libxl_defbool_setdefault(&b_info->u.hvm.nx,                 true);
         libxl_defbool_setdefault(&b_info->u.hvm.viridian,           false);
+        libxl_defbool_setdefault(&b_info->u.hvm.spoof_viridian,     false);
+        libxl_defbool_setdefault(&b_info->u.hvm.spoof_xen,          false);
         libxl_defbool_setdefault(&b_info->u.hvm.hpet,               true);
         libxl_defbool_setdefault(&b_info->u.hvm.vpt_align,          true);
         libxl_defbool_s
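To make the new hex form of the policy string easier to follow, here is a small, self-contained C sketch of the same hex-to-bitstring expansion that the libxl_cpuid.c hunk above performs. The function and variable names are invented for illustration; this is not the patch code itself.

/*
 * Expand an 8-hex-digit CPUID register value into the 32-character
 * '0'/'1' policy string that libxl already understands (most significant
 * bit first). Illustrative sketch only, not the libxl implementation.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int hex_policy_to_bits(const char *str, char out[33])
{
    char *endptr;
    uint32_t val = strtoul(str, &endptr, 16);

    if (endptr - str != 8)              /* must be exactly 8 hex digits */
        return -1;

    for (int i = 0; i < 32; i++)        /* bit 31 becomes the first character */
        out[31 - i] = (val & (1u << i)) ? '1' : '0';
    out[32] = '\0';
    return 0;
}

int main(void)
{
    char bits[33];

    if (hex_policy_to_bits("40000000", bits) == 0)
        printf("%s\n", bits);   /* prints 01000000000000000000000000000000 */
    return 0;
}

For example, the 8-character input "40000000" expands to a 32-character string with only bit 30 set, which is exactly what the 32-character binary form would otherwise have to spell out digit by digit.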
Re: [Xen-devel] [PATCH v2] x86/mm: also flush TLB when putting writable foreign page reference
> >>> fe if we IPI it to flush the TLB (though may need memory
> >>> barriers -- need to think about a race with CPU C putting A _into_
> >>> the map at the same time...)
> >>> - we could track the timestamp of the most recent addition to the
> >>>   map, and drop any CPU whose TLB has been flushed since that,
> >>>   but that still lets unrelated unmaps keep CPUs alive in the map...
> >>> - we could double-buffer the map: always add CPUs to the active map;
> >>>   from time to time, swap maps and flush everything in the non-active
> >>>   map (filtered by the TLB timestamp when we last swapped over).
> >>>
> >>> Bah, this is turning into a tar pit. Let's stick to the v2 patch as
> >>> being (relatively) simple and correct, and revisit this if it causes
> >>> trouble. :)
> >>
> >> :(
> >>
> >> A 70% performance hit for guest creation is certainly going to cause
> >> problems, but we obviously need to prioritise correctness in this case.
>
> > Hmm, you did understand that the 70% hit is on a specific sub-part of
> > the overall process, not guest creation as a whole? Anyway, your reply
> > is neither an ack nor a nak nor an indication of what needs to change ...
>
> Yes - I realise it isn't all of domain creation, but this performance hit
> will also hit migration, qemu DMA mappings, etc.
>
> XenServer has started a side-by-side performance work-up of this change, as
> presented at the root of this thread. We should hopefully have some
> numbers in the next day or two.

I did some measurements on two builds of a recent version of XenServer using Xen upstream 4.9.0-3.0. The only difference between the builds was the patch x86-put-l1e-foreign-flush.patch in
https://lists.xenproject.org/archives/html/xen-devel/2017-04/msg02945.html.

I observed no measurable difference between these builds, with guest RAM values of 4G, 8G and 14G, for the following operations:

- time xe vm-start
- time xe vm-shutdown
- VM downtime during "xe vm-migrate" (as measured by pinging the VM during migration and verifying for how long pings would fail while both domains are paused)
- time xe vm-migrate   # for HVM guests (e.g. win7 and win10)

But I observed a difference in the duration of "time xe vm-migrate" for PV guests (e.g. centos68, debian70, ubuntu1204). For centos68, for instance, I obtained the following values on a machine with an Intel E3-1281v3 3.7GHz CPU, averaged over 10 runs for each data point:

| Guest RAM | no patch | with patch | difference | diff/RAM |
|-----------|----------|------------|------------|----------|
|   14GB    |  10.44s  |   13.46s   |   3.02s    | 0.22s/GB |
|    8GB    |   6.46s  |    8.28s   |   1.82s    | 0.23s/GB |
|    4GB    |   3.85s  |    4.74s   |   0.89s    | 0.22s/GB |

From these numbers, with the patch present, it looks like VM migration of a PV guest takes roughly an extra 1s for each extra 5GB of guest RAM. The VMs are mostly idle during migration.

At this point it's not clear to me why this difference is only visible for VM migration (as opposed to VM start, for example), and only for PV guests (as opposed to HVM guests).

Marcus
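To make the timestamp-filtering idea quoted above a little more concrete, here is a minimal, self-contained C sketch of dropping already-flushed CPUs from a flush mask by comparing per-CPU flush timestamps. The names and the plain bitmask are purely illustrative and do not correspond to Xen's real cpumask or tlbflush interfaces.

/*
 * Sketch of the timestamp-filtering idea: before IPIing every CPU in a
 * "needs flush" mask, drop the CPUs whose TLBs were already flushed after
 * the page was last mapped. Illustrative only; not Xen code.
 */
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 8

static uint64_t last_tlb_flush[NR_CPUS];    /* per-CPU time of last full flush */

/* Remove from 'mask' any CPU already flushed after 'page_timestamp'. */
static unsigned int filter_flush_mask(unsigned int mask, uint64_t page_timestamp)
{
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
        if ((mask & (1u << cpu)) && last_tlb_flush[cpu] > page_timestamp)
            mask &= ~(1u << cpu);
    return mask;
}

int main(void)
{
    /* CPUs 1 and 3 flushed recently; CPU 2 still has a stale TLB. */
    last_tlb_flush[1] = 200;
    last_tlb_flush[2] = 50;
    last_tlb_flush[3] = 300;

    unsigned int need_flush = (1u << 1) | (1u << 2) | (1u << 3);
    unsigned int to_ipi = filter_flush_mask(need_flush, 100 /* page timestamp */);

    printf("IPI mask: 0x%x\n", to_ipi);     /* only CPU 2 remains */
    return 0;
}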
Re: [Xen-devel] [PATCH 3/3] xen/block: add multi-page ring support
ed with sequential reads of many different block sizes and IO depths, and we only spotted it because our synthetic fio load used a wide range of parameters with sequential reads. It may also be specific to the way that Linux handles this situation.

(B) In other situations with sequential reads (block sizes between 8KiB and 128KiB), we observed that storage throughput with 1 page was around 50% worse than with 8 pages. Again, this seems related to merges existing with 1 page but not with 8 pages, and I would appreciate potential explanations.

For sequential reads, the performance difference spotted in (A) is arguably counterbalanced by the performance difference in (B), and they cancel each other out if all block sizes are considered together. For random reads, 8-page rings were similar or superior to 1-page rings in all tested conditions.

All things considered, we believe that the multi-page ring patches improve storage performance (apart from case (A)) and should therefore be good to merge.

Marcus
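As background for why the number of ring pages matters at all here, the sketch below shows how the number of request slots scales with ring pages, using a power-of-two round-down as shared rings do. The 64-byte header and 112-byte entry sizes are assumptions for illustration, not values taken from the blkif headers.

/*
 * Rough calculation of ring slots per number of ring pages. The header and
 * entry sizes are assumed values for illustration only.
 */
#include <stdio.h>

static unsigned int ring_entries(unsigned int pages, unsigned int page_size,
                                 unsigned int header, unsigned int entry)
{
    unsigned int n = (pages * page_size - header) / entry;
    unsigned int pow2 = 1;

    while (pow2 * 2 <= n)       /* shared rings round down to a power of two */
        pow2 *= 2;
    return pow2;
}

int main(void)
{
    /* assumed sizes: 4K pages, 64-byte ring header, 112-byte ring entry */
    for (unsigned int pages = 1; pages <= 8; pages *= 2)
        printf("%u page(s): %u in-flight requests\n",
               pages, ring_entries(pages, 4096, 64, 112));
    return 0;
}

Under these assumptions a 1-page ring holds around 32 outstanding requests while an 8-page ring holds around 256, which is the kind of difference that could plausibly change how often the block layer finds adjacent requests to merge.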
Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback
On 13/05/15 11:29, Bob Liu wrote:
> On 04/28/2015 03:46 PM, Arianna Avanzini wrote:
>> Hello Christoph,
>>
>> On 28/04/2015 09:36, Christoph Hellwig wrote:
>>> What happened to this patchset?
>>
>> It was passed on to Bob Liu, who published a follow-up patchset here:
>> https://lkml.org/lkml/2015/2/15/46
>
> Right, and then I was interrupted by another xen-block feature: 'multi-page' ring.
> Will be back on this patchset soon.
> Thank you!
> -Bob

Hi,

Our measurements for the multiqueue patch indicate a clear improvement in IOPS when more queues are used.

The measurements were obtained under the following conditions:
- using blkback as the dom0 backend, with the multiqueue patch applied to a dom0 kernel 4.0, on 8 vcpus
- using a recent Ubuntu 15.04 kernel 3.19 with the multiqueue frontend applied, used as a guest on 4 vcpus
- using a Micron RealSSD P320h as the underlying local storage on a Dell PowerEdge R720 with 2 Xeon E5-2643 v2 CPUs
- fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest

We used direct_io to skip caching in the guest and ran fio for 60s, reading a number of block sizes ranging from 512 bytes to 4MiB. A queue depth of 32 for each queue was used to saturate individual vcpus in the guest.

We were interested in observing storage IOPS for different block sizes. Our expectation was that IOPS would improve when increasing the number of queues, because both the guest and dom0 would be able to make use of more vcpus to handle these requests.

These are the results (as aggregate IOPS for all the fio threads) that we got for the conditions above with sequential reads:

fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
     8          32        512          158K          264K
     8          32         1K          157K          260K
     8          32         2K          157K          258K
     8          32         4K          148K          257K
     8          32         8K          124K          207K
     8          32        16K           84K          105K
     8          32        32K           50K           54K
     8          32        64K           24K           27K
     8          32       128K           11K           13K

8-queue IOPS were better than single-queue IOPS for all block sizes. There were also very good improvements for sequential writes with block size 4K (from 80K IOPS with a single queue to 230K IOPS with 8 queues), and no regressions were visible in any measurement performed.

Marcus
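As a purely illustrative sketch of the expectation described above (more queues let more vcpus submit and service requests in parallel), here is a toy cpu-to-queue mapping. It is not the blkfront/blkback code, and the mapping policy is invented for the example.

/*
 * Toy illustration: with one queue every vcpu funnels I/O through the same
 * ring, while with one queue per vcpu the submissions spread out and can be
 * handled by different backend vcpus in parallel. Not blkfront/blkback code.
 */
#include <stdio.h>

/* Hypothetical static mapping of a submitting vcpu to a hardware queue. */
static unsigned int queue_for_cpu(unsigned int cpu, unsigned int nr_queues)
{
    return cpu % nr_queues;
}

int main(void)
{
    const unsigned int guest_vcpus = 4;
    const unsigned int configs[] = { 1, 8 };    /* 1-queue vs 8-queue setup */

    for (unsigned int c = 0; c < 2; c++) {
        printf("nr_queues = %u:\n", configs[c]);
        for (unsigned int cpu = 0; cpu < guest_vcpus; cpu++)
            printf("  vcpu %u submits to queue %u\n",
                   cpu, queue_for_cpu(cpu, configs[c]));
    }
    return 0;
}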