[Xen-devel] Patches for Nvidia GPU passthrough

2016-07-14 Thread marcus

This makes /proc/cpuinfo almost identical between KVM and Xen VMs 
running Linux. The only exceptions are the flags "rep_good" (missing 
under Xen) and "eager_fpu" and "xsaveopt" (not seen under KVM), but as 
these are Linux-specific flags rather than bits set explicitly by CPUID, 
they shouldn't (?) matter on Windows VMs.
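
In case anyone wants to reproduce the comparison, here is a minimal, 
illustrative sketch (not part of the patches below; the file name is 
arbitrary) that dumps the raw CPUID leaf 1 feature words inside a guest, 
so KVM and Xen VMs can be diffed directly rather than via the 
Linux-specific /proc/cpuinfo flags:

/* cpuid-leaf1.c: print the raw CPUID leaf 1 feature words (ECX/EDX)
 * so they can be compared directly between a KVM and a Xen guest.
 * Build with: gcc -O2 -o cpuid-leaf1 cpuid-leaf1.c
 */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;
    printf("leaf 1: ecx=%08x edx=%08x\n", ecx, edx);
    return 0;
}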


--

Anyway, even applying all of these patches does not alleviate Code 43. 
To be more specific, all NVidia drivers up to 364.72 BSOD on boot 
(SYSTEM_SERVICE_EXCEPTION), and newer drivers (368.22+) cause Code 43. 
This happens on both Windows 7 Pro and 8.1 VMs. The result is identical 
with qemu-xen and qemu-traditional. Dom0 is Qubes 3.1 (Linux 4.1.24), 
Xen 4.6.1. Hardware: Intel i7-5820K, ASRock X99 WS motherboard, 32GB 
Corsair memory, EVGA GTX 980.


I would love it if some of you could try these patches with both newer 
and older NVidia cards. Any suggestions, ideas and further patches 
would also be greatly appreciated! :)


Thanks!

Best regards,
Marcus



diff -ur -x .cproject -x .project -x '*.swp' xen-4.6.1/tools/libxl/libxl_cpuid.c xen-4.6.1-new/tools/libxl/libxl_cpuid.c
--- xen-4.6.1/tools/libxl/libxl_cpuid.c	2016-02-09 16:44:19.0 +0200
+++ xen-4.6.1-new/tools/libxl/libxl_cpuid.c	2016-07-10 12:09:36.09200 +0300
@@ -318,12 +318,31 @@
 if (endptr == NULL) {
 endptr = strchr(str, 0);
 }
-if (endptr - str != 32) {
-return 5;
-}
+
 entry->policy[value] = calloc(32 + 1, 1);
-strncpy(entry->policy[value], str, 32);
+switch (endptr - str) {
+	case 32: {
+strncpy(entry->policy[value], str, 32);
+		}
+	break;
+	case 8: {
+		uint32_t cpuid_hex = strtoul(str,&endptr,16);
+		if ( str +8 != endptr )
+			return 6;
+		for (int i=0;i<32;i++) {
+			if ( cpuid_hex & (1<<i) )
+				entry->policy[value][31-i]='1';
+			else
+				entry->policy[value][31-i]='0';
+		}
+entry->policy[value][32]=0;
+		}
+	break;
+	default:
+		return 5;
+}
 entry->policy[value][32] = 0;
+
 if (*endptr == 0) {
 break;
 }
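
For readability, here is a standalone sketch of the conversion the hunk 
above adds (illustrative only, not the patch itself): an 8-character hex 
register value is expanded into the 32-character '0'/'1' policy string 
that libxl already accepts, most significant bit first.

/* Hypothetical standalone helper mirroring the libxl_cpuid.c hunk above:
 * expand an 8-char hex register value into the 32-char binary policy
 * string that libxl already understands, MSB first.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

static int hex_to_policy(const char *hex, char out[33])
{
    char *end;
    uint32_t val = strtoul(hex, &end, 16);

    if (end - hex != 8)     /* mirrors the "return 6" error case above */
        return -1;

    for (int i = 0; i < 32; i++)
        out[31 - i] = (val & (1u << i)) ? '1' : '0';
    out[32] = '\0';
    return 0;
}

int main(void)
{
    char policy[33];

    if (hex_to_policy("1c9ae3bf", policy) == 0)
        printf("%s\n", policy);  /* 00011100100110101110001110111111 */
    return 0;
}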
diff -ur -x .cproject -x .project -x '*.swp' xen-4.6.1/tools/libxl/xl_cmdimpl.c xen-4.6.1-new/tools/libxl/xl_cmdimpl.c
--- xen-4.6.1/tools/libxl/xl_cmdimpl.c	2016-07-11 23:45:45.04600 +0300
+++ xen-4.6.1-new/tools/libxl/xl_cmdimpl.c	2016-07-10 12:07:55.56400 +0300
@@ -2095,7 +2095,10 @@
 errstr = "invalid register name (must be e[abcd]x)";
 break;
 case 5:
-errstr = "policy string must be exactly 32 characters long";
+errstr = "policy string must be exactly 32 (binary) or 8 (hex) characters long";
+break;
+case 6:
+errstr = "error decoding policy string";
 break;
 default:
 errstr = "unknown error";

diff -ur -x .cproject -x .project -x '*.swp' xen-4.6.1/tools/firmware/hvmloader/hvmloader.c xen-4.6.1-new/tools/firmware/hvmloader/hvmloader.c
--- xen-4.6.1/tools/firmware/hvmloader/hvmloader.c	2016-02-09 16:44:19.0 +0200
+++ xen-4.6.1-new/tools/firmware/hvmloader/hvmloader.c	2016-07-04 23:31:32.81500 +0300
@@ -127,9 +127,11 @@
 
 if ( !strcmp("XenVMMXenVMM", signature) )
 break;
+if ( !strcmp("ZenZenZenZen", signature) )
+break;
 }
 
-BUG_ON(strcmp("XenVMMXenVMM", signature) || ((eax - base) < 2));
+BUG_ON( (strcmp("XenVMMXenVMM", signature) && strcmp("ZenZenZenZen", signature) ) || ((eax - base) < 2));
 
 /* Fill in hypercall transfer pages. */
 cpuid(base + 2, &eax, &ebx, &ecx, &edx);
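
For context, a rough guest-side sketch of the signature probe that this 
hvmloader hunk extends: CPUID leaves starting at 0x40000000 return a 
12-byte hypervisor signature in EBX/ECX/EDX, which is where the spoofed 
"ZenZenZenZen" string comes in. Illustrative only, assuming a GCC/clang 
toolchain that provides <cpuid.h>:

/* Scan the hypervisor CPUID leaves (0x40000000 + 0x100*i) and look for
 * the 12-byte signature in EBX/ECX/EDX, accepting either the normal
 * "XenVMMXenVMM" or the spoofed "ZenZenZenZen" from these patches.
 */
#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void)
{
    for (unsigned int base = 0x40000000; base < 0x40010000; base += 0x100) {
        unsigned int eax, ebx, ecx, edx;
        char sig[13];

        __cpuid(base, eax, ebx, ecx, edx);

        memcpy(sig + 0, &ebx, 4);
        memcpy(sig + 4, &ecx, 4);
        memcpy(sig + 8, &edx, 4);
        sig[12] = '\0';

        if (!strcmp(sig, "XenVMMXenVMM") || !strcmp(sig, "ZenZenZenZen")) {
            printf("found hypervisor signature \"%s\" at leaf %#x\n",
                   sig, base);
            return 0;
        }
    }
    printf("no Xen/Zen signature found\n");
    return 0;
}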
diff -ur -x .cproject -x .project -x '*.swp' xen-4.6.1/tools/libxl/libxl_create.c xen-4.6.1-new/tools/libxl/libxl_create.c
--- xen-4.6.1/tools/libxl/libxl_create.c	2016-07-09 16:47:05.18100 +0300
+++ xen-4.6.1-new/tools/libxl/libxl_create.c	2016-07-04 23:49:54.80200 +0300
@@ -284,6 +284,8 @@
 libxl_defbool_setdefault(&b_info->u.hvm.acpi_s4,true);
 libxl_defbool_setdefault(&b_info->u.hvm.nx, true);
 libxl_defbool_setdefault(&b_info->u.hvm.viridian,   false);
+libxl_defbool_setdefault(&b_info->u.hvm.spoof_viridian, false);
+libxl_defbool_setdefault(&b_info->u.hvm.spoof_xen,  false);
 libxl_defbool_setdefault(&b_info->u.hvm.hpet,   true);
 libxl_defbool_setdefault(&b_info->u.hvm.vpt_align,  true);
 libxl_defbool_s

Re: [Xen-devel] [PATCH v2] x86/mm: also flush TLB when putting writable foreign page reference

2017-05-05 Thread Marcus Granado
fe if we IPI it to flush the TLB (though may need memory
> >>>barriers -- need to think about a race with CPU C putting A _into_
> >>>the map at the same time...)
> >>>  - we could track the timestamp of the most recent addition to the
> >>>map, and drop any CPU whose TLB has been flushed since that,
> >>>but that still lets unrelated unmaps keep CPUs alive in the map...
> >>>  - we could double-buffer the map: always add CPUs to the active map;
> >>>from time to time, swap maps and flush everything in the non-active
> >>>map (filtered by the TLB timestamp when we last swapped over).
> >>>
> >>> Bah, this is turning into a tar pit.  Let's stick to the v2 patch as
> >>> being (relatively) simple and correct, and revisit this if it causes
> >>> trouble. :)
> >> :(
> >>
> >> A 70% performance hit for guest creation is certainly going to cause
> >> problems, but we obviously need to prioritise correctness in this case.
> > Hmm, you did understand that the 70% hit is on a specific sub-part of
> > the overall process, not guest creation as a whole? Anyway, your reply
> > is neither an ack nor a nak nor an indication of what needs to change
> > ...
> 
> Yes - I realise it isn't all of domain creation, but this performance hit 
> will also
> hit migration, qemu DMA mappings, etc.
> 
> XenServer has started a side-by-side performance work-up of this change, as
> presented at the root of this thread.  We should hopefully have some
> number in the next day or two.
> 

I did some measurements on two builds of a recent version of XenServer using 
Xen upstream 4.9.0-3.0. The only difference between the builds was the patch 
x86-put-l1e-foreign-flush.patch  in 
https://lists.xenproject.org/archives/html/xen-devel/2017-04/msg02945.html.

I observed no measurable difference between these builds, with guest RAM values 
of 4G, 8G and 14G, for the following operations:
- time xe vm-start
- time xe vm-shutdown
- vm downtime during "xe vm-migrate" (measured by pinging the vm during 
migration and checking how long pings failed while both domains were paused)
- time xe vm-migrate # for HVM guests (eg. win7 and win10)

But I did observe a difference in the duration of "time xe vm-migrate" for PV 
guests (eg. centos68, debian70, ubuntu1204). For centos68, for instance, I 
obtained the following values on a machine with an Intel E3-1281v3 3.7GHz CPU, 
averaged over 10 runs for each data point:
| Guest RAM | no patch | with patch | difference | diff/RAM |
|   14GB    |  10.44s  |   13.46s   |   3.02s    | 0.22s/GB |
|    8GB    |   6.46s  |    8.28s   |   1.82s    | 0.23s/GB |
|    4GB    |   3.85s  |    4.74s   |   0.89s    | 0.22s/GB |

From these numbers, with the patch present, VM migration of a PV guest takes 
roughly an extra 0.22s per GB of guest RAM, i.e. about an extra second for 
every 4-5GB. The VMs are mostly idle during migration. At this point, it's not 
clear to me why this difference is only visible for VM migration (as opposed 
to VM start, for example), and only for PV guests (as opposed to HVM guests).

Marcus



Re: [Xen-devel] [PATCH 3/3] xen/block: add multi-page ring support

2015-06-23 Thread Marcus Granado
ed with sequential reads of many different block sizes and io 
depths, and we only spotted it because our synthetic fio load used a 
wide range of parameters with sequential reads. It may also be 
specific to the way that Linux handles this situation.


(B) - in other situations with sequential reads (block sizes between 8KiB 
and 128KiB), we observed that storage throughput with 1 page was around 
50% worse than with 8 pages. Again, this seems related to merges 
occurring with 1 page but not with 8 pages, and I would appreciate 
potential explanations.


For sequential reads, arguably the performance difference spotted in (A) 
is counterbalanced by the performance difference in (B), and they 
cancel each other out if all block sizes are considered together. For 
random reads, 8-page rings were similar or superior to 1-page rings in 
all tested conditions.


All things considered, we believe that the multi-page ring patches improve 
storage performance (apart from case (A)) and therefore should be good 
to merge.


Marcus



Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

2015-06-30 Thread Marcus Granado

On 13/05/15 11:29, Bob Liu wrote:


On 04/28/2015 03:46 PM, Arianna Avanzini wrote:

Hello Christoph,

On 28/04/2015 09:36, Christoph Hellwig wrote:

What happened to this patchset?



It was passed on to Bob Liu, who published a follow-up patchset here: 
https://lkml.org/lkml/2015/2/15/46



Right, and then I was interrupted by another xen-block feature: the 'multi-page' 
ring.
I will get back to this patchset soon. Thank you!

-Bob



Hi,

Our measurements for the multiqueue patch indicate a clear improvement 
in iops when more queues are used.


The measurements were obtained under the following conditions:

- using blkback as the dom0 backend with the multiqueue patch applied to 
a dom0 kernel 4.0 on 8 vcpus.


- using a recent Ubuntu 15.04 kernel 3.19 with the multiqueue frontend 
patch applied, running as a guest with 4 vcpus


- using a micron RealSSD P320h as the underlying local storage on a Dell 
PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.


- fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest. 
We used direct_io to skip caching in the guest and ran fio for 60s, 
reading block sizes ranging from 512 bytes to 4MiB. A queue depth of 32 
per queue was used to saturate individual vcpus in the guest.


We were interested in observing storage iops for different values of 
block sizes. Our expectation was that iops would improve when increasing 
the number of queues, because both the guest and dom0 would be able to 
make use of more vcpus to handle these requests.


These are the results (as aggregate iops for all the fio threads) that 
we got for the conditions above with sequential reads:


fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
     8          32        512          158K          264K
     8          32         1K          157K          260K
     8          32         2K          157K          258K
     8          32         4K          148K          257K
     8          32         8K          124K          207K
     8          32        16K           84K          105K
     8          32        32K           50K           54K
     8          32        64K           24K           27K
     8          32       128K           11K           13K

8-queue iops were better than single-queue iops for all block sizes. 
There were very good improvements as well for sequential writes with 
block size 4K (from 80K iops with a single queue to 230K iops with 8 
queues), and no regressions were visible in any measurement performed.


Marcus
