Re: [Xen-devel] [Xen-unstable] boot crash while loading AMD microcode due to commit "microcode/amd: fix memory leak"

2019-08-30 Thread Sander Eikelenboom
On 30/08/2019 09:49, Jan Beulich wrote:
> On 30.08.2019 04:09, Chao Gao wrote:
>> On Fri, Aug 30, 2019 at 01:04:54AM +0200, Sander Eikelenboom wrote:
>>> L.S.,
>>>
>>> While testing xen-unstable, my AMD system crashes during early boot while 
>>> loading microcode with an "Early fatal page fault".
>>> Reverting commit de45e3ff37bb1602796054afabfa626ea5661c45 "microcode/amd: 
>>> fix memory leak" fixes the boot issue.
>>
>> Sorry for this inconvenience.
>>
>> Could you apply the patch attached and try it again?
> 
> I'm inclined to take this fix even without waiting for Sander's
> feedback (and simply implying your S-o-b). Andrew - what do you
> think?
> 
> Jan
> 

Just tested and it works for me, thanks!

--
Sander


Re: [Xen-devel] [PATCH 0/2] tools/shim: Bodge things harder

2019-09-02 Thread Sander Eikelenboom
On 02/09/2019 18:41, Andrew Cooper wrote:
> This logic is all terrible.  This series should resolve the reported build
> failure, but definitely isn't a "proper" fix.
> 
> Andrew Cooper (2):
>   tools/shim: Fix race condition creating linkfarm.stamp
>   tools/shim: Apply more duct tape to the linkfarm logic
> 
>  tools/firmware/xen-dir/Makefile | 27 +++
>  1 file changed, 23 insertions(+), 4 deletions(-)
> 

Thanks Andrew, just tested and it works for me!

--
Sander


[Xen-devel] Xen-unstable: regression when trying to shutdown HVM guest with pci passthrough

2019-09-29 Thread Sander Eikelenboom
Hi Anthony,

While testing I encountered a problem with my HVM guests which use PCI 
passthrough.
When trying to shut down such a guest, it stays in the "---s--" runstate 
indefinitely.

On the guest console I get:
[  518.587669] xenbus: xenbus_dev_shutdown: device/pci/0: Initialising != 
Connected, skipping
[  518.674870] ACPI: Preparing to enter system sleep state S5
[  518.683952] reboot: Power down

When trying to destroy the stuck guest I get:
libxl: error: libxl_domain.c:1165:destroy_domid_pci_done: Domain 9:Pci 
shutdown failed
libxl: error: libxl_domain.c:1089:domain_destroy_callback: Domain 
9:Unable to destroy guest
libxl: error: libxl_domain.c:1016:domain_destroy_cb: Domain 
9:Destruction of domain failed destroy failed (rc=-9)
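
For context, the chain these messages come from looks roughly like the toy
model below (function names are taken from the errors above; the bodies are
my simplification of how the rc propagates, not the real libxl code):

#include <stdio.h>

/* Toy model: a DM/QMP timeout on the PCI shutdown path is returned as an
 * error even though we are on the forced destroy path, and the rc then
 * bubbles up until domain destruction as a whole is reported as failed. */
static int pci_shutdown(int force)
{
    (void)force;
    return -9;                          /* pretend the DM timed out */
}

static int destroy_domid_pci_done(int rc)
{
    if (rc)
        fprintf(stderr, "Pci shutdown failed\n");
    return rc;                          /* error is no longer swallowed */
}

static int domain_destroy_callback(void)
{
    int rc = destroy_domid_pci_done(pci_shutdown(/* force = */ 1));
    if (rc)
        fprintf(stderr, "Unable to destroy guest\n");
    return rc;
}

int main(void)
{
    if (domain_destroy_callback())
        fprintf(stderr, "Destruction of domain failed (rc=-9)\n");
    return 0;
}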

Bisection turned up commit fae4880c45fe015e567afa223f78bf17a6d98e1b "libxl_pci: 
Use ev_qmp for pci_remove" as the culprit.

--
Sander


Re: [Xen-devel] [PATCH 2/2] libxl_pci: Fix guest shutdown with PCI PT attached

2019-09-30 Thread Sander Eikelenboom
On 30/09/2019 19:23, Anthony PERARD wrote:
> Before the problematic commit, libxl used to ignore errors when
> destroying (force == true) a passthrough device. If the DM failed to
> detach the PCI device within the allowed time, the timeout error made
> us skip part of pci_remove_*, but it was also raised up to the caller
> of libxl__device_pci_destroy_all, libxl__destroy_domid, and thus the
> destruction of the domain failed.
> 
> When a *pci_destroy* function is called (so we have force == true),
> errors should mostly be ignored. If the DM didn't confirm that the
> device is removed, we print a warning and keep going.
> The patch reorders the functions so that pci_remove_timeout() calls
> pci_remove_detatched(), as is done when the DM calls are successful.
> 
> We also clean up the QMP state and associated timeouts earlier, as
> soon as they are no longer needed.
> 
> Reported-by: Sander Eikelenboom 
> Fixes: fae4880c45fe015e567afa223f78bf17a6d98e1b
> Signed-off-by: Anthony PERARD 

Hi Anthony,

Just tested and it works for me, thanks!
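
For reference, the rule the patch restores boils down to the following
(a standalone paraphrase of the commit message above, not the actual
libxl code):

#include <stdbool.h>
#include <stdio.h>

/* force == true: destroy path -- a DM timeout is downgraded to a warning
 * and removal continues. force == false: hot-unplug path -- the error is
 * propagated to the caller. */
static int pci_remove_timeout_policy(bool force, int err)
{
    if (force) {
        fprintf(stderr, "WARNING: DM did not confirm device removal (%d)\n",
                err);
        return 0;          /* keep going, per the commit message above */
    }
    return err;
}

int main(void)
{
    printf("destroy path rc:    %d\n", pci_remove_timeout_policy(true, -9));
    printf("hot-unplug path rc: %d\n", pci_remove_timeout_policy(false, -9));
    return 0;
}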

--
Sander


[Xen-devel] Xen-unstable PVHdom0: Assertion 'IS_ALIGNED(dfn_x(dfn), (1ul << page_order))' failed at iommu.c:324

2019-01-18 Thread Sander Eikelenboom
Hi Roger,

I gave PVH dom0 a spin, to see how far I would get.
With current xen-unstable unfortunately not that far: I got the splat below.

If you need more info, or would like me to test a patch (or some other git 
tree/branch), 
I will be happy to give it a spin!
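
For reference, the failing check is the alignment assertion in iommu_map();
in miniature (the macro body mirrors Xen's IS_ALIGNED, reproduced here so
this compiles standalone, and the numbers are made up):

#include <stdio.h>

#define IS_ALIGNED(val, align) (((val) & ((align) - 1)) == 0)

int main(void)
{
    unsigned long dfn = 0x12345;   /* hypothetical unaligned frame number */
    unsigned int page_order = 9;   /* 2 MiB superpage = 512 x 4 KiB pages */

    /* iommu_map() asserts this for both dfn and mfn; a caller passing an
     * unaligned start frame with a non-zero order trips the assertion. */
    printf("IS_ALIGNED(dfn, 1ul << page_order) = %d\n",
           (int)IS_ALIGNED(dfn, 1ul << page_order));
    return 0;
}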

--
Sander


 __  ___  __ ___  
 \ \/ /___ _ __   | || |  / |___ \  / _ \_ __ ___ 
  \  // _ \ '_ \  | || |_ | | __) || | | |__| '__/ __|
  /  \  __/ | | | |__   _|| |/ __/ | |_| |__| | | (__ 
 /_/\_\___|_| |_||_|(_)_|_(_)___/   |_|  \___|
  
(XEN) [001a2edd3456] Xen version 4.12.0-rc (r...@dyndns.org) (gcc (Debian 
6.3.0-18+deb9u1) 6.3.0 20170516) debug=y  Fri Jan 18 12:40:33 CET 2019
(XEN) [001a364d8a16] Latest ChangeSet: Mon Jan 14 14:59:37 2019 + 
git:50923ade7a-dirty
(XEN) [001a3b103720] Bootloader: GRUB 2.02~beta3-5+deb9u1
(XEN) [001a3e301032] Command line: dom0_mem=2048M,max:2048M loglvl=all 
console_timestamps=datems vga=gfx-1280x1024x32 no-cpuidle com1=38400,8n1 
console=vga,com1 ivrs_ioapic[6]=00:14.0 iommu=on,verbose,debug 
conring_size=128k ucode=scan sched=credit2 gnttab_max_frames=64 dom0=pvh
(XEN) [001a4c523f88] Xen image load base address: 0
(XEN) [001a4f25e0ce] Video information:
(XEN) [001a516122f2]  VGA is graphics mode 1280x1024, 32 bpp
(XEN) [001a54a70e70]  VBE/DDC methods: V2; EDID transfer time: 1 seconds
(XEN) [001a58853aa0] Disc information:
(XEN) [001a5ab3d4b3]  Found 4 MBR signatures
(XEN) [001a5d2ea75b]  Found 4 EDD information structures
(XEN) [001a6041c91e] Xen-e820 RAM map:
(XEN) [001a627060dd]   - 00096400 (usable)
(XEN) [001a660275e3]  00096400 - 000a (reserved)
(XEN) [001a69add13e]  000e4000 - 0010 (reserved)
(XEN) [001a6d594f0e]  0010 - c7f9 (usable)
(XEN) [001a70eb6416]  c7f9 - c7f9e000 (ACPI data)
(XEN) [001a74a37cce]  c7f9e000 - c7fe (ACPI NVS)
(XEN) [001a784ef83d]  c7fe - c800 (reserved)
(XEN) [001a7bfa7c9d]  ffe0 - 0001 (reserved)
(XEN) [001a7fa5daee]  0001 - 00053800 (usable)
(XEN) [001a88afa99a] New Xen image base address: 0xc780
(XEN) [001a8be8c9cd] ACPI: RSDP 000FB100, 0014 (r0 ACPIAM)
(XEN) [001a8f15516d] ACPI: RSDT C7F9, 0048 (r1 MSIOEMSLIC  20100913 
MSFT   97)
(XEN) [001a93d81daa] ACPI: FACP C7F90200, 0084 (r1 7640MS A7640100 20100913 
MSFT   97)
(XEN) [001a989ad923] ACPI: DSDT C7F905E0, 9427 (r1  A7640 A7640100  100 
INTL 20051117)
(XEN) [001a9d5d875a] ACPI: FACS C7F9E000, 0040
(XEN) [001a9ff1c7b3] ACPI: APIC C7F90390, 0088 (r1 7640MS A7640100 20100913 
MSFT   97)
(XEN) [001aa4b484e6] ACPI: MCFG C7F90420, 003C (r1 7640MS OEMMCFG  20100913 
MSFT   97)
(XEN) [001aa9773f16] ACPI: SLIC C7F90460, 0176 (r1 MSIOEMSLIC  20100913 
MSFT   97)
(XEN) [001aae39ed46] ACPI: OEMB C7F9E040, 0072 (r1 7640MS A7640100 20100913 
MSFT   97)
(XEN) [001ab2fcbdf3] ACPI: SRAT C7F9A5E0, 0108 (r3 AMDFAM_F_102 
AMD 1)
(XEN) [001ab7bf6e3a] ACPI: HPET C7F9A6F0, 0038 (r1 7640MS OEMHPET  20100913 
MSFT   97)
(XEN) [001abc8243f6] ACPI: IVRS C7F9A730, 0108 (r1  AMD RD890S   202031 
AMD 0)
(XEN) [001ac144f4d5] ACPI: SSDT C7F9A840, 0DA4 (r1 A M I  POWERNOW1 
AMD 1)
(XEN) [001ac607b9b8] System RAM: 20479MB (20970648kB)
(XEN) [001ad03a902b] SRAT: PXM 0 -> APIC 00 -> Node 0
(XEN) [001ad3277e5e] SRAT: PXM 0 -> APIC 01 -> Node 0
(XEN) [001ad6148a58] SRAT: PXM 0 -> APIC 02 -> Node 0
(XEN) [001ad9019acb] SRAT: PXM 0 -> APIC 03 -> Node 0
(XEN) [001adbeea038] SRAT: PXM 0 -> APIC 04 -> Node 0
(XEN) [001adedbaa18] SRAT: PXM 0 -> APIC 05 -> Node 0
(XEN) [001ae1c8a3ab] SRAT: Node 0 PXM 0 0-a
(XEN) [001ae4698e55] SRAT: Node 0 PXM 0 10-c800
(XEN) [001ae76fe63a] SRAT: Node 0 PXM 0 1-53800
(XEN) [001aeaa91502] NUMA: Allocated memnodemap from 5334dc000 - 5334e2000
(XEN) [001aeea0cfeb] NUMA: Using 8 for the hash shift.
(XEN) [001b34b66796] Domain heap initialised
(XEN) [001b37313a56] Allocated console ring of 128 KiB.
(XEN) [001b4d5599fd] vesafb: framebuffer at 0xd000, mapped to 
0x82c000201000, using 6144k, total 16384k
(XEN) [001b5322fbe3] vesafb: mode is 1280x1024x32, linelength=5120, font 
8x16
(XEN) [001b5740c1a6] vesafb: Truecolor: size=0:8:8:8, shift=0:16:8:0
(XEN) [001b5aecec50] CPU Vendor: AMD, Family 16 (0x10), Model 10 (0xa), 
Stepping 0 (raw 00100fa0)
(XEN) [001b64e9ba53] found SMP MP-table at 000ff780
(XEN) [001b67bd596d] DMI present.
(XEN) [001b69ac60a3] Using APIC driver default
(XEN) [001b6c408c8b] ACPI: PM-Timer IO Port: 0x808 (24 bits)
(XEN) [001b6f8664fd] ACPI: SLEEP INFO: pm1x_cnt[1:804,1:

Re: [Xen-devel] Xen-unstable PVHdom0: Assertion 'IS_ALIGNED(dfn_x(dfn), (1ul << page_order))' failed at iommu.c:324

2019-01-18 Thread Sander Eikelenboom
On 18/01/2019 13:50, Roger Pau Monné wrote:
> On Fri, Jan 18, 2019 at 01:03:04PM +0100, Sander Eikelenboom wrote:
>> Hi Roger,
>>
>> I gave PVH dom0 a spin, see how far I would get.
> 
> Thanks!
> 
>> With current xen-unstable unfortunately not that far, i got the splat below.
> 
> Yes, this was already reported:
> 
> https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg01030.html
>> If you need more info, would like me to test a patch (or some other git 
>> tree/branch), 
>> I will be happy to give it a spin !
> 
> Paul is working on a fix, but in the meantime just removing the
> assertions should be fine:
> 
> ---8<---
> diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
> index bd1af35a13..98e6fc35e2 100644
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -321,9 +321,6 @@ int iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
>      if ( !iommu_enabled || !hd->platform_ops )
>          return 0;
>  
> -    ASSERT(IS_ALIGNED(dfn_x(dfn), (1ul << page_order)));
> -    ASSERT(IS_ALIGNED(mfn_x(mfn), (1ul << page_order)));
> -
>      for ( i = 0; i < (1ul << page_order); i++ )
>      {
>          rc = hd->platform_ops->map_page(d, dfn_add(dfn, i), mfn_add(mfn, i),
> 

I gave that a spin and I now get a seemingly endless stream of IO_PAGE_FAULTs.

 __  ___  __ ___  
 \ \/ /___ _ __   | || |  / |___ \  / _ \_ __ ___ 
  \  // _ \ '_ \  | || |_ | | __) || | | |__| '__/ __|
  /  \  __/ | | | |__   _|| |/ __/ | |_| |__| | | (__ 
 /_/\_\___|_| |_||_|(_)_|_(_)___/   |_|  \___|
  
(XEN) [001a375bd3d2] Xen version 4.12.0-rc (r...@dyndns.org) (gcc (Debian 
6.3.0-18+deb9u1) 6.3.0 20170516) debug=y  Fri Jan 18 14:47:30 CET 2019
(XEN) [001a3ecc2ebe] Latest ChangeSet: Mon Jan 14 14:59:37 2019 + 
git:50923ade7a-dirty
(XEN) [001a438ee0b2] Bootloader: GRUB 2.02~beta3-5+deb9u1
(XEN) [001a46aeb5fb] Command line: dom0_mem=2048M,max:2048M loglvl=all 
console_timestamps=datems vga=gfx-1280x1024x32 no-cpuidle com1=38400,8n1 
console=vga,com1 ivrs_ioapic[6]=00:14.0 iommu=on,verbose,debug 
conring_size=128k ucode=scan sched=credit2 gnttab_max_frames=64 dom0=pvh
(XEN) [001a54d0d32a] Xen image load base address: 0
(XEN) [001a57a46f8b] Video information:
(XEN) [001a59dfcfe2]  VGA is graphics mode 1280x1024, 32 bpp
(XEN) [001a5d25b312]  VBE/DDC methods: V2; EDID transfer time: 1 seconds
(XEN) [001a6103dee2] Disc information:
(XEN) [001a633271db]  Found 4 MBR signatures
(XEN) [001a65ad4242]  Found 4 EDD information structures
(XEN) [001a68c0622e] Xen-e820 RAM map:
(XEN) [001a6aeef328]   - 00096400 (usable)
(XEN) [001a6e80ff92]  00096400 - 000a (reserved)
(XEN) [001a722c8335]  000e4000 - 0010 (reserved)
(XEN) [001a75d7e455]  0010 - c7f9 (usable)
(XEN) [001a796a0d3b]  c7f9 - c7f9e000 (ACPI data)
(XEN) [001a7d22145b]  c7f9e000 - c7fe (ACPI NVS)
(XEN) [001a80cd91ce]  c7fe - c800 (reserved)
(XEN) [001a84791428]  ffe0 - 0001 (reserved)
(XEN) [001a88248f32]  0001 - 00053800 (usable)
(XEN) [001a91301890] New Xen image base address: 0xc780
(XEN) [001a946946fe] ACPI: RSDP 000FB100, 0014 (r0 ACPIAM)
(XEN) [001a9795cb9d] ACPI: RSDT C7F9, 0048 (r1 MSIOEMSLIC  20100913 
MSFT   97)
(XEN) [001a9c58a075] ACPI: FACP C7F90200, 0084 (r1 7640MS A7640100 20100913 
MSFT   97)
(XEN) [001aa11b5236] ACPI: DSDT C7F905E0, 9427 (r1  A7640 A7640100  100 
INTL 20051117)
(XEN) [001aa5de0cca] ACPI: FACS C7F9E000, 0040
(XEN) [001aa8724396] ACPI: APIC C7F90390, 0088 (r1 7640MS A7640100 20100913 
MSFT   97)
(XEN) [001aad34ed70] ACPI: MCFG C7F90420, 003C (r1 7640MS OEMMCFG  20100913 
MSFT   97)
(XEN) [001ab1f7b106] ACPI: SLIC C7F90460, 0176 (r1 MSIOEMSLIC  20100913 
MSFT   97)
(XEN) [001ab6ba8663] ACPI: OEMB C7F9E040, 0072 (r1 7640MS A7640100 20100913 
MSFT   97)
(XEN) [001abb7d4723] ACPI: SRAT C7F9A5E0, 0108 (r3 AMDFAM_F_102 
AMD 1)
(XEN) [001ac0400a2a] ACPI: HPET C7F9A6F0, 0038 (r1 7640MS OEMHPET  20100913 
MSFT   97)
(XEN) [001ac502c9f6] ACPI: IVRS C7F9A730, 0108 (r1  AMD RD890S   202031 
AMD 0)
(XEN) [001ac9c581c5] ACPI: SSDT C7F9A840, 0DA4 (r1 A M I  POWERNOW1 
AMD 1)
(XEN) [001ace883d6e] System RAM: 20479MB (20970648kB)
(XEN) [001ad8bcf4cb] SRAT: PXM 0 -> APIC 00 -> Node 0
(XEN) [001adbaa0505] SRAT: PXM 0 -> APIC 01 -> Node 0
(XEN) [001ade96f9ea] SRAT: PXM 0 -> APIC 02 ->

Re: [Xen-devel] Xen-unstable PVHdom0: Assertion 'IS_ALIGNED(dfn_x(dfn), (1ul << page_order))' failed at iommu.c:324

2019-01-20 Thread Sander Eikelenboom
On 18/01/2019 18:56, Roger Pau Monné wrote:
> On Fri, Jan 18, 2019 at 03:17:57PM +0100, Sander Eikelenboom wrote:
>> On 18/01/2019 13:50, Roger Pau Monné wrote:
>>> On Fri, Jan 18, 2019 at 01:03:04PM +0100, Sander Eikelenboom wrote:
>>>> Hi Roger,
>>>>
>>>> I gave PVH dom0 a spin, see how far I would get.
>>>
>>> Thanks!
>>>
>>>> With current xen-unstable unfortunately not that far, i got the splat 
>>>> below.
>>>
>>> Yes, this was already reported:
>>>
>>> https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg01030.html
>>>> If you need more info, would like me to test a patch (or some other git 
>>>> tree/branch), 
>>>> I will be happy to give it a spin !
>>>
>>> Paul is working on a fix, but in the meantime just removing the
>>> assertions should be fine:
>>>
>>> ---8<---
>>> diff --git a/xen/drivers/passthrough/iommu.c 
>>> b/xen/drivers/passthrough/iommu.c
>>> index bd1af35a13..98e6fc35e2 100644
>>> --- a/xen/drivers/passthrough/iommu.c
>>> +++ b/xen/drivers/passthrough/iommu.c
>>> @@ -321,9 +321,6 @@ int iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
>>>  if ( !iommu_enabled || !hd->platform_ops )
>>>  return 0;
>>>  
>>> -ASSERT(IS_ALIGNED(dfn_x(dfn), (1ul << page_order)));
>>> -ASSERT(IS_ALIGNED(mfn_x(mfn), (1ul << page_order)));
>>> -
>>>  for ( i = 0; i < (1ul << page_order); i++ )
>>>  {
>>>  rc = hd->platform_ops->map_page(d, dfn_add(dfn, i), mfn_add(mfn, 
>>> i),
>>>
>>
>> I gave that a spin and i now get a seemingly endless stream of IO_PAGE_FAULTs
> 
> You shouldn't get those page faults since they are for addresses that
> belong to a reserved region, and that should be mapped into the p2m.
> I've just tested on my AMD box and I'm also seeing errors (albeit
> different ones), so I guess something broke since I last fixed PVH
> Dom0 to boot on AMD hardware.
> 
> I've also tested commit:
> 
> commit fad6ba64a8c98bebb9374f390cc255fac05237ab (HEAD)
> Author: Roger Pau Monné 
> Date:   Fri Nov 30 12:10:00 2018 +0100
> amd/iommu: skip host bridge devices when updating IOMMU page tables
> 
> And it works on my AMD box and I'm able to boot as a PVH Dom0. Can you
> give this commit a spin?
> 
> Thanks, Roger.
> 

Hi Roger,

Tested that commit, but that didn't help.

I added some debug logging (to xen-unstable + Paul's patch) and found that the 
devices giving the IO_PAGE_FAULTs are the onboard USB controllers 
(devfn 0x90, 0x92, 0x98 and 0xa5).

If I skip calling "amd_iommu_setup_domain_device" for these devices I can at 
least get it to boot a Linux kernel 
(which then gives problems with the SATA controller, but that would be a next 
step). The patch I used is below.

I attached the output from lspci -vvvknn, perhaps you can spot something odd?
When booting dom0 as PV instead of PVH everything boots and works fine.

--
Sander




diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 33a3798f36..cc82c4b08d 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ @@ static int amd_iommu_add_device(u8 devfn, struct pci_dev *pdev)
         return -ENODEV;
     }
 
+    if (PCI_SLOT(devfn) == 0x12 || PCI_SLOT(devfn) == 0x13 ||
+        PCI_SLOT(devfn) == 0x16 ||
+        (PCI_SLOT(devfn) == 0x14 && PCI_FUNC(devfn) == 5)) {
+        AMD_IOMMU_DEBUG("%s ?!?!? SKIPPING %d/%d %04x:%02x:%02x.%u\n",
+                        __func__, pdev->domain->domain_id,
+                        is_hardware_domain(pdev->domain) ? 1 : 0,
+                        pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
+
+        return 0;
+    }
+
     amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
     return 0;
 }
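
For reference, the devfn values mentioned above decode to PCI slot.function
as follows -- a standalone check using the standard devfn encoding, the same
one the PCI_SLOT()/PCI_FUNC() macros in the patch rely on:

#include <stdio.h>

#define PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn) ((devfn) & 0x07)

int main(void)
{
    const unsigned int devfns[] = { 0x90, 0x92, 0x98, 0xa5 };

    /* Prints 12.0, 12.2, 13.0 and 14.5 -- matching slots the
     * debug patch above skips. */
    for (unsigned int i = 0; i < sizeof(devfns) / sizeof(devfns[0]); i++)
        printf("devfn 0x%02x -> %02x.%u\n", devfns[i],
               PCI_SLOT(devfns[i]), PCI_FUNC(devfns[i]));
    return 0;
}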
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD/ATI] RD890 
Northbridge only single slot PCI-e GFX Hydra part [1002:5a11] (rev 02)
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] RD890 Northbridge 
only single slot PCI-e GFX Hydra part [1002:5a11]
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- SERR- 
Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee0100c  Data: 4128
Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+

00:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [

Re: [Xen-devel] Xen-unstable PVHdom0: Assertion 'IS_ALIGNED(dfn_x(dfn), (1ul << page_order))' failed at iommu.c:324

2019-01-23 Thread Sander Eikelenboom
On 23/01/2019 19:25, Roger Pau Monné wrote:
> On Wed, Jan 23, 2019 at 12:39:21AM +0100, Sander Eikelenboom wrote:
>> On 22/01/2019 17:14, Roger Pau Monné wrote:
>>> On Sun, Jan 20, 2019 at 11:09:25PM +0100, Sander Eikelenboom wrote:
>>>> On 18/01/2019 18:56, Roger Pau Monné wrote:
>>>>> On Fri, Jan 18, 2019 at 03:17:57PM +0100, Sander Eikelenboom wrote:
>>>>>> On 18/01/2019 13:50, Roger Pau Monné wrote:
>>>>>>> On Fri, Jan 18, 2019 at 01:03:04PM +0100, Sander Eikelenboom wrote:
>>>>>>>> Hi Roger,
>>>>>>>>
>>>>>>>> I gave PVH dom0 a spin, see how far I would get.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>> With current xen-unstable unfortunately not that far, i got the splat 
>>>>>>>> below.
>>>>>>>
>>>>>>> Yes, this was already reported:
>>>>>>>
>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg01030.html
>>>>>>>> If you need more info, would like me to test a patch (or some other 
>>>>>>>> git tree/branch), 
>>>>>>>> I will be happy to give it a spin !
>>>>>>>
>>>>>>> Paul is working on a fix, but in the meantime just removing the
>>>>>>> assertions should be fine:
>>>>>>>
>>>>>>> ---8<---
>>>>>>> diff --git a/xen/drivers/passthrough/iommu.c 
>>>>>>> b/xen/drivers/passthrough/iommu.c
>>>>>>> index bd1af35a13..98e6fc35e2 100644
>>>>>>> --- a/xen/drivers/passthrough/iommu.c
>>>>>>> +++ b/xen/drivers/passthrough/iommu.c
>>>>>>> @@ -321,9 +321,6 @@ int iommu_map(struct domain *d, dfn_t dfn, mfn_t 
>>>>>>> mfn,
>>>>>>>  if ( !iommu_enabled || !hd->platform_ops )
>>>>>>>  return 0;
>>>>>>>  
>>>>>>> -ASSERT(IS_ALIGNED(dfn_x(dfn), (1ul << page_order)));
>>>>>>> -ASSERT(IS_ALIGNED(mfn_x(mfn), (1ul << page_order)));
>>>>>>> -
>>>>>>>  for ( i = 0; i < (1ul << page_order); i++ )
>>>>>>>  {
>>>>>>>  rc = hd->platform_ops->map_page(d, dfn_add(dfn, i), 
>>>>>>> mfn_add(mfn, i),
>>>>>>>
>>>>>>
>>>>>> I gave that a spin and i now get a seemingly endless stream of 
>>>>>> IO_PAGE_FAULTs
>>>>>
>>>>> You shouldn't get those page faults since they are for addresses that
>>>>> belong to a reserved region, and that should be mapped into the p2m.
>>>>> I've just tested on my AMD box and I'm also seeing errors (albeit
>>>>> different ones), so I guess something broke since I last fixed PVH
>>>>> Dom0 to boot on AMD hardware.
>>>>>
>>>>> I've also tested commit:
>>>>>
>>>>> commit fad6ba64a8c98bebb9374f390cc255fac05237ab (HEAD)
>>>>> Author: Roger Pau Monné 
>>>>> Date:   Fri Nov 30 12:10:00 2018 +0100
>>>>> amd/iommu: skip host bridge devices when updating IOMMU page tables
>>>>>
>>>>> And it works on my AMD box and I'm able to boot as a PVH Dom0. Can you
>>>>> give this commit a spin?
>>>>>
>>>>> Thanks, Roger.
>>>>>
>>>>
>>>> Hi Roger,
>>>>
>>>> Tested that commit, but that didn't help.
>>>
>>> Thanks! Sorry for the delay, I got sidetracked with something else.
>>
>> No problem, it's not too urgent and probably a busy time with the remaining 
>> 4.12 stuff.
>>  
>>> Can you please post the serial log when using the above commit?
>>
>> Sure, I attached a log of:
>>  - fad6ba64a8c98bebb9374f390cc255fac05237ab  dom0 PVH - unsuccessful boot
>>  - fad6ba64a8c98bebb9374f390cc255fac05237ab  dom0 PV  - successful boot
> 
> Thanks. So you get the same IO page faults.
> 
> I don't seem to be able to reproduce this behaviour on my AMD box, but
> that might be just luck. I've been finding some issues today related
> to the IOMMU, could you give the following patch a spin and paste the
> serial log that you get.

Hi Roger,

Sure, on top of what?
- fad6ba64a8c98bebb9374f390cc255fac05237ab?
- xen-unstable?
- xen-unstable + Paul's patch?

--
Sander

> Thanks, Roger.
> ---8<---
> diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
> index e40d7a7d7b..4fd75d4105 100644
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -241,10 +241,11 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>  
>          if ( !hwdom_iommu_map(d, pfn, max_pfn) )
>              continue;
> -
> +#if 0
>          if ( paging_mode_translate(d) )
>              rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
>          else
> +#endif
>              rc = iommu_map(d, _dfn(pfn), _mfn(pfn), PAGE_ORDER_4K,
>                             IOMMUF_readable | IOMMUF_writable, &flush_flags);
>  
> 
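
What the #if 0 above changes, in miniature -- an illustrative model of the
branch in arch_iommu_hwdom_init(), not the real Xen code: a translated dom0
(PVH) normally gets identity p2m entries, while a PV dom0 gets direct IOMMU
mappings, and the debug patch forces the iommu_map() route for both:

#include <stdbool.h>
#include <stdio.h>

static const char *mapping_route(bool paging_mode_translate, bool patched)
{
    if (paging_mode_translate && !patched)
        return "set_identity_p2m_entry";   /* PVH dom0: map via the p2m */
    return "iommu_map";                    /* PV dom0, or with the #if 0 */
}

int main(void)
{
    printf("PVH dom0, unpatched: %s\n", mapping_route(true, false));
    printf("PVH dom0, patched:   %s\n", mapping_route(true, true));
    printf("PV dom0:             %s\n", mapping_route(false, false));
    return 0;
}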



Re: [Xen-devel] Xen-unstable PVHdom0: Assertion 'IS_ALIGNED(dfn_x(dfn), (1ul << page_order))' failed at iommu.c:324

2019-01-24 Thread Sander Eikelenboom
On 24/01/2019 08:50, Roger Pau Monné wrote:
> On Wed, Jan 23, 2019 at 08:56:48PM +0100, Sander Eikelenboom wrote:
>> On 23/01/2019 19:25, Roger Pau Monné wrote:
>>> On Wed, Jan 23, 2019 at 12:39:21AM +0100, Sander Eikelenboom wrote:
>>>> On 22/01/2019 17:14, Roger Pau Monné wrote:
>>>>> On Sun, Jan 20, 2019 at 11:09:25PM +0100, Sander Eikelenboom wrote:
>>>>>> On 18/01/2019 18:56, Roger Pau Monné wrote:
>>>>>>> On Fri, Jan 18, 2019 at 03:17:57PM +0100, Sander Eikelenboom wrote:
>>>>>>>> On 18/01/2019 13:50, Roger Pau Monné wrote:
>>>>>>>>> On Fri, Jan 18, 2019 at 01:03:04PM +0100, Sander Eikelenboom wrote:
>>>>>>>>>> Hi Roger,
>>>>>>>>>>
>>>>>>>>>> I gave PVH dom0 a spin, see how far I would get.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>>> With current xen-unstable unfortunately not that far, i got the 
>>>>>>>>>> splat below.
>>>>>>>>>
>>>>>>>>> Yes, this was already reported:
>>>>>>>>>
>>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg01030.html
>>>>>>>>>> If you need more info, would like me to test a patch (or some other 
>>>>>>>>>> git tree/branch), 
>>>>>>>>>> I will be happy to give it a spin !
>>>>>>>>>
>>>>>>>>> Paul is working on a fix, but in the meantime just removing the
>>>>>>>>> assertions should be fine:
>>>>>>>>>
>>>>>>>>> ---8<---
>>>>>>>>> diff --git a/xen/drivers/passthrough/iommu.c 
>>>>>>>>> b/xen/drivers/passthrough/iommu.c
>>>>>>>>> index bd1af35a13..98e6fc35e2 100644
>>>>>>>>> --- a/xen/drivers/passthrough/iommu.c
>>>>>>>>> +++ b/xen/drivers/passthrough/iommu.c
>>>>>>>>> @@ -321,9 +321,6 @@ int iommu_map(struct domain *d, dfn_t dfn, mfn_t 
>>>>>>>>> mfn,
>>>>>>>>>  if ( !iommu_enabled || !hd->platform_ops )
>>>>>>>>>  return 0;
>>>>>>>>>  
>>>>>>>>> -ASSERT(IS_ALIGNED(dfn_x(dfn), (1ul << page_order)));
>>>>>>>>> -ASSERT(IS_ALIGNED(mfn_x(mfn), (1ul << page_order)));
>>>>>>>>> -
>>>>>>>>>  for ( i = 0; i < (1ul << page_order); i++ )
>>>>>>>>>  {
>>>>>>>>>  rc = hd->platform_ops->map_page(d, dfn_add(dfn, i), 
>>>>>>>>> mfn_add(mfn, i),
>>>>>>>>>
>>>>>>>>
>>>>>>>> I gave that a spin and i now get a seemingly endless stream of 
>>>>>>>> IO_PAGE_FAULTs
>>>>>>>
>>>>>>> You shouldn't get those page faults since they are for addresses that
>>>>>>> belong to a reserved region, and that should be mapped into the p2m.
>>>>>>> I've just tested on my AMD box and I'm also seeing errors (albeit
>>>>>>> different ones), so I guess something broke since I last fixed PVH
>>>>>>> Dom0 to boot on AMD hardware.
>>>>>>>
>>>>>>> I've also tested commit:
>>>>>>>
>>>>>>> commit fad6ba64a8c98bebb9374f390cc255fac05237ab (HEAD)
>>>>>>> Author: Roger Pau Monné 
>>>>>>> Date:   Fri Nov 30 12:10:00 2018 +0100
>>>>>>> amd/iommu: skip host bridge devices when updating IOMMU page tables
>>>>>>>
>>>>>>> And it works on my AMD box and I'm able to boot as a PVH Dom0. Can you
>>>>>>> give this commit a spin?
>>>>>>>
>>>>>>> Thanks, Roger.
>>>>>>>
>>>>>>
>>>>>> Hi Roger,
>>>>>>
>>>>>> Tested that commit, but that didn't help.
>>>>>
>>>>> Thanks! Sorry for the delay, I got sidetracked with something else.
>>>>
>>>> No problem, it's not too u

Re: [Xen-devel] Xen-unstable PVHdom0: Assertion 'IS_ALIGNED(dfn_x(dfn), (1ul << page_order))' failed at iommu.c:324

2019-01-24 Thread Sander Eikelenboom
On 24/01/2019 11:11, Roger Pau Monné wrote:
> On Thu, Jan 24, 2019 at 10:25:33AM +0100, Sander Eikelenboom wrote:
>> On 24/01/2019 08:50, Roger Pau Monné wrote:
>>> On Wed, Jan 23, 2019 at 08:56:48PM +0100, Sander Eikelenboom wrote:
>>>> On 23/01/2019 19:25, Roger Pau Monné wrote:
>>>>> On Wed, Jan 23, 2019 at 12:39:21AM +0100, Sander Eikelenboom wrote:
>>>>>> On 22/01/2019 17:14, Roger Pau Monné wrote:
>>>>>>> On Sun, Jan 20, 2019 at 11:09:25PM +0100, Sander Eikelenboom wrote:
>>>>>>>> On 18/01/2019 18:56, Roger Pau Monné wrote:
>>>>>>>>> On Fri, Jan 18, 2019 at 03:17:57PM +0100, Sander Eikelenboom wrote:
>>>>>>>>>> On 18/01/2019 13:50, Roger Pau Monné wrote:
>>>>>>>>>>> On Fri, Jan 18, 2019 at 01:03:04PM +0100, Sander Eikelenboom wrote:
>>>>>>>>>>>> Hi Roger,
>>>>>>>>>>>>
>>>>>>>>>>>> I gave PVH dom0 a spin, see how far I would get.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>>> With current xen-unstable unfortunately not that far, i got the 
>>>>>>>>>>>> splat below.
>>>>>>>>>>>
>>>>>>>>>>> Yes, this was already reported:
>>>>>>>>>>>
>>>>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg01030.html
>>>>>>>>>>>> If you need more info, would like me to test a patch (or some 
>>>>>>>>>>>> other git tree/branch), 
>>>>>>>>>>>> I will be happy to give it a spin !
>>>>>>>>>>>
>>>>>>>>>>> Paul is working on a fix, but in the meantime just removing the
>>>>>>>>>>> assertions should be fine:
>>>>>>>>>>>
>>>>>>>>>>> ---8<---
>>>>>>>>>>> diff --git a/xen/drivers/passthrough/iommu.c 
>>>>>>>>>>> b/xen/drivers/passthrough/iommu.c
>>>>>>>>>>> index bd1af35a13..98e6fc35e2 100644
>>>>>>>>>>> --- a/xen/drivers/passthrough/iommu.c
>>>>>>>>>>> +++ b/xen/drivers/passthrough/iommu.c
>>>>>>>>>>> @@ -321,9 +321,6 @@ int iommu_map(struct domain *d, dfn_t dfn, 
>>>>>>>>>>> mfn_t mfn,
>>>>>>>>>>>  if ( !iommu_enabled || !hd->platform_ops )
>>>>>>>>>>>  return 0;
>>>>>>>>>>>  
>>>>>>>>>>> -ASSERT(IS_ALIGNED(dfn_x(dfn), (1ul << page_order)));
>>>>>>>>>>> -ASSERT(IS_ALIGNED(mfn_x(mfn), (1ul << page_order)));
>>>>>>>>>>> -
>>>>>>>>>>>  for ( i = 0; i < (1ul << page_order); i++ )
>>>>>>>>>>>  {
>>>>>>>>>>>  rc = hd->platform_ops->map_page(d, dfn_add(dfn, i), 
>>>>>>>>>>> mfn_add(mfn, i),
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I gave that a spin and i now get a seemingly endless stream of 
>>>>>>>>>> IO_PAGE_FAULTs
>>>>>>>>>
>>>>>>>>> You shouldn't get those page faults since they are for addresses that
>>>>>>>>> belong to a reserved region, and that should be mapped into the p2m.
>>>>>>>>> I've just tested on my AMD box and I'm also seeing errors (albeit
>>>>>>>>> different ones), so I guess something broke since I last fixed PVH
>>>>>>>>> Dom0 to boot on AMD hardware.
>>>>>>>>>
>>>>>>>>> I've also tested commit:
>>>>>>>>>
>>>>>>>>> commit fad6ba64a8c98bebb9374f390cc255fac05237ab (HEAD)
>>>>>>>>> Author: Roger Pau Monné 
>>>>>>>>> Date:   Fri Nov 30 12:10:00 2018 +0100
>>>>>>>>> amd/iommu: skip host bridge devices when updating IOMMU pa

Re: [Xen-devel] Xen-unstable PVHdom0: Assertion 'IS_ALIGNED(dfn_x(dfn), (1ul << page_order))' failed at iommu.c:324

2019-01-25 Thread Sander Eikelenboom
On 25/01/2019 15:38, Roger Pau Monné wrote:
> On Thu, Jan 24, 2019 at 01:04:31PM +0100, Roger Pau Monné wrote:
>> On Thu, Jan 24, 2019 at 12:55:06PM +0100, Sander Eikelenboom wrote:
>>> On 24/01/2019 11:11, Roger Pau Monné wrote:
>>>> On Thu, Jan 24, 2019 at 10:25:33AM +0100, Sander Eikelenboom wrote:
>>>>> On 24/01/2019 08:50, Roger Pau Monné wrote:
>>>>>> On Wed, Jan 23, 2019 at 08:56:48PM +0100, Sander Eikelenboom wrote:
>>>>>>> On 23/01/2019 19:25, Roger Pau Monné wrote:
>>>>>>>> On Wed, Jan 23, 2019 at 12:39:21AM +0100, Sander Eikelenboom wrote:
>>>>>>>>> On 22/01/2019 17:14, Roger Pau Monné wrote:
>>>>>>>>>> On Sun, Jan 20, 2019 at 11:09:25PM +0100, Sander Eikelenboom wrote:
>>>>>>>>>>> On 18/01/2019 18:56, Roger Pau Monné wrote:
>>>>>>>>>>>> On Fri, Jan 18, 2019 at 03:17:57PM +0100, Sander Eikelenboom wrote:
>>>>>>>>>>>>> On 18/01/2019 13:50, Roger Pau Monné wrote:
>>>>>>>>>>>>>> On Fri, Jan 18, 2019 at 01:03:04PM +0100, Sander Eikelenboom 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> Hi Roger,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I gave PVH dom0 a spin, see how far I would get.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With current xen-unstable unfortunately not that far, i got the 
>>>>>>>>>>>>>>> splat below.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, this was already reported:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg01030.html
>>>>>>>>>>>>>>> If you need more info, would like me to test a patch (or some 
>>>>>>>>>>>>>>> other git tree/branch), 
>>>>>>>>>>>>>>> I will be happy to give it a spin !
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Paul is working on a fix, but in the meantime just removing the
>>>>>>>>>>>>>> assertions should be fine:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ---8<---
>>>>>>>>>>>>>> diff --git a/xen/drivers/passthrough/iommu.c 
>>>>>>>>>>>>>> b/xen/drivers/passthrough/iommu.c
>>>>>>>>>>>>>> index bd1af35a13..98e6fc35e2 100644
>>>>>>>>>>>>>> --- a/xen/drivers/passthrough/iommu.c
>>>>>>>>>>>>>> +++ b/xen/drivers/passthrough/iommu.c
>>>>>>>>>>>>>> @@ -321,9 +321,6 @@ int iommu_map(struct domain *d, dfn_t dfn, 
>>>>>>>>>>>>>> mfn_t mfn,
>>>>>>>>>>>>>>  if ( !iommu_enabled || !hd->platform_ops )
>>>>>>>>>>>>>>  return 0;
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>> -ASSERT(IS_ALIGNED(dfn_x(dfn), (1ul << page_order)));
>>>>>>>>>>>>>> -ASSERT(IS_ALIGNED(mfn_x(mfn), (1ul << page_order)));
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>  for ( i = 0; i < (1ul << page_order); i++ )
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>  rc = hd->platform_ops->map_page(d, dfn_add(dfn, i), 
>>>>>>>>>>>>>> mfn_add(mfn, i),
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I gave that a spin and i now get a seemingly endless stream of 
>>>>>>>>>>>>> IO_PAGE_FAULTs
>>>>>>>>>>>>
>>>>>>>>>>>> You shouldn't ge

[Xen-devel] Xen-unstable staging build broken by pvshim patches.

2019-08-08 Thread Sander Eikelenboom
Hi Andrew,

It seems the pvshim patches in xen-unstable staging break the build on my 
machine.
I cloned a fresh tree to be sure, but haven't checked which of the two commits 
causes it:
060f4eee0fb408b316548775ab921e16b7acd0e0 or 
32b1d62887d01f85f0c1d2e0103f69f74e1f6fa3

--
Sander



[ -d //usr/local/lib/xen/boot ] || 
/usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -d -m0755 -p 
//usr/local/lib/xen/boot
[ -d //usr/local/lib/debug/usr/local/lib/xen/boot ] || 
/usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -d -m0755 -p 
//usr/local/lib/debug/usr/local/lib/xen/boot
[ ! -e hvmloader/hvmloader ] || 
/usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -m0644 -p 
hvmloader/hvmloader //usr/local/lib/xen/boot
/usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -m0644 -p 
seabios-dir/out/bios.bin //usr/local/lib/xen/boot/seabios.bin
/usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -m0644 -p 
xen-dir/xen-shim //usr/local/lib/xen/boot/xen-shim
install: cannot stat 'xen-dir/xen-shim': No such file or directory
make[4]: *** [Makefile:52: install] Error 1
make[4]: Leaving directory '/usr/src/new/xen-unstable/tools/firmware'
make[3]: *** [/usr/src/new/xen-unstable/tools/../tools/Rules.mk:237: 
subdir-install-firmware] Error 2
make[3]: Leaving directory '/usr/src/new/xen-unstable/tools'
make[2]: *** [/usr/src/new/xen-unstable/tools/../tools/Rules.mk:232: 
subdirs-install] Error 2
make[2]: Leaving directory '/usr/src/new/xen-unstable/tools'
make[1]: *** [Makefile:73: install] Error 2
make[1]: Leaving directory '/usr/src/new/xen-unstable/tools'
make: *** [Makefile:131: install-tools] Error 2


Re: [Xen-devel] [SUSPECTED SPAM]Xen-unstable staging build broken by pvshim patches.

2019-08-08 Thread Sander Eikelenboom
On 08/08/2019 23:05, Andrew Cooper wrote:
> On 08/08/2019 21:59, Sander Eikelenboom wrote:
>> Hi Andrew,
>>
>> It seems the pvshim patches in xen-unstable staging break the build on my 
>> machine.
>> I cloned a fresh tree to be sure, haven't checked which of the two commits 
>> causes it:
>> 060f4eee0fb408b316548775ab921e16b7acd0e0 or 
>> 32b1d62887d01f85f0c1d2e0103f69f74e1f6fa3
>>
>> --
>> Sander
>>
>>
>>
>> [ -d //usr/local/lib/xen/boot ] || 
>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -d -m0755 
>> -p //usr/local/lib/xen/boot
>> [ -d //usr/local/lib/debug/usr/local/lib/xen/boot ] || 
>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -d -m0755 
>> -p //usr/local/lib/debug/usr/local/lib/xen/boot
>> [ ! -e hvmloader/hvmloader ] || 
>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -m0644 -p 
>> hvmloader/hvmloader //usr/local/lib/xen/boot
>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -m0644 -p 
>> seabios-dir/out/bios.bin //usr/local/lib/xen/boot/seabios.bin
>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -m0644 -p 
>> xen-dir/xen-shim //usr/local/lib/xen/boot/xen-shim
>> install: cannot stat 'xen-dir/xen-shim': No such file or directory
>> make[4]: *** [Makefile:52: install] Error 1
>> make[4]: Leaving directory '/usr/src/new/xen-unstable/tools/firmware'
>> make[3]: *** [/usr/src/new/xen-unstable/tools/../tools/Rules.mk:237: 
>> subdir-install-firmware] Error 2
>> make[3]: Leaving directory '/usr/src/new/xen-unstable/tools'
>> make[2]: *** [/usr/src/new/xen-unstable/tools/../tools/Rules.mk:232: 
>> subdirs-install] Error 2
>> make[2]: Leaving directory '/usr/src/new/xen-unstable/tools'
>> make[1]: *** [Makefile:73: install] Error 2
>> make[1]: Leaving directory '/usr/src/new/xen-unstable/tools'
>> make: *** [Makefile:131: install-tools] Error 2
> 
> That's weird.
> 
> Do you have the full log?  The real failure was somewhere earlier where
> xen-shim didn't get started.
> 
> ~Andrew
> 

Hmm, I forgot and thus forgot to mention that my build script disables some stuff:
./configure --disable-qemu-traditional --disable-stubdom --disable-docs 
--disable-rombios

Could be that one of those doesn't work anymore.

--
Sander


Re: [Xen-devel] [SUSPECTED SPAM]Xen-unstable staging build broken by pvshim patches.

2019-08-08 Thread Sander Eikelenboom
On 09/08/2019 00:44, Andrew Cooper wrote:
> On 08/08/2019 23:34, Sander Eikelenboom wrote:
>> On 08/08/2019 23:14, Andrew Cooper wrote:
>>> On 08/08/2019 22:16, Sander Eikelenboom wrote:
>>>> On 08/08/2019 23:05, Andrew Cooper wrote:
>>>>> On 08/08/2019 21:59, Sander Eikelenboom wrote:
>>>>>> Hi Andrew,
>>>>>>
>>>>>> It seems the pvshim patches in xen-unstable staging break the build on 
>>>>>> my machine.
>>>>>> I cloned a fresh tree to be sure, haven't checked which of the two 
>>>>>> commits causes it:
>>>>>> 060f4eee0fb408b316548775ab921e16b7acd0e0 or 
>>>>>> 32b1d62887d01f85f0c1d2e0103f69f74e1f6fa3
>>>>>>
>>>>>> --
>>>>>> Sander
>>>>>>
>>>>>>
>>>>>>
>>>>>> [ -d //usr/local/lib/xen/boot ] || 
>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -d 
>>>>>> -m0755 -p //usr/local/lib/xen/boot
>>>>>> [ -d //usr/local/lib/debug/usr/local/lib/xen/boot ] || 
>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -d 
>>>>>> -m0755 -p //usr/local/lib/debug/usr/local/lib/xen/boot
>>>>>> [ ! -e hvmloader/hvmloader ] || 
>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>> -m0644 -p hvmloader/hvmloader //usr/local/lib/xen/boot
>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>> -m0644 -p seabios-dir/out/bios.bin //usr/local/lib/xen/boot/seabios.bin
>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>> -m0644 -p xen-dir/xen-shim //usr/local/lib/xen/boot/xen-shim
>>>>>> install: cannot stat 'xen-dir/xen-shim': No such file or directory
>>>>>> make[4]: *** [Makefile:52: install] Error 1
>>>>>> make[4]: Leaving directory '/usr/src/new/xen-unstable/tools/firmware'
>>>>>> make[3]: *** [/usr/src/new/xen-unstable/tools/../tools/Rules.mk:237: 
>>>>>> subdir-install-firmware] Error 2
>>>>>> make[3]: Leaving directory '/usr/src/new/xen-unstable/tools'
>>>>>> make[2]: *** [/usr/src/new/xen-unstable/tools/../tools/Rules.mk:232: 
>>>>>> subdirs-install] Error 2
>>>>>> make[2]: Leaving directory '/usr/src/new/xen-unstable/tools'
>>>>>> make[1]: *** [Makefile:73: install] Error 2
>>>>>> make[1]: Leaving directory '/usr/src/new/xen-unstable/tools'
>>>>>> make: *** [Makefile:131: install-tools] Error 2
>>>>> That's weird.
>>>>>
>>>>> Do you have the full log?  The real failure was somewhere earlier where
>>>>> xen-shim didn't get started.
>>>>>
>>>>> ~Andrew
>>>>>
>>>> Hmm if forgot and thus forgot to mention my build script disables some 
>>>> stuff:
>>>> ./configure --disable-qemu-traditional --disable-stubdom --disable-docs 
>>>> --disable-rombios
>>>>
>>>> Could be that one of those doesn't work anymore.
>>> The only interesting one would be --disable-rombios, which does make
>>> changes in this area of the build, but everything I changed was inside
>>> the xen-dir/ directory so shouldn't interact.
>>> ~Andrew
>>>
>> It indeed seems to be some interaction with --disable-rombios, with just
>> a plain ./configure it builds fine.
>> Logs when building with --disable-rombios are attached.
> 
> Right.  So the build itself works, but the subsequent `make install` fails.
> 
> And to confirm, a build of 8d54a6adf (the parent of my first shim
> commit) works entirely fine?
> 
> ~Andrew
> 
Just rechecked, and yes that builds and installs fine (with --disable-rombios).

--
Sander


Re: [Xen-devel] [SUSPECTED SPAM]Xen-unstable staging build broken by pvshim patches.

2019-08-13 Thread Sander Eikelenboom
On 13/08/2019 13:21, Andrew Cooper wrote:
> On 09/08/2019 00:28, Sander Eikelenboom wrote:
>> On 09/08/2019 00:44, Andrew Cooper wrote:
>>> On 08/08/2019 23:34, Sander Eikelenboom wrote:
>>>> On 08/08/2019 23:14, Andrew Cooper wrote:
>>>>> On 08/08/2019 22:16, Sander Eikelenboom wrote:
>>>>>> On 08/08/2019 23:05, Andrew Cooper wrote:
>>>>>>> On 08/08/2019 21:59, Sander Eikelenboom wrote:
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>> It seems the pvshim patches in xen-unstable staging break the build on 
>>>>>>>> my machine.
>>>>>>>> I cloned a fresh tree to be sure, haven't checked which of the two 
>>>>>>>> commits causes it:
>>>>>>>> 060f4eee0fb408b316548775ab921e16b7acd0e0 or 
>>>>>>>> 32b1d62887d01f85f0c1d2e0103f69f74e1f6fa3
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sander
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> [ -d //usr/local/lib/xen/boot ] || 
>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -d 
>>>>>>>> -m0755 -p //usr/local/lib/xen/boot
>>>>>>>> [ -d //usr/local/lib/debug/usr/local/lib/xen/boot ] || 
>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install -d 
>>>>>>>> -m0755 -p //usr/local/lib/debug/usr/local/lib/xen/boot
>>>>>>>> [ ! -e hvmloader/hvmloader ] || 
>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>>>> -m0644 -p hvmloader/hvmloader //usr/local/lib/xen/boot
>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>>>> -m0644 -p seabios-dir/out/bios.bin //usr/local/lib/xen/boot/seabios.bin
>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>>>> -m0644 -p xen-dir/xen-shim //usr/local/lib/xen/boot/xen-shim
>>>>>>>> install: cannot stat 'xen-dir/xen-shim': No such file or directory
>>>>>>>> make[4]: *** [Makefile:52: install] Error 1
>>>>>>>> make[4]: Leaving directory '/usr/src/new/xen-unstable/tools/firmware'
>>>>>>>> make[3]: *** [/usr/src/new/xen-unstable/tools/../tools/Rules.mk:237: 
>>>>>>>> subdir-install-firmware] Error 2
>>>>>>>> make[3]: Leaving directory '/usr/src/new/xen-unstable/tools'
>>>>>>>> make[2]: *** [/usr/src/new/xen-unstable/tools/../tools/Rules.mk:232: 
>>>>>>>> subdirs-install] Error 2
>>>>>>>> make[2]: Leaving directory '/usr/src/new/xen-unstable/tools'
>>>>>>>> make[1]: *** [Makefile:73: install] Error 2
>>>>>>>> make[1]: Leaving directory '/usr/src/new/xen-unstable/tools'
>>>>>>>> make: *** [Makefile:131: install-tools] Error 2
>>>>>>> That's weird.
>>>>>>>
>>>>>>> Do you have the full log?  The real failure was somewhere earlier where
>>>>>>> xen-shim didn't get started.
>>>>>>>
>>>>>>> ~Andrew
>>>>>>>
>>>>>> Hmm if forgot and thus forgot to mention my build script disables some 
>>>>>> stuff:
>>>>>> ./configure --disable-qemu-traditional --disable-stubdom --disable-docs 
>>>>>> --disable-rombios
>>>>>>
>>>>>> Could be that one of those doesn't work anymore.
>>>>> The only interesting one would be --disable-rombios, which does make
>>>>> changes in this area of the build, but everything I changed was inside
>>>>> the xen-dir/ directory so shouldn't interact.
>>>>> ~Andrew
>>>>>
>>>> It indeed seems to be some interaction with --disable-rombios, with just
>>>> a plain ./configure it builds fine.
>>>> Logs when building with --disable-rombios are attached.
>>> Right.  So the build itself works, but the subsequent `make install` fails.
>>>
>>> And to confirm, a build of 8d54a6adf (the parent of my first shim
>>> commit) works entirely fine?
>>>
>>> ~Andrew
>>>
>> Just rechecked, and yes that builds and installs fine (with 
>> --disable-rombios).
> 
> Which base distro are you using?  I'm unable to reproduce any build
> failures locally.
> 
> ~Andrew
> 

Debian 10 / Buster.

--
Sander


Re: [Xen-devel] Xen-unstable staging build broken by pvshim patches.

2019-08-13 Thread Sander Eikelenboom
On 13/08/2019 15:31, Andrew Cooper wrote:
> On 13/08/2019 12:51, Sander Eikelenboom wrote:
>> On 13/08/2019 13:21, Andrew Cooper wrote:
>>> On 09/08/2019 00:28, Sander Eikelenboom wrote:
>>>> On 09/08/2019 00:44, Andrew Cooper wrote:
>>>>> On 08/08/2019 23:34, Sander Eikelenboom wrote:
>>>>>> On 08/08/2019 23:14, Andrew Cooper wrote:
>>>>>>> On 08/08/2019 22:16, Sander Eikelenboom wrote:
>>>>>>>> On 08/08/2019 23:05, Andrew Cooper wrote:
>>>>>>>>> On 08/08/2019 21:59, Sander Eikelenboom wrote:
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> It seems the pvshim patches in xen-unstable staging break the build 
>>>>>>>>>> on my machine.
>>>>>>>>>> I cloned a fresh tree to be sure, haven't checked which of the two 
>>>>>>>>>> commits causes it:
>>>>>>>>>> 060f4eee0fb408b316548775ab921e16b7acd0e0 or 
>>>>>>>>>> 32b1d62887d01f85f0c1d2e0103f69f74e1f6fa3
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sander
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [ -d //usr/local/lib/xen/boot ] || 
>>>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>>>>>> -d -m0755 -p //usr/local/lib/xen/boot
>>>>>>>>>> [ -d //usr/local/lib/debug/usr/local/lib/xen/boot ] || 
>>>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>>>>>> -d -m0755 -p //usr/local/lib/debug/usr/local/lib/xen/boot
>>>>>>>>>> [ ! -e hvmloader/hvmloader ] || 
>>>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>>>>>> -m0644 -p hvmloader/hvmloader //usr/local/lib/xen/boot
>>>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>>>>>> -m0644 -p seabios-dir/out/bios.bin 
>>>>>>>>>> //usr/local/lib/xen/boot/seabios.bin
>>>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>>>>>> -m0644 -p xen-dir/xen-shim //usr/local/lib/xen/boot/xen-shim
>>>>>>>>>> install: cannot stat 'xen-dir/xen-shim': No such file or directory
>>>>>>>>>> make[4]: *** [Makefile:52: install] Error 1
>>>>>>>>>> make[4]: Leaving directory '/usr/src/new/xen-unstable/tools/firmware'
>>>>>>>>>> make[3]: *** [/usr/src/new/xen-unstable/tools/../tools/Rules.mk:237: 
>>>>>>>>>> subdir-install-firmware] Error 2
>>>>>>>>>> make[3]: Leaving directory '/usr/src/new/xen-unstable/tools'
>>>>>>>>>> make[2]: *** [/usr/src/new/xen-unstable/tools/../tools/Rules.mk:232: 
>>>>>>>>>> subdirs-install] Error 2
>>>>>>>>>> make[2]: Leaving directory '/usr/src/new/xen-unstable/tools'
>>>>>>>>>> make[1]: *** [Makefile:73: install] Error 2
>>>>>>>>>> make[1]: Leaving directory '/usr/src/new/xen-unstable/tools'
>>>>>>>>>> make: *** [Makefile:131: install-tools] Error 2
>>>>>>>>> That's weird.
>>>>>>>>>
>>>>>>>>> Do you have the full log?  The real failure was somewhere earlier 
>>>>>>>>> where
>>>>>>>>> xen-shim didn't get started.
>>>>>>>>>
>>>>>>>>> ~Andrew
>>>>>>>>>
>>>>>>>> Hmm if forgot and thus forgot to mention my build script disables some 
>>>>>>>> stuff:
>>>>>>>> ./configure --disable-qemu-traditional --disable-stubdom 
>>>>>>>> --disable-docs --disable-rombios
>>>>>>>>
>>>>>>>> Could be that one of those doesn't work anymore.
>>>>>>> The only interesting one would be --disable-rombios, which does make
>>>>>>> cha

Re: [Xen-devel] Xen-unstable staging build broken by pvshim patches.

2019-08-13 Thread Sander Eikelenboom
On 13/08/2019 23:05, Andrew Cooper wrote:
> On 13/08/2019 22:03, Sander Eikelenboom wrote:
>> On 13/08/2019 15:31, Andrew Cooper wrote:
>>> On 13/08/2019 12:51, Sander Eikelenboom wrote:
>>>> On 13/08/2019 13:21, Andrew Cooper wrote:
>>>>> On 09/08/2019 00:28, Sander Eikelenboom wrote:
>>>>>> On 09/08/2019 00:44, Andrew Cooper wrote:
>>>>>>> On 08/08/2019 23:34, Sander Eikelenboom wrote:
>>>>>>>> On 08/08/2019 23:14, Andrew Cooper wrote:
>>>>>>>>> On 08/08/2019 22:16, Sander Eikelenboom wrote:
>>>>>>>>>> On 08/08/2019 23:05, Andrew Cooper wrote:
>>>>>>>>>>> On 08/08/2019 21:59, Sander Eikelenboom wrote:
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> It seems the pvshim patches in xen-unstable staging break the 
>>>>>>>>>>>> build on my machine.
>>>>>>>>>>>> I cloned a fresh tree to be sure, haven't checked which of the two 
>>>>>>>>>>>> commits causes it:
>>>>>>>>>>>> 060f4eee0fb408b316548775ab921e16b7acd0e0 or 
>>>>>>>>>>>> 32b1d62887d01f85f0c1d2e0103f69f74e1f6fa3
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Sander
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> [ -d //usr/local/lib/xen/boot ] || 
>>>>>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>>>>>>>> -d -m0755 -p //usr/local/lib/xen/boot
>>>>>>>>>>>> [ -d //usr/local/lib/debug/usr/local/lib/xen/boot ] || 
>>>>>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>>>>>>>> -d -m0755 -p //usr/local/lib/debug/usr/local/lib/xen/boot
>>>>>>>>>>>> [ ! -e hvmloader/hvmloader ] || 
>>>>>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>>>>>>>> -m0644 -p hvmloader/hvmloader //usr/local/lib/xen/boot
>>>>>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>>>>>>>> -m0644 -p seabios-dir/out/bios.bin 
>>>>>>>>>>>> //usr/local/lib/xen/boot/seabios.bin
>>>>>>>>>>>> /usr/src/new/xen-unstable/tools/firmware/../../tools/cross-install 
>>>>>>>>>>>> -m0644 -p xen-dir/xen-shim //usr/local/lib/xen/boot/xen-shim
>>>>>>>>>>>> install: cannot stat 'xen-dir/xen-shim': No such file or directory
>>>>>>>>>>>> make[4]: *** [Makefile:52: install] Error 1
>>>>>>>>>>>> make[4]: Leaving directory 
>>>>>>>>>>>> '/usr/src/new/xen-unstable/tools/firmware'
>>>>>>>>>>>> make[3]: *** 
>>>>>>>>>>>> [/usr/src/new/xen-unstable/tools/../tools/Rules.mk:237: 
>>>>>>>>>>>> subdir-install-firmware] Error 2
>>>>>>>>>>>> make[3]: Leaving directory '/usr/src/new/xen-unstable/tools'
>>>>>>>>>>>> make[2]: *** 
>>>>>>>>>>>> [/usr/src/new/xen-unstable/tools/../tools/Rules.mk:232: 
>>>>>>>>>>>> subdirs-install] Error 2
>>>>>>>>>>>> make[2]: Leaving directory '/usr/src/new/xen-unstable/tools'
>>>>>>>>>>>> make[1]: *** [Makefile:73: install] Error 2
>>>>>>>>>>>> make[1]: Leaving directory '/usr/src/new/xen-unstable/tools'
>>>>>>>>>>>> make: *** [Makefile:131: install-tools] Error 2
>>>>>>>>>>> That's weird.
>>>>>>>>>>>
>>>>>>>>>>> Do you have the full log?  The real failure was somewhere earlier 
>>>>>>>>>>> where
>>>>>>>>>>> xen-shim didn't 

Re: [Xen-devel] Xen-unstable staging build broken by pvshim patches.

2019-08-28 Thread Sander Eikelenboom
On 28/08/2019 15:16, Andrew Cooper wrote:
> On 08/08/2019 21:59, Sander Eikelenboom wrote:
>> Hi Andrew,
>>
>> It seems the pvshim patches in xen-unstable staging break the build on my 
>> machine.
>> I cloned a fresh tree to be sure, haven't checked which of the two commits 
>> causes it:
>> 060f4eee0fb408b316548775ab921e16b7acd0e0 or 
>> 32b1d62887d01f85f0c1d2e0103f69f74e1f6fa3
> 
> So this is all horrible.  Anything which causes the linkfarm to
> regenerate breaks the final symlink, which is ultimately why the build
> fails.
> 
> I can't explain why my change altered the visible behaviour.
> 
> Can you build with the following patch and get linkfarm.stamp.{old,new}
> from a failed build please?
> ~Andrew

Sure, both attached.

--
Sander
config/Tools.mk
config/OpenBSD.mk
config/x86_32.mk
config/Paths.mk.in
config/NetBSD.mk
config/Toplevel.mk.in
config/arm64.mk
config/Docs.mk.in
config/arm32.mk
config/Tools.mk.in
config/FreeBSD.mk
config/Linux.mk
config/StdGNU.mk
config/x86_64.mk
config/Toplevel.mk
config/Paths.mk
config/MiniOS.mk
config/Stubdom.mk.in
config/NetBSDRump.mk
config/SunOS.mk
xen/Kconfig.debug
xen/.banner
xen/drivers/passthrough/x86/Makefile
xen/drivers/passthrough/x86/ats.c
xen/drivers/passthrough/x86/iommu.c
xen/drivers/passthrough/amd/iommu_map.c
xen/drivers/passthrough/amd/iommu_intr.c
xen/drivers/passthrough/amd/iommu_cmd.c
xen/drivers/passthrough/amd/iommu_init.c
xen/drivers/passthrough/amd/pci_amd_iommu.c
xen/drivers/passthrough/amd/iommu_acpi.c
xen/drivers/passthrough/amd/iommu_guest.c
xen/drivers/passthrough/amd/Makefile
xen/drivers/passthrough/amd/iommu_detect.c
xen/drivers/passthrough/io.c
xen/drivers/passthrough/ats.h
xen/drivers/passthrough/device_tree.c
xen/drivers/passthrough/Makefile
xen/drivers/passthrough/pci.c
xen/drivers/passthrough/arm/Makefile
xen/drivers/passthrough/arm/smmu.c
xen/drivers/passthrough/arm/iommu.c
xen/drivers/passthrough/Kconfig
xen/drivers/passthrough/vtd/extern.h
xen/drivers/passthrough/vtd/intremap.c
xen/drivers/passthrough/vtd/qinval.c
xen/drivers/passthrough/vtd/x86/Makefile
xen/drivers/passthrough/vtd/x86/ats.c
xen/drivers/passthrough/vtd/x86/hvm.c
xen/drivers/passthrough/vtd/x86/vtd.c
xen/drivers/passthrough/vtd/dmar.c
xen/drivers/passthrough/vtd/utils.c
xen/drivers/passthrough/vtd/dmar.h
xen/drivers/passthrough/vtd/Makefile
xen/drivers/passthrough/vtd/vtd.h
xen/drivers/passthrough/vtd/quirks.c
xen/drivers/passthrough/vtd/iommu.c
xen/drivers/passthrough/vtd/iommu.h
xen/drivers/passthrough/iommu.c
xen/drivers/pci/Makefile
xen/drivers/pci/pci.c
xen/drivers/pci/Kconfig
xen/drivers/Makefile
xen/drivers/vpci/header.c
xen/drivers/vpci/vpci.c
xen/drivers/vpci/msix.c
xen/drivers/vpci/Makefile
xen/drivers/vpci/msi.c
xen/drivers/video/lfb.h
xen/drivers/video/modelines.h
xen/drivers/video/Makefile
xen/drivers/video/font_8x14.c
xen/drivers/video/vesa.c
xen/drivers/video/font.h
xen/drivers/video/Kconfig
xen/drivers/video/vga.c
xen/drivers/video/font_8x8.c
xen/drivers/video/font_8x16.c
xen/drivers/video/lfb.c
xen/drivers/char/exynos4210-uart.c
xen/drivers/char/ns16550.c
xen/drivers/char/console.c
xen/drivers/char/consoled.c
xen/drivers/char/pl011.c
xen/drivers/char/arm-uart.c
xen/drivers/char/mvebu-uart.c
xen/drivers/char/scif-uart.c
xen/drivers/char/Makefile
xen/drivers/char/cadence-uart.c
xen/drivers/char/meson-uart.c
xen/drivers/char/Kconfig
xen/drivers/char/omap-uart.c
xen/drivers/char/xen_pv_console.c
xen/drivers/char/serial.c
xen/drivers/char/ehci-dbgp.c
xen/drivers/Kconfig
xen/drivers/acpi/tables/tbxface.c
xen/drivers/acpi/tables/tbfadt.c
xen/drivers/acpi/tables/tbinstal.c
xen/drivers/acpi/tables/Makefile
xen/drivers/acpi/tables/tbxfroot.c
xen/drivers/acpi/tables/tbutils.c
xen/drivers/acpi/pmstat.c
xen/drivers/acpi/reboot.c
xen/drivers/acpi/osl.c
xen/drivers/acpi/tables.c
xen/drivers/acpi/apei/hest.c
xen/drivers/acpi/apei/erst.c
xen/drivers/acpi/apei/apei-internal.h
xen/drivers/acpi/apei/apei-base.c
xen/drivers/acpi/apei/Makefile
xen/drivers/acpi/apei/apei-io.c
xen/drivers/acpi/Makefile
xen/drivers/acpi/hwregs.c
xen/drivers/acpi/numa.c
xen/drivers/acpi/Kconfig
xen/drivers/acpi/utilities/utmisc.c
xen/drivers/acpi/utilities/utglobal.c
xen/drivers/acpi/utilities/Makefile
xen/drivers/cpufreq/cpufreq.c
xen/drivers/cpufreq/cpufreq_ondemand.c
xen/drivers/cpufreq/Makefile
xen/drivers/cpufreq/cpufreq_misc_governors.c
xen/drivers/cpufreq/Kconfig
xen/drivers/cpufreq/utility.c
xen/Rules.mk
xen/xen
xen/xen.efi.map
xen/tools/xen.flf
xen/tools/symbols
xen/tools/compat-build-source.py
xen/tools/compat-build-header.py
xen/tools/kconfig/util.c
xen/tools/kconfig/merge_config.sh
xen/tools/kconfig/nconf.c
xen/tools/kconfig/conf
xen/tools/kconfig/Makefile.kconfig
xen/tools/kconfig/menu.c
xen/tools/kconfig/lxdialog/util.c
xen/tools/kconfig/lxdialog/check-lxdialog.sh
xen/tools/kconfig/lxdialog/menubox.c
xen/tools/kconfig/lxdialog/.gitignore
xen/tools/kconfig/lxdialog/dialog.h
xen/tool

[Xen-devel] [Xen-unstable] boot crash while loading AMD microcode due to commit "microcode/amd: fix memory leak"

2019-08-29 Thread Sander Eikelenboom
L.S.,

While testing xen-unstable, my AMD system crashes during early boot while 
loading microcode with an "Early fatal page fault".
Reverting commit de45e3ff37bb1602796054afabfa626ea5661c45 "microcode/amd: fix 
memory leak" fixes the boot issue.

At present I don't have my serial console stuff at hand, but if needed I can 
send the stacktrace tomorrow.

--
Sander


Re: [Xen-devel] Xen-unstable PVHdom0: Assertion 'IS_ALIGNED(dfn_x(dfn), (1ul << page_order))' failed at iommu.c:324

2019-02-08 Thread Sander Eikelenboom
On 08/02/2019 16:10, Roger Pau Monné wrote:
> On Fri, Jan 25, 2019 at 07:44:40PM +0100, Sander Eikelenboom wrote:
>> On 25/01/2019 15:38, Roger Pau Monné wrote:
>>> On Thu, Jan 24, 2019 at 01:04:31PM +0100, Roger Pau Monné wrote:
>>> Sorry, fixing that error took longer than expected, and requires
>>> modifying quite a lot of code, so I'm not sure whether it's something
>>> to consider for 4.12, I have to think about it.
>>
>> I understand, especially since PVH dom0 is marked as experimental.
>>
>>> In the meantime, can you please test the following branch:
>>>
>>> git://xenbits.xen.org/people/royger/xen.git iommu-fixes-v2
>>>
>>> I've been able to successfully create a PVH guest from a PVH dom0 on
>>> AMD hardware using this branch.
>>
>> On the other hand, with a quick test I can confirm that booting a PVH guest 
>> from a PVH dom0 now works for me as well! 
>> (and booting this build as PV dom0, with my normal PVH/HVM mix of guests 
>> still works, no regressions for me so far)


 
> Sorry for bothering you again, but could you give the following branch
> a test:

No problem, happy to keep testing until it works and is in good enough shape to 
get committed.

And fortunately you asked, because unfortunately it doesn't boot as PVH dom0; 
the serial log is attached.

--
Sander

> 
> git://xenbits.xen.org/people/royger/xen.git fixes-4.12-v2
> 
> You should hopefully be able to boot a pvh dom0 and create guests from
> it on your AMD hardware.
> 
> Thanks, Roger.
> 

 __  ___  __ ___  
 \ \/ /___ _ __   | || |  / |___ \  / _ \_ __ ___ 
  \  // _ \ '_ \  | || |_ | | __) || | | |__| '__/ __|
  /  \  __/ | | | |__   _|| |/ __/ | |_| |__| | | (__ 
 /_/\_\___|_| |_||_|(_)_|_(_)___/   |_|  \___|
  
(XEN) [001a2b740615] Xen version 4.12.0-rc (r...@dyndns.org) (gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516) debug=y  Fri Feb  8 16:55:33 CET 2019
(XEN) [001a32e440c5] Latest ChangeSet: Fri Feb 8 15:01:58 2019 +0100 git:ed104d12b9-dirty
(XEN) [001a379a4482] Bootloader: GRUB 2.02~beta3-5+deb9u1
(XEN) [001a3aba1018] Command line: dom0_mem=2048M,max:2048M loglvl=all console_timestamps=datems vga=gfx-1280x1024x32 no-cpuidle com1=38400,8n1 console=vga,com1 ivrs_ioapic[6]=00:14.0 iommu=on,verbose,debug conring_size=128k ucode=scan sched=credit2 gnttab_max_frames=64 reboot=k dom0=pvh
(XEN) [001a494e9056] Xen image load base address: 0
(XEN) [001a4c222d7d] Video information:
(XEN) [001a4e5d7cea]  VGA is graphics mode 1280x1024, 32 bpp
(XEN) [001a51a358a5]  VBE/DDC methods: V2; EDID transfer time: 1 seconds
(XEN) [001a55818d76] Disc information:
(XEN) [001a57b02a5e]  Found 4 MBR signatures
(XEN) [001a5a2aeb63]  Found 4 EDD information structures
(XEN) [001a5d3dfc7d] Xen-e820 RAM map:
(XEN) [001a5f6ca21e]   - 00096400 (usable)
(XEN) [001a62feb77e]  00096400 - 000a (reserved)
(XEN) [001a66aa3075]  000e4000 - 0010 (reserved)
(XEN) [001a6a55a48b]  0010 - c7f9 (usable)
(XEN) [001a6de7b95b]  c7f9 - c7f9e000 (ACPI data)
(XEN) [001a719fd3fb]  c7f9e000 - c7fe (ACPI NVS)
(XEN) [001a754b3972]  c7fe - c800 (reserved)
(XEN) [001a78f6c798]  ffe0 - 0001 (reserved)
(XEN) [001a7ca22738]  0001 - 00053800 (usable)
(XEN) [001a85ab560d] New Xen image base address: 0xc780
(XEN) [001a88e47dba] ACPI: RSDP 000FB100, 0014 (r0 ACPIAM)
(XEN) [001a8c11072b] ACPI: RSDT C7F9, 0048 (r1 MSIOEMSLIC  20100913 MSFT   97)
(XEN) [001a90d3ab62] ACPI: FACP C7F90200, 0084 (r1 7640MS A7640100 20100913 MSFT   97)
(XEN) [001a95967bdd] ACPI: DSDT C7F905E0, 9427 (r1  A7640 A7640100  100 INTL 20051117)
(XEN) [001a9a592c9b] ACPI: FACS C7F9E000, 0040
(XEN) [001a9ced544b] ACPI: APIC C7F90390, 0088 (r1 7640MS A7640100 20100913 MSFT   97)
(XEN) [001aa1b01572] ACPI: MCFG C7F90420, 003C (r1 7640MS OEMMCFG  20100913 MSFT   97)
(XEN) [001aa672eafb] ACPI: SLIC C7F90460, 0176 (r1 MSIOEMSLIC  20100913 MSFT   97)
(XEN) [001aab3591de] ACPI: OEMB C7F9E040, 0072 (r1 7640MS A7640100 20100913 MSFT   97)
(XEN) [001aaff86acb] ACPI: SRAT C7F9A5E0, 0108 (r3 AMDFAM_F_102 AMD 1)
(XEN) [001ab4bb246e] ACPI: HPET C7F9A6F0, 0038 (r1 7640MS OEMHPET  20100913 MSFT   97)
(XEN) [001ab97dea6e] ACPI: IVRS C7F9A730, 0108 (r1  AMD RD890S   202031 AMD 0)
(XEN) [001abe40938e] ACPI: SSDT C7F9A840, 0DA4 (r1 A M I  POWERNOW1 AMD 1)
(XEN) [001ac303719a] System RAM: 20479MB (20970648kB)
(XE

Re: [Xen-devel] Xen-unstable PVHdom0: Assertion 'IS_ALIGNED(dfn_x(dfn), (1ul << page_order))' failed at iommu.c:324

2019-02-08 Thread Sander Eikelenboom
On 08/02/2019 17:47, Roger Pau Monné wrote:
> On Fri, Feb 08, 2019 at 05:15:22PM +0100, Sander Eikelenboom wrote:
>> On 08/02/2019 16:10, Roger Pau Monné wrote:
>>> On Fri, Jan 25, 2019 at 07:44:40PM +0100, Sander Eikelenboom wrote:
>>>> On 25/01/2019 15:38, Roger Pau Monné wrote:
>>>>> On Thu, Jan 24, 2019 at 01:04:31PM +0100, Roger Pau Monné wrote:
>>>>> Sorry, fixing that error took longer than expected, and requires
>>>>> modifying quite a lot of code, so I'm not sure whether it's something
>>>>> to consider for 4.12, I have to think about it.
>>>>
>>>> I understand, especially since PVH dom0 is marked as experimental.
>>>>
>>>>> In the meantime, can you please test the following branch:
>>>>>
>>>>> git://xenbits.xen.org/people/royger/xen.git iommu-fixes-v2
>>>>>
>>>>> I've been able to successfully create a PVH guest from a PVH dom0 on
>>>>> AMD hardware using this branch.
>>>>
>>>> On the other hand, with a quick test I can confirm that booting a PVH 
>>>> guest from a PVH dom0 now works for me as well ! 
>>>> (and booting this build as PV dom0, with my normal PVH/HVM mix of guests 
>>>> still works, no regressions for me so far)
>>
>>
>>  
>>> Sorry for bothering you again, but could you give the following branch
>>> a test:
>>
>> No problem, happy to keep testing until it works and is in good enough shape 
>> to get committed.
>>
>> And fortunately you asked, because unfortunately it doesn't boot as pvhdom0, 
>> serial log is attached.
> 
> Thanks!
> 
> Can you try with the following debug patch on top? This should print a
> message before hitting the assert, hopefully giving us more
> information.
> 
> Roger.

Sure. I was also missing sync_console on the Xen cmdline;
serial log attached.

--
Sander

> ---8<---
> diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
> index 5ad7a36269..bf647e7d26 100644
> --- a/xen/arch/x86/mm/p2m-pt.c
> +++ b/xen/arch/x86/mm/p2m-pt.c
> @@ -648,8 +648,13 @@ p2m_pt_set_entry(struct p2m_domain *p2m, gfn_t gfn_, 
> mfn_t mfn,
> !rangeset_overlaps_range(mmio_ro_ranges, mfn_x(mfn),
>  mfn_x(mfn) + PFN_DOWN(MB(2))));
>  else
> +{
> +if ( !mfn_valid(mfn) && (!mfn_eq(mfn, INVALID_MFN) ||
> + !p2m_allows_invalid_mfn(p2mt)) )
> +printk("mfn: %#lx type: %d\n", mfn_x(mfn), p2mt);
>  ASSERT(mfn_valid(mfn) || (mfn_eq(mfn, INVALID_MFN) &&
>p2m_allows_invalid_mfn(p2mt)));
> +}
>  l2e_content = mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt)
>  ? p2m_l2e_from_pfn(mfn_x(mfn),
> p2m_type_to_flags(p2m, p2mt, mfn, 1))
> 

 __  ___  __ ___  
 \ \/ /___ _ __   | || |  / |___ \  / _ \_ __ ___ 
  \  // _ \ '_ \  | || |_ | | __) || | | |__| '__/ __|
  /  \  __/ | | | |__   _|| |/ __/ | |_| |__| | | (__ 
 /_/\_\___|_| |_||_|(_)_|_(_)___/   |_|  \___|
  
(XEN) [001a849e1840] Xen version 4.12.0-rc (r...@dyndns.org) (gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516) debug=y  Fri Feb  8 20:22:06 CET 2019
(XEN) [001a8c0e5d75] Latest ChangeSet: Fri Feb 8 15:01:58 2019 +0100 git:ed104d12b9-dirty
(XEN) [001a90c468ad] Console output is synchronous.
(XEN) [001a939814a3] Bootloader: GRUB 2.02~beta3-5+deb9u1
(XEN) [001a96b7e8ad] Command line: dom0_mem=2048M,max:2048M loglvl=all console_timestamps=datems vga=gfx-1280x1024x32 no-cpuidle com1=38400,8n1 console=vga,com1 ivrs_ioapic[6]=00:14.0 iommu=on,verbose,debug conring_size=128k ucode=scan sched=credit2 gnttab_max_frames=64 reboot=k sync_console dom0=pvh
(XEN) [001aa5f14b5d] Xen image load base address: 0
(XEN) [001aa8c4f086] Video information:
(XEN) [001aab00504a]  VGA is graphics mode 1280x1024, 32 bpp
(XEN) [001aae463298]  VBE/DDC methods: V2; EDID transfer time: 1 seconds
(XEN) [001ab2245c92] Disc information:
(XEN) [001ab45308a0]  Found 4 MBR signatures
(XEN) [001ab6cdba0d]  Found 4 EDD information structures
(XEN) [001ab9e0e59b] Xen-e820 RAM map:
(XEN) [001abc0f7578]   - 00096400 (usable)
(XEN) [001abfa17a92]  00096400 - 000a (reserved)
(XEN) [001ac34cffae]  000e4000 - 0010 (reserved)
(XEN) [001ac6f865be]  0010 - c7f9 (usable)
(XEN) [001aca8a6ac8]  

[Xen-devel] Linux 5.0 regression: BUG: unable to handle kernel paging request at ffff888023e26778

2019-02-09 Thread Sander Eikelenboom
L.S.,


While testing a Linux 5.0-rc5-ish kernel (pulled yesterday) with some
additional patches for other, already-reported issues, I came across the issue
below, which I haven't seen with 4.20.x.

I haven't got a reproducer, so it might be hard to hit it again.
The system is AMD and this is from the host kernel running under
the Xen hypervisor, in case that matters.

--

Sander


[17035.016433] BUG: unable to handle kernel paging request at 888023e26778
[17035.025887] #PF error: [PROT] [WRITE]
[17035.035146] PGD 2a2a067 P4D 2a2a067 PUD 2a2b067 PMD 7fe01067 PTE 
801023e26065
[17035.044371] Oops: 0003 [#1] SMP NOPTI
[17035.053720] CPU: 3 PID: 28310 Comm: apt-get Not tainted 
5.0.0-rc5-20190208-thp-net-florian-rtl8169-eric-doflr+ #1
[17035.063440] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 
09/13/2010
[17035.072635] RIP: e030:move_page_tables+0x7c1/0xae0
[17035.081585] Code: ce 00 48 8b 03 31 ff 48 89 44 24 20 e8 9e 72 e4 ff 66 90 
48 89 c6 48 89 df e8 8b 89 e4 ff 66 90 48 8b 44 24 20 b9 0c 00 00 00 <48> 89 45 
00 41 f6 46 52 40 0f 85 3f 02 00 00 49 8b 7e 40 45 31 c0
[17035.100225] RSP: e02b:c9f2bd40 EFLAGS: 00010282
[17035.109208] RAX: 000475e42067 RBX: 888023e267e0 RCX: 000c
[17035.118332] RDX:  RSI:  RDI: 0201
[17035.127378] RBP: 888023e26778 R08:  R09: 00051c1d9000
[17035.136310] R10: deadbeefdeadf00d R11: 88807fc17000 R12: 7fc59fa0
[17035.145433] R13: ea8f89a8 R14: 88801c2286c0 R15: 7fc59f80
[17035.154171] FS:  7fc5a5591100() GS:88807d4c() 
knlGS:
[17035.162730] CS:  e030 DS:  ES:  CR0: 80050033
[17035.171180] CR2: 888023e26778 CR3: 1c3f6000 CR4: 0660
[17035.179545] Call Trace:
[17035.187736]  move_vma.isra.3+0xd1/0x2d0
[17035.195837]  __se_sys_mremap+0x3c6/0x5b0
[17035.203986]  do_syscall_64+0x49/0x100
[17035.212109]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[17035.219971] RIP: 0033:0x7fc5a453527a
[17035.227558] Code: 73 01 c3 48 8b 0d 1e fc 2a 00 f7 d8 64 89 01 48 83 c8 ff 
c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 19 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d ee fb 2a 00 f7 d8 64 89 01 48
[17035.243255] RSP: 002b:7ffda22d96f8 EFLAGS: 0246 ORIG_RAX: 
0019
[17035.251121] RAX: ffda RBX: 557d40923a30 RCX: 7fc5a453527a
[17035.258986] RDX: 01a0 RSI: 0190 RDI: 7fc59f7ff000
[17035.267127] RBP: 01a0 R08: 0020 R09: 0040
[17035.275259] R10: 0001 R11: 0246 R12: 7fc59f7ff060
[17035.282681] R13: 7fc59f7ff000 R14: 557d40923a30 R15: 557d40829aa0
[17035.290322] Modules linked in:
[17035.297875] CR2: 888023e26778
[17035.305405] ---[ end trace 6ff49f09286816b6 ]---
[17035.313131] RIP: e030:move_page_tables+0x7c1/0xae0
[17035.320326] Code: ce 00 48 8b 03 31 ff 48 89 44 24 20 e8 9e 72 e4 ff 66 90 
48 89 c6 48 89 df e8 8b 89 e4 ff 66 90 48 8b 44 24 20 b9 0c 00 00 00 <48> 89 45 
00 41 f6 46 52 40 0f 85 3f 02 00 00 49 8b 7e 40 45 31 c0
[17035.334851] RSP: e02b:c9f2bd40 EFLAGS: 00010282
[17035.341727] RAX: 000475e42067 RBX: 888023e267e0 RCX: 000c
[17035.348838] RDX:  RSI:  RDI: 0201
[17035.356000] RBP: 888023e26778 R08:  R09: 00051c1d9000
[17035.363623] R10: deadbeefdeadf00d R11: 88807fc17000 R12: 7fc59fa0
[17035.371454] R13: ea8f89a8 R14: 88801c2286c0 R15: 7fc59f80
[17035.378958] FS:  7fc5a5591100() GS:88807d4c() 
knlGS:
[17035.386585] CS:  e030 DS:  ES:  CR0: 80050033
[17035.393797] CR2: 888023e26778 CR3: 1c3f6000 CR4: 0660




___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Linux 5.0 regression: BUG: unable to handle kernel paging request at ffff888023e26778 RIP: e030:move_page_tables+0x7c1/0xae0

2019-02-09 Thread Sander Eikelenboom
On 09/02/2019 19:48, Juergen Gross wrote:
> On 09/02/2019 19:45, Sander Eikelenboom wrote:
>> On 09/02/2019 09:26, Sander Eikelenboom wrote:
>>> L.S.,
>>>
>>>
>>> While testing a Linux 5.0-rc5-ish kernel (pull of yesterday) with some 
>>> additional patches for
>>> already reported other issues i came across the issue below which i haven't 
>>> seen with 4.20.x
>>>
>>> I haven't got a reproducer so i might be hard to hit it again, 
>>> system is AMD and this is from the host kernel running under
>>> the Xen hypervisor might it matter.
>>
>>> --
>>>
>>> Sander
>>
>> Hi Boris / Juergen,
>>
>> The commit causing this is:
>> 2c91bd4a4e2e530582d6fd643ea7b86b27907151 mm: speed up mremap by 20x on large 
>> regions
>>
>> Since it seems there haven't been any other reports about this .. 
>> could it be this doesn't specifically work well with a Xen PVH dom0 ?
> 
> PVH? Not PV?

Ah sorry, indeed PV !

> 
> Juergen
> 


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Xen-unstable PVHdom0: Assertion 'IS_ALIGNED(dfn_x(dfn), (1ul << page_order))' failed at iommu.c:324

2019-02-11 Thread Sander Eikelenboom
On 11/02/2019 14:16, Roger Pau Monné wrote:
> On Fri, Feb 08, 2019 at 08:36:54PM +0100, Sander Eikelenboom wrote:
>> On 08/02/2019 17:47, Roger Pau Monné wrote:
>>> On Fri, Feb 08, 2019 at 05:15:22PM +0100, Sander Eikelenboom wrote:
>>>> On 08/02/2019 16:10, Roger Pau Monné wrote:
>>>>> On Fri, Jan 25, 2019 at 07:44:40PM +0100, Sander Eikelenboom wrote:
>>>>>> On 25/01/2019 15:38, Roger Pau Monné wrote:
>>>>>>> On Thu, Jan 24, 2019 at 01:04:31PM +0100, Roger Pau Monné wrote:
>>>>>>> Sorry, fixing that error took longer than expected, and requires
>>>>>>> modifying quite a lot of code, so I'm not sure whether it's something
>>>>>>> to consider for 4.12, I have to think about it.
>>>>>>
>>>>>> I understand, especially since PVH dom0 is marked as experimental.
>>>>>>
>>>>>>> In the meantime, can you please test the following branch:
>>>>>>>
>>>>>>> git://xenbits.xen.org/people/royger/xen.git iommu-fixes-v2
>>>>>>>
>>>>>>> I've been able to successfully create a PVH guest from a PVH dom0 on
>>>>>>> AMD hardware using this branch.
>>>>>>
>>>>>> On the other hand, with a quick test I can confirm that booting a PVH 
>>>>>> guest from a PVH dom0 now works for me as well ! 
>>>>>> (and booting this build as PV dom0, with my normal PVH/HVM mix of guests 
>>>>>> still works, no regressions for me so far)
>>>>
>>>>
>>>>  
>>>>> Sorry for bothering you again, but could you give the following branch
>>>>> a test:
>>>>
>>>> No problem, happy to keep testing until it works and is in good enough 
>>>> shape to get committed.
>>>>
>>>> And fortunately you asked, because unfortunately it doesn't boot as 
>>>> pvhdom0, serial log is attached.
>>>
>>> Thanks!
>>>
>>> Can you try with the following debug patch on top? This should print a
>>> message before hitting the assert, hopefully giving us more
>>> information.
>>>
>>> Roger.
>>
>> Sure, I was also missing a sync_console on the Xen cmdline,
>> serial log attached.
> 
> Thanks, I've got another branch for you to try:
> 
> git://xenbits.xen.org/people/royger/xen.git fixes-4.12-v2.1
> 
> Roger.
> 

Hi Roger,

That boots as PVH dom0, and PVH guests also boot and work!

Hmm, another thing that caught my eye: Xen relinquishing the console to
the Linux kernel in the early kernel boot phase doesn't seem to work.
The screen stays black from the moment Xen relinquishes the console
until a modesetting graphics card driver gets loaded later on, so you
miss everything from the kernel's early boot phase.
Probably because it relies on something "legacy" which PVH doesn't
provide?

--
Sander

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH for-4.12 v2 0/7] pvh/dom0/shadow/amd fixes

2019-02-11 Thread Sander Eikelenboom
On 11/02/2019 18:46, Roger Pau Monne wrote:
> Hello,
> 
> The following series contains fixes that should be considered for 4.12.
> 
> I'm not sure whether patches 5, 6 and 7 should be aimed at 4.12, they
> contain changes to the p2m code that could affect HVM guests. Note that
> without those changes a PVH dom0 running on AMD hardware will be unable
> to create guests. Overall the patches are a nice cleanup to the handling
> of p2m_ioreq_server and p2m_map_foreign types.
> 
> The series can also be found at:
> 
> git://xenbits.xen.org/people/royger/xen.git fixes-4.12-v2.1

I have tested this on AMD hardware, both as PVH dom0 and PV dom0
(running both PVH and HVM guests). So FWIW:

Tested-by: Sander Eikelenboom 

> Roger Pau Monne (7):
>   dom0/pvh: align allocation and mapping order to start address
>   amd/npt/shadow: replace assert that prevents creating 2M/1G MMIO
> entries
>   x86/pvh: reorder PVH dom0 iommu initialization
>   pvh/dom0: warn when dom0_mem is not set
>   x86/mm: split p2m ioreq server pages special handling into helper
>   x86/mm: handle foreign mappings in p2m_entry_modify
>   npt/shadow: allow getting foreign page table entries
> 
>  xen/arch/x86/dom0_build.c   |  10 ++
>  xen/arch/x86/hvm/dom0_build.c   |  37 +---
>  xen/arch/x86/mm/hap/hap.c   |   4 +
>  xen/arch/x86/mm/p2m-ept.c   | 137 ++--
>  xen/arch/x86/mm/p2m-pt.c|  54 ---
>  xen/arch/x86/mm/shadow/common.c |   4 +
>  xen/drivers/passthrough/x86/iommu.c |   7 +-
>  xen/include/asm-x86/p2m.h   |  45 +
>  8 files changed, 135 insertions(+), 163 deletions(-)
> 


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] xen-stable-4.12 branch, few commits short and hasn't got the RELEASE-4.12.0 tag.

2019-04-15 Thread Sander Eikelenboom
Hi Juergen,

Although xen-stable-4.12 has been released for some time now (thanks for the
timely release!), the release tag "RELEASE-4.12.0" is still only available in
the "staging-4.12" branch of the Xen git tree.
The "stable-4.12" branch is still a few (release) commits short, which seems a
bit awkward.

--
Sander

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] Xen-unstable + linux 5.4.0-rc8: RIP: 0010:xennet_poll+0x35f/0xae0

2019-11-25 Thread Sander Eikelenboom
L.S.,

Just now the kernel of one of my PVH VMs crashed with the splat below
(I haven't seen this before, so it could be something that happens sporadically).

Any ideas ?

--
Sander



database databaselogin:  login: [184503.428811] general protection fault:  
[#1] SMP NOPTI
[184503.428887] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
5.4.0-rc8-20191123-doflr-mac80211debug+ #1
[184503.428932] RIP: 0010:xennet_poll+0x35f/0xae0
[184503.428955] Code: ba 00 01 00 00 48 8b 8d c0 00 00 00 0f b7 b4 24 92 00 00 
00 48 8b 5c 24 78 3d 00 01 00 00 0f 4e d0 89 55 28 8b 95 bc 00 00 00 <89> 74 11 
3c 48 8b 8d c0 00 00 00 8b 95 bc 00 00 00 89 44 11 38 89
[184503.429027] RSP: 0018:c9003e10 EFLAGS: 00010287
[184503.429049] RAX: 0042 RBX: c9003e88 RCX: 
fffe88800b865a80
[184503.429077] RDX: 0140 RSI:  RDI: 
888005504150
[184503.429107] RBP: 8880f0dfb800 R08: 888103417f00 R09: 
888103010848
[184503.429137] R10:  R11: 82c6e2e8 R12: 
8880f0dfb800
[184503.429167] R13: 0001 R14: c9003ea0 R15: 
8880055022e8
[184503.429202] FS:  () GS:888103c0() 
knlGS:
[184503.429231] CS:  0010 DS:  ES:  CR0: 80050033
[184503.429255] CR2: 7faab4774000 CR3: 000101652000 CR4: 
06f0
[184503.429283] Call Trace:
[184503.429296]  
[184503.429311]  net_rx_action+0x136/0x380
[184503.429331]  __do_softirq+0xda/0x2e0
[184503.429349]  irq_exit+0x9e/0xa0
[184503.429368]  xen_evtchn_do_upcall+0x27/0x40
[184503.429391]  xen_hvm_callback_vector+0xf/0x20
[184503.429413]  
[184503.429425] RIP: 0010:native_safe_halt+0xe/0x10
[184503.429445] Code: 48 8b 04 25 c0 6b 01 00 f0 80 48 02 20 48 8b 00 a8 08 75 
c4 eb 80 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d 64 78 60 00 fb f4  90 e9 
07 00 00 00 0f 00 2d 54 78 60 00 f4 c3 90 90 41 55 41 54
[184503.429514] RSP: 0018:82c03e90 EFLAGS: 0246 ORIG_RAX: 
ff0c
[184503.429544] RAX: 0001a548 RBX:  RCX: 
0001
[184503.429573] RDX: 067e991e RSI:  RDI: 
0086
[184503.429601] RBP:  R08: 314d4ab0 R09: 
a7ced08b81fb
[184503.429629] R10: 4400 R11:  R12: 

[184503.429658] R13:  R14: 888103ff0480 R15: 

[184503.429690]  default_idle+0x17/0x140
[184503.429843]  do_idle+0x1f9/0x220
[184503.429864]  cpu_startup_entry+0x14/0x20
[184503.429884]  start_kernel+0x4b6/0x4d8
[184503.429904]  secondary_startup_64+0xa4/0xb0
[184503.429934] Modules linked in:
[184503.430007] ---[ end trace 536ad19f63e35723 ]---
[184503.430032] RIP: 0010:xennet_poll+0x35f/0xae0
[184503.430055] Code: ba 00 01 00 00 48 8b 8d c0 00 00 00 0f b7 b4 24 92 00 00 
00 48 8b 5c 24 78 3d 00 01 00 00 0f 4e d0 89 55 28 8b 95 bc 00 00 00 <89> 74 11 
3c 48 8b 8d c0 00 00 00 8b 95 bc 00 00 00 89 44 11 38 89
[184503.430138] RSP: 0018:c9003e10 EFLAGS: 00010287
[184503.430159] RAX: 0042 RBX: c9003e88 RCX: 
fffe88800b865a80
[184503.430190] RDX: 0140 RSI:  RDI: 
888005504150
[184503.430236] RBP: 8880f0dfb800 R08: 888103417f00 R09: 
888103010848
[184503.430266] R10:  R11: 82c6e2e8 R12: 
8880f0dfb800
[184503.430296] R13: 0001 R14: c9003ea0 R15: 
8880055022e8
[184503.430331] FS:  () GS:888103c0() 
knlGS:
[184503.430361] CS:  0010 DS:  ES:  CR0: 80050033
[184503.430384] CR2: 7faab4774000 CR3: 000101652000 CR4: 
06f0
[184503.430422] Kernel panic - not syncing: Fatal exception in interrupt
[184503.430928] Kernel Offset: disabled

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Xen-unstable + linux 5.4.0-rc8: RIP: 0010:xennet_poll+0x35f/0xae0

2019-11-25 Thread Sander Eikelenboom
On 25/11/2019 15:42, Jan Beulich wrote:
> On 25.11.2019 15:21, Sander Eikelenboom wrote:
>> L.S.,
>>
>> At present one of my PVH VM's kernel crashed with the splat below
>> (haven't seen it before, so could be something that happens sporadically).
>>
>> Any ideas ?
>>
>> --
>> Sander
>>
>>
>>
>> database databaselogin:  login: [184503.428811] general protection fault: 
>>  [#1] SMP NOPTI
>> [184503.428887] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
>> 5.4.0-rc8-20191123-doflr-mac80211debug+ #1
>> [184503.428932] RIP: 0010:xennet_poll+0x35f/0xae0
>> [184503.428955] Code: ba 00 01 00 00 48 8b 8d c0 00 00 00 0f b7 b4 24 92 00 
>> 00 00 48 8b 5c 24 78 3d 00 01 00 00 0f 4e d0 89 55 28 8b 95 bc 00 00 00 <89> 
>> 74 11 3c 48 8b 8d c0 00 00 00 8b 95 bc 00 00 00 89 44 11 38 89
> 
> The insn here being "mov %esi,0x3c(%rcx,%rdx,1)" ...
> 
>> [184503.429027] RSP: 0018:c9003e10 EFLAGS: 00010287
>> [184503.429049] RAX: 0042 RBX: c9003e88 RCX: 
>> fffe88800b865a80
> 
> ... I notice corruption to bit 48 of RCX here. This can be a result of
> memory corruption, but prior instances of such that I had to look into
> were bit flips in the CPU instead. Is this a server or desktop class
> CPU?
> 
> Jan
> 

Desktop (AMD Phenom II X6).

I have had some other kernel splats in kernel networking code over the last
months that didn't make sense to the maintainers and that were written off as
"some kind of corruption".

So I will schedule a run of memtest86 to try to rule out memory going bad.
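
As an aside, a quick way to double-check such a single-bit-flip suspicion is
to XOR the suspect register value against the expected one. Here is a minimal
C sketch; the "expected" value is my assumption of what RCX should have been,
i.e. the same address with the usual 0xffff... kernel prefix:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Report whether two 64-bit values differ in exactly one bit, and if
 * so which bit -- handy when eyeballing register dumps for suspected
 * single-bit memory/CPU corruption. */
static void check_bitflip(uint64_t a, uint64_t b)
{
    uint64_t diff = a ^ b;

    if (diff && !(diff & (diff - 1)))
        printf("%#" PRIx64 " vs %#" PRIx64 ": single flip at bit %d\n",
               a, b, __builtin_ctzll(diff));
    else
        printf("%#" PRIx64 " vs %#" PRIx64 ": %d differing bit(s)\n",
               a, b, __builtin_popcountll(diff));
}

int main(void)
{
    /* RCX from the splat vs. the plausible intended kernel address */
    check_bitflip(0xfffe88800b865a80ULL, 0xffff88800b865a80ULL);
    return 0;
}

which indeed reports a single flip at bit 48.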

--
Sander

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Xen-unstable + linux 5.4.0-rc8: RIP: 0010:xennet_poll+0x35f/0xae0

2019-11-26 Thread Sander Eikelenboom
On 25/11/2019 15:42, Jan Beulich wrote:
> On 25.11.2019 15:21, Sander Eikelenboom wrote:
>> L.S.,
>>
>> At present one of my PVH VM's kernel crashed with the splat below
>> (haven't seen it before, so could be something that happens sporadically).
>>
>> Any ideas ?
>>
>> --
>> Sander
>>
>>
>>
>> database databaselogin:  login: [184503.428811] general protection fault: 
>>  [#1] SMP NOPTI
>> [184503.428887] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
>> 5.4.0-rc8-20191123-doflr-mac80211debug+ #1
>> [184503.428932] RIP: 0010:xennet_poll+0x35f/0xae0
>> [184503.428955] Code: ba 00 01 00 00 48 8b 8d c0 00 00 00 0f b7 b4 24 92 00 
>> 00 00 48 8b 5c 24 78 3d 00 01 00 00 0f 4e d0 89 55 28 8b 95 bc 00 00 00 <89> 
>> 74 11 3c 48 8b 8d c0 00 00 00 8b 95 bc 00 00 00 89 44 11 38 89
> 
> The insn here being "mov %esi,0x3c(%rcx,%rdx,1)" ...
> 
>> [184503.429027] RSP: 0018:c9003e10 EFLAGS: 00010287
>> [184503.429049] RAX: 0042 RBX: c9003e88 RCX: 
>> fffe88800b865a80
> 
> ... I notice corruption to bit 48 of RCX here. This can be a result of
> memory corruption, but prior instances of such that I had to look into
> were bit flips in the CPU instead. Is this a server or desktop class
> CPU?
> 
> Jan

Hi Jan,

Fortunately (or rather unfortunately for me), memtest86 gave errors on one
stick of memory, so that is the probable cause.

Sorry for the noise.

--
Sander


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] xen-unstable (4.14 to be): Assertion '!preempt_count()' failed at preempt.c:36

2019-12-04 Thread Sander Eikelenboom
L.S.,

On current xen-unstable (4.14-to-be) on an AMD CPU:

After rebooting the host, while the guests are starting, I hit the assertion
below. xen-staging-4.13 seems fine on the same machine.

--
Sander


(XEN) [2019-12-04 17:03:25.062] grant_table.c:1808:d7v0 Expanding d7 grant 
table from 3 to 4 frames
(XEN) [2019-12-04 17:03:25.365] Assertion '!preempt_count()' failed at 
preempt.c:36
(XEN) [2019-12-04 17:03:25.365] [ Xen-4.14-unstable  x86_64  debug=y   Not 
tainted ]
(XEN) [2019-12-04 17:03:25.365] CPU:0
(XEN) [2019-12-04 17:03:25.365] RIP:e008:[] 
ASSERT_NOT_IN_ATOMIC+0x46/0x4c
(XEN) [2019-12-04 17:03:25.365] RFLAGS: 00010202   CONTEXT: hypervisor 
(d0v5)
(XEN) [2019-12-04 17:03:25.365] rax: 82d080597020   rbx: 83069fd26000   
rcx: 00a2
(XEN) [2019-12-04 17:03:25.365] rdx:    rsi: 8306b20d38a0   
rdi: 83069fd55b38
(XEN) [2019-12-04 17:03:25.365] rbp: 8300c7c8fee8   rsp: 8300c7c8fee8   
r8:  deadbeefdeadf00d
(XEN) [2019-12-04 17:03:25.365] r9:  deadbeefdeadf00d   r10:    
r11: 
(XEN) [2019-12-04 17:03:25.365] r12:    r13:    
r14: 
(XEN) [2019-12-04 17:03:25.365] r15:    cr0: 80050033   
cr4: 06e0
(XEN) [2019-12-04 17:03:25.365] cr3: 0004630a3000   cr2: 7f602bd4b4f0
(XEN) [2019-12-04 17:03:25.365] fsb: 7f602b08cbc0   gsb: 88807d54   
gss: 
(XEN) [2019-12-04 17:03:25.365] ds:    es:    fs:    gs:    ss: 
e010   cs: e008
(XEN) [2019-12-04 17:03:25.365] Xen code around  
(ASSERT_NOT_IN_ATOMIC+0x46/0x4c):
(XEN) [2019-12-04 17:03:25.365]  58 f6 c4 02 74 06 5d c3 <0f> 0b 0f 0b 0f 0b 55 
48 89 e5 48 8d 05 e6 24 37
(XEN) [2019-12-04 17:03:25.365] Xen stack trace from rsp=8300c7c8fee8:
(XEN) [2019-12-04 17:03:25.365]7cff383700e7 82d080385065 
 7ffe22bf6d90
(XEN) [2019-12-04 17:03:25.365]00305000 7ffe22bf6d90 
88805a2d8700 8880703495f0
(XEN) [2019-12-04 17:03:25.365]0282  
888078435600 
(XEN) [2019-12-04 17:03:25.365] 8100148a 
 
(XEN) [2019-12-04 17:03:25.365]deadbeefdeadf00d 0100 
8100148a e033
(XEN) [2019-12-04 17:03:25.365]0282 c90004dafd88 
e02b 003e6be50048ffe0
(XEN) [2019-12-04 17:03:25.365]003e6d0f00094f03 003e6e03 
003e699d0048ffe0 e010
(XEN) [2019-12-04 17:03:25.365]83069fd26000  
06e0 
(XEN) [2019-12-04 17:03:25.365] 003e0400 
003e78f94ea0
(XEN) [2019-12-04 17:03:25.365] Xen call trace:
(XEN) [2019-12-04 17:03:25.365][] R 
ASSERT_NOT_IN_ATOMIC+0x46/0x4c
(XEN) [2019-12-04 17:03:25.365][] F 
x86_64/entry.S#test_all_events+0x6/0x3d
(XEN) [2019-12-04 17:03:25.365] 
(XEN) [2019-12-04 17:03:26.089] 
(XEN) [2019-12-04 17:03:26.098] 
(XEN) [2019-12-04 17:03:26.117] Panic on CPU 0:
(XEN) [2019-12-04 17:03:26.130] Assertion '!preempt_count()' failed at 
preempt.c:36
(XEN) [2019-12-04 17:03:26.152] 
(XEN) [2019-12-04 17:03:26.171] 
(XEN) [2019-12-04 17:03:26.180] Reboot in five seconds...

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] xen-unstable (4.14 to be): Assertion '!preempt_count()' failed at preempt.c:36

2019-12-04 Thread Sander Eikelenboom
On 04/12/2019 18:30, Jan Beulich wrote:
> On 04.12.2019 18:21, Sander Eikelenboom wrote:
>> On current xen-unstable (4.14 to be) and AMD cpu:
>>
>> After rebooting the host, while the guests are starting, I hit the assertion 
>> below.
>> xen-staging-4.13 seems fine on the same machine.
> 
> Nothing between 4.13 RC4 and the tip of staging stands out,
> so I wonder if you could bisect over this range? Or perhaps
> someone else sees something I don't see (right now).
> 
> Jan

Bisection came up with:

commit cd7dedad8209753e0fc8a97e61d04b74912b53dc
Author: Paul Durrant 
Date:   Fri Nov 15 18:59:30 2019 +

passthrough: simplify locking and logging

Dropping the pcidevs lock between calling device_assigned() and
assign_device() means that the latter has to do the same check as the
former for no obvious gain. Also, since long running operations under
pcidevs lock already drop the lock and return -ERESTART periodically there
is little point in immediately failing an assignment operation with
-ERESTART just because the pcidevs lock could not be acquired (for the
second time, having already blocked on acquiring the lock in
device_assigned()).

This patch instead acquires the lock once for assignment (or test assign)
operations directly in iommu_do_pci_domctl() and thus can remove the
duplicate domain ownership check in assign_device(). Whilst in the
neighbourhood, the patch also removes some debug logging from
assign_device() and deassign_device() and replaces it with proper error
logging, which allows error logging in iommu_do_pci_domctl() to be
removed.

Signed-off-by: Paul Durrant 
Signed-off-by: Igor Druzhinin 
Acked-by: Jan Beulich 



___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] passthrough: add missed pcidevs_unlock following c/s cd7dedad820

2019-12-04 Thread Sander Eikelenboom
On 04/12/2019 22:31, Igor Druzhinin wrote:
> The locking responsibilities have changed and a premature break in
> this section now causes the following assertion:
> 
> Assertion '!preempt_count()' failed at preempt.c:36
> 
> Reported-by: Sander Eikelenboom 
> Signed-off-by: Igor Druzhinin 
> ---
>  xen/drivers/passthrough/pci.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index ced0c28..2593fe4 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1705,6 +1705,7 @@ int iommu_do_pci_domctl(
> seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>  ret = -EINVAL;
>  }
> +pcidevs_unlock();
>  break;
>  }
>  else if ( !ret )
> 

Just tested and it works for me, thanks Igor!
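
For context, here is a toy sketch of why a leaked lock trips that assertion;
this is illustrative only, not the actual Xen code. Taking a spinlock bumps a
per-CPU preempt count that must be back to zero before returning to guest
context, so a break that skips the unlock leaves the count non-zero:

#include <assert.h>

static unsigned int preempt_count;      /* per-CPU in real Xen */

static void my_spin_lock(void)   { preempt_count++; }
static void my_spin_unlock(void) { preempt_count--; }

/* the bug pattern: an early exit that forgets the unlock */
static void buggy_path(int early_break)
{
    my_spin_lock();
    if (early_break)
        return;                         /* lock leaked here */
    my_spin_unlock();
}

int main(void)
{
    buggy_path(1);
    /* later, on the way back to guest context: */
    assert(!preempt_count);             /* fires, like preempt.c:36 */
    return 0;
}

The one-line pcidevs_unlock() in the patch above restores the unlock on that
early-exit path, bringing the count back to zero.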

--
Sander

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Xen Security Advisory 311 v4 (CVE-2019-19577) - Bugs in dynamic height handling for AMD IOMMU pagetables

2019-12-11 Thread Sander Eikelenboom
On 11/12/2019 13:09, Xen.org security team wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> Xen Security Advisory CVE-2019-19577 / XSA-311
>version 4
> 
>  Bugs in dynamic height handling for AMD IOMMU pagetables
> 

> 
> CREDITS
> ===
> 
> This issue was discovered by Sander Eikelenboom, along with Andrew Cooper of
> Citrix.

Ah, so this was why Jan's two patches were skipped; I was about to inquire
whether they would be picked up in some form in the future.

--
Sander


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2] IOMMU: make DMA containment of quarantined devices optional

2019-12-16 Thread Sander Eikelenboom
On 16/12/2019 08:24, Jürgen Groß wrote:
> On 16.12.19 06:58, Tian, Kevin wrote:
>>> From: Jürgen Groß 
>>> Sent: Friday, December 13, 2019 11:36 PM
>>>
>>> On 13.12.19 15:45, Jan Beulich wrote:
 On 13.12.2019 15:24, Jürgen Groß wrote:
> On 13.12.19 15:11, Jan Beulich wrote:
>> On 13.12.2019 14:46, Jürgen Groß wrote:
>>> On 13.12.19 14:38, Jan Beulich wrote:
 On 13.12.2019 14:31, Jürgen Groß wrote:
> Maybe I have misunderstood the current state, but I thought that it
> would just silently hide quirky devices without imposing a security
> risk. We would not learn which devices are quirky, but OTOH I doubt
> we'd get many reports about those in case your patch goes in.

 We don't want or need such reports, that's not the point. The
 security risk comes from the quirkiness of the devices - admins
 may wrongly think all is well and expose quirky devices to not
 sufficiently trusted guests. (I say this fully realizing that
 exposing devices to untrusted guests is almost always a certain
 level of risk.)
>>>
>>> Do we _know_ those devices are problematic from security standpoint?
>>> Normally the IOMMU should do the isolation just fine. If it doesn't
>>> then its not the quirky device which is problematic, but the IOMMU.
>>>
>>> I thought the problem was that the quirky devices would not stop all
>>> (read) DMA even when being unassigned from the guest resulting in
>>> fatal IOMMU faults. The dummy page should stop those faults to
>>> happen
>>> resulting in a more stable system.
>>
>> IOMMU faults by themselves are not impacting stability (they will
>> add processing overhead, yes). The problem, according to Paul's
>> description, is that the occurrence of at least some forms of IOMMU
>> faults (not present ones as it seems, as opposed to permission
>> violation ones) is fatal to certain systems. Irrespective of the
>> sink page used after de-assignment a guest can arrange for IOMMU
>> faults to occur even while it still has the device assigned. Hence
>> it is important for the admin to know that their system (not the
>> the particular device) behaves in this undesirable way.
>
> So how does the admin learn this? Its not as if your patch would result
> in a system crash or hang all the time, right? This would be the case
> only if there either is a malicious (on purpose or due to a bug) guest
> which gets the device assigned, or if there happens to be a pending DMA
> operation when the device gets unassigned.

 I didn't claim the change would cover all cases. All I am claiming
 is that it increases the chances of admins becoming aware of reasons
 not to pass through devices to certain guests.
>>>
>>> So combined with your answer this means to me:
>>>
>>> With your patch (or the original one reverted) a DoS will occur either
>>> due to a malicious guest or in case a DMA is still pending. As a result
>>> the admin will no longer pass this device to any untrusted guest.
>>>
>>> With the current 4.13-staging a DoS will occur only due to a malicious
>>> guest. The admin will then no longer pass this device to any untrusted
>>> guest.
>>>
>>> So right now without any untrusted guest no DoS, while possibly DoS with
>>> your patch. How is that better?
>>>
>>
>> I'd suggest separating run-time DoS from original quarantine purpose
>> of this patch.
>>
>> For quarantine, I'm with Jan that giving admin the chance of knowing
>> whether quarantine is required is important. Say an admin just gets
>> a sample device from a new vendor and needs to decide whether his
>> employer should put such device in their production system. It's
>> essential to have a default configuration which can warn on any
>> possible violation of the expectations on a good assignable device.
>> Then the admin can look at Xen user guide to find out what the warning
>> information means and then does whatever required (usually means
>> more scrutinization than the warning itself) to figure out whether
>> identified problems are safe (e.g. by enabling quarantine) or are
>> real indicators of bogus implementation (then should not use it).
>> Having quarantine default on means that every admin should remember
>> that Xen already enables some band-aids on benign warnings so he
>> should explicitly turn off those options to do evaluation which, to me
>> is not realistic.
> 
> This implies the admin is aware of the necessity to do that testing.
> And for the tests to be conclusive he probably needs to do more than
> just a basic "does it work" test, as the pending DMA might occur in
> some cases only. And that is basically my problem with Jan's default:
> an admin not doing enough testing (or non at all) will end up with a
> DoS situation in production, while an admin knowing that he needs to
> test properly is probably more aware of th

Re: [Xen-devel] [ANNOUNCEMENT] Xen 4.13 is released

2019-12-18 Thread Sander Eikelenboom
On 18/12/2019 18:00, Juergen Gross wrote:
> Dear community members,
> 
> I'm pleased to announce that Xen 4.13.0 is released.
>  
> Thanks everyone who contributed to this release. This release would
> not have happened without all the awesome contributions from around
> the globe.
> 
> Regards,
> 
> Juergen Gross (on behalf of the Xen Project Hypervisor team)

Thanks for your work as release manager !

--
Sander


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH 2/2] libxl_pci: Fix guest shutdown with PCI PT attached

2019-10-10 Thread Sander Eikelenboom
On 01/10/2019 12:35, Anthony PERARD wrote:
> Rewrite of the commit message:
> 
> Before the problematic commit, libxl used to ignore error when
> destroying (force == true) a passthrough device, especially error that
> happens when dealing with the DM.
> 
> Since fae4880c45fe, if the DM failed to detach the pci device within
> the allowed time, the timed out error raised skip part of
> pci_remove_*, but also raise the error up to the caller of
> libxl__device_pci_destroy_all, libxl__destroy_domid, and thus the
> destruction of the domain fails.
> 
> In this patch, if the DM didn't confirmed that the device is removed,
> we will print a warning and keep going if force=true.  The patch
> reorder the functions so that pci_remove_timeout() calls
> pci_remove_detatched() like it's done when DM calls are successful.
> 
> We also clean the QMP states and associated timeouts earlier, as soon
> as they are not needed anymore.
> 
> Reported-by: Sander Eikelenboom 
> Fixes: fae4880c45fe015e567afa223f78bf17a6d98e1b
> Signed-off-by: Anthony PERARD 
> 

Hi Anthony / Chao,

I have to come back to this, because perhaps there is an underlying issue.
It had occurred to me earlier that the VM to which I pass through the most
PCI devices (8, to be exact) had become very slow to shut down, but I didn't
investigate it further.

But after your commit message for this patch it kept nagging, so today I did
some testing and bisecting.

The difference in tear-down time, at least judging from what the IOMMU code
logs, is quite large:

xen-4.12.0
Setup:  7.452 s
Tear-down:  7.626 s

xen-unstable-ee7170822f1fc209f33feb47b268bab35541351d
Setup:  7.468 s
Tear-down: 50.239 s

Bisection turned up:
commit c4b1ef0f89aa6a74faa4618ce3efed1de246ec40
Author: Chao Gao 
Date:   Fri Jul 19 10:24:08 2019 +0100
libxl_qmp: wait for completion of device removal

Which makes me wonder whether something is going wrong in QEMU?

--
Sander
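
For the curious: the tear-down figure above is simply the delta between the
first and last AMD-Vi re-assign timestamps in the logs below. A minimal C
sketch of that arithmetic, with the two stamps hard-coded from the
xen-unstable tear-down log:

#include <stdio.h>

/* Seconds represented by an "HH:MM:SS.mmm" timestamp as printed by
 * xl dmesg with console_timestamps=datems; assumes both stamps fall
 * on the same day. */
static double stamp_to_sec(const char *s)
{
    int h, m, sec, ms;

    if (sscanf(s, "%d:%d:%d.%d", &h, &m, &sec, &ms) != 4)
        return -1.0;
    return h * 3600.0 + m * 60.0 + sec + ms / 1000.0;
}

int main(void)
{
    /* first and last re-assign stamps of the tear-down below */
    printf("tear-down: %.3f s\n",
           stamp_to_sec("10:24:43.406") - stamp_to_sec("10:23:53.167"));
    return 0;
}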



xen-4.12.0 setup:
(XEN) [2019-10-10 09:54:14.846] AMD-Vi: Disable: device id = 0x900, 
domain = 0, paging mode = 3
(XEN) [2019-10-10 09:54:14.846] AMD-Vi: Setup I/O page table: device id 
= 0x900, type = 0x1, root table = 0x4aa847000, domain = 1, paging mode = 3
(XEN) [2019-10-10 09:54:14.846] AMD-Vi: Re-assign :09:00.0 from 
dom0 to dom1
...
(XEN) [2019-10-10 09:54:22.298] AMD-Vi: Disable: device id = 0x907, 
domain = 0, paging mode = 3
(XEN) [2019-10-10 09:54:22.298] AMD-Vi: Setup I/O page table: device id 
= 0x907, type = 0x1, root table = 0x4aa847000, domain = 1, paging mode = 3
(XEN) [2019-10-10 09:54:22.298] AMD-Vi: Re-assign :09:00.7 from 
dom0 to dom1


xen-4.12.0 tear-down:
(XEN) [2019-10-10 10:01:11.971] AMD-Vi: Disable: device id = 0x900, 
domain = 1, paging mode = 3
(XEN) [2019-10-10 10:01:11.971] AMD-Vi: Setup I/O page table: device id 
= 0x900, type = 0x1, root table = 0x53572c000, domain = 0, paging mode = 3
(XEN) [2019-10-10 10:01:11.971] AMD-Vi: Re-assign :09:00.0 from 
dom1 to dom0
...
(XEN) [2019-10-10 10:01:19.597] AMD-Vi: Disable: device id = 0x907, 
domain = 1, paging mode = 3
(XEN) [2019-10-10 10:01:19.597] AMD-Vi: Setup I/O page table: device id 
= 0x907, type = 0x1, root table = 0x53572c000, domain = 0, paging mode = 3
(XEN) [2019-10-10 10:01:19.597] AMD-Vi: Re-assign :09:00.7 from 
dom1 to dom0

xen-unstable-ee7170822f1fc209f33feb47b268bab35541351d setup:
(XEN) [2019-10-10 10:21:38.549] d1: bind: m_gsi=47 g_gsi=36 dev=00.00.5 
intx=0
(XEN) [2019-10-10 10:21:38.621] AMD-Vi: Disable: device id = 0x900, 
domain = 0, paging mode = 3
(XEN) [2019-10-10 10:21:38.621] AMD-Vi: Setup I/O page table: device id 
= 0x900, type = 0x1, root table = 0x4aa83b000, domain = 1, paging mode = 3
(XEN) [2019-10-10 10:21:38.621] AMD-Vi: Re-assign :09:00.0 from 
dom0 to dom1
...
(XEN) [2019-10-10 10:21:46.069] d1: bind: m_gsi=46 g_gsi=36 dev=00.01.4 
intx=3
(XEN) [2019-10-10 10:21:46.089] AMD-Vi: Disable: device id = 0x907, 
domain = 0, paging mode = 3
(XEN) [2019-10-10 10:21:46.089] AMD-Vi: Setup I/O page table: device id 
= 0x907, type = 0x1, root table = 0x4aa83b000, domain = 1, paging mode = 3
(XEN) [2019-10-10 10:21:46.089] AMD-Vi: Re-assign :09:00.7 from 
dom0 to dom1


xen-unstable-ee7170822f1fc209f33feb47b268bab35541351d tear-down:
(XEN) [2019-10-10 10:23:53.167] AMD-Vi: Disable: device id = 0x900, 
domain = 1, paging mode = 3
(XEN) [2019-10-10 10:23:53.167] AMD-Vi: Setup I/O page table: device id 
= 0x900, type = 0x1, root table = 0x5240f8000, domain = 0, paging mode = 3
(XEN) [2019-10-10 10:23:53.167] AMD-Vi: Re-assign :09:00.0 from 
dom1 to dom0
...
(XEN) [2019-10-10 10:24:43.406] AMD-Vi: Disable: devi

[Xen-devel] Xen-unstable 4.13.0-rc0 problem starting guest while trying to passthrough multiple pci devices

2019-10-15 Thread Sander Eikelenboom
Hi Anthony,

While testing xen-unstable 4.13.0-rc0 I ran into the following issue:

When passing through all 8 functions of a PCI(e) device I can't start the
guest anymore; note that the trouble only starts at 0:9:0.3, not at 0:9:0.0:
libxl: error: libxl_qmp.c:1270:qmp_ev_connect: Domain 3:Failed to 
connect to QMP socket /var/run/xen/qmp-libxl-3: Resource temporarily unavailable
libxl: error: libxl_pci.c:1702:device_pci_add_done: Domain 
3:libxl__device_pci_add  failed for PCI device 0:9:0.3 (rc -3)
libxl: error: libxl_qmp.c:1270:qmp_ev_connect: Domain 3:Failed to 
connect to QMP socket /var/run/xen/qmp-libxl-3: Resource temporarily unavailable
libxl: error: libxl_pci.c:1702:device_pci_add_done: Domain 
3:libxl__device_pci_add  failed for PCI device 0:9:0.4 (rc -3)
libxl: error: libxl_qmp.c:1270:qmp_ev_connect: Domain 3:Failed to 
connect to QMP socket /var/run/xen/qmp-libxl-3: Resource temporarily unavailable
libxl: error: libxl_pci.c:1702:device_pci_add_done: Domain 
3:libxl__device_pci_add  failed for PCI device 0:9:0.5 (rc -3)
libxl: error: libxl_qmp.c:1270:qmp_ev_connect: Domain 3:Failed to 
connect to QMP socket /var/run/xen/qmp-libxl-3: Resource temporarily unavailable
libxl: error: libxl_pci.c:1702:device_pci_add_done: Domain 
3:libxl__device_pci_add  failed for PCI device 0:9:0.6 (rc -3)
libxl: error: libxl_qmp.c:1270:qmp_ev_connect: Domain 3:Failed to 
connect to QMP socket /var/run/xen/qmp-libxl-3: Resource temporarily unavailable
libxl: error: libxl_pci.c:1702:device_pci_add_done: Domain 
3:libxl__device_pci_add  failed for PCI device 0:9:0.7 (rc -3)
libxl: error: libxl_create.c:1609:domcreate_attach_devices: Domain 
3:unable to add pci devices
libxl: error: libxl_domain.c:1177:libxl__destroy_domid: Domain 
3:Non-existant domain
libxl: error: libxl_domain.c:1131:domain_destroy_callback: Domain 
3:Unable to destroy guest
libxl: error: libxl_domain.c:1058:domain_destroy_cb: Domain 
3:Destruction of domain failed

When passing through only the first functions, 0:9:0.0 to 0:9:0.2, the guest
starts.

--
Sander

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH 2/2] libxl_pci: Fix guest shutdown with PCI PT attached

2019-10-15 Thread Sander Eikelenboom
On 14/10/2019 17:03, Chao Gao wrote:
> On Thu, Oct 10, 2019 at 06:13:43PM +0200, Sander Eikelenboom wrote:
>> On 01/10/2019 12:35, Anthony PERARD wrote:
>>> Rewrite of the commit message:
>>>
>>> Before the problematic commit, libxl used to ignore error when
>>> destroying (force == true) a passthrough device, especially error that
>>> happens when dealing with the DM.
>>>
>>> Since fae4880c45fe, if the DM failed to detach the pci device within
>>> the allowed time, the timed out error raised skip part of
>>> pci_remove_*, but also raise the error up to the caller of
>>> libxl__device_pci_destroy_all, libxl__destroy_domid, and thus the
>>> destruction of the domain fails.
>>>
>>> In this patch, if the DM didn't confirmed that the device is removed,
>>> we will print a warning and keep going if force=true.  The patch
>>> reorder the functions so that pci_remove_timeout() calls
>>> pci_remove_detatched() like it's done when DM calls are successful.
>>>
>>> We also clean the QMP states and associated timeouts earlier, as soon
>>> as they are not needed anymore.
>>>
>>> Reported-by: Sander Eikelenboom 
>>> Fixes: fae4880c45fe015e567afa223f78bf17a6d98e1b
>>> Signed-off-by: Anthony PERARD 
>>>
>>
>> Hi Anthony / Chao,
>>
>> I have to come back to this, a bit because perhaps there is an underlying 
>> issue.
>> While it earlier occurred to me that the VM to which I passed through most 
>> pci-devices 
>> (8 to be exact) became very slow to shutdown, but I  didn't investigate it 
>> further.
>>
>> But after you commit messages from this patch it kept nagging, so today I 
>> did some testing
>> and bisecting.
>>
>> The difference in tear-down time at least from what the IOMMU code logs is 
>> quite large:
>>
>> xen-4.12.0
>>  Setup:  7.452 s
>>  Tear-down:  7.626 s
>>
>> xen-unstable-ee7170822f1fc209f33feb47b268bab35541351d
>>  Setup:  7.468 s
>>  Tear-down: 50.239 s
>>
>> Bisection turned up:
>>  commit c4b1ef0f89aa6a74faa4618ce3efed1de246ec40
>>  Author: Chao Gao 
>>  Date:   Fri Jul 19 10:24:08 2019 +0100
>>  libxl_qmp: wait for completion of device removal
>>
>> Which makes me wonder if there is something going wrong in Qemu ?
 
> Hi Sander,
Hi Chao,

> 
> Thanks for your testing and the bisection.
> 
> I tried on my machine, the destruction time of a guest with 8 pass-thru
> devices increased from 4s to 12s after applied the commit above.

To which patch are you referring, Anthony's or
c4b1ef0f89aa6a74faa4618ce3efed1de246ec40?

> In my understanding, I guess you might get the error message "timed out
> waiting for DM to remove...". There might be some issues on your assigned
> devices' drivers. You can first unbind the devices with their drivers in
> VM and then tear down the VM, and check whether the VM teardown gets
> much faster.

I get that error message when I test with Anthony's patch applied; the
destruction time with that patch is low.

However, my point was whether that patch is correct, in the sense that there
seems to be an underlying issue which causes the removal to take so long.
That issue was uncovered by c4b1ef0f89aa6a74faa4618ce3efed1de246ec40, so I'm
not saying that commit is wrong in any sense; it just uncovered another issue
that was already present, but hard to detect, as we simply didn't wait at
destruction time (and thus got the same effect as a timeout).

Either way, that was only a minor issue until
fae4880c45fe015e567afa223f78bf17a6d98e1b, where the long destruction time now
caused the domain destruction to stall. That was then fixed by Anthony's
patch, but that patch uses a timeout, which kind of circumvents the issue
instead of finding out where it comes from and solving it there (if that is
possible, of course).

And I wonder whether Anthony's patch doesn't interfere with the case you made
c4b1ef0f89aa6a74faa4618ce3efed1de246ec40 for: if you get the timeout error
message as well, then that is kind of not waiting for the destruction to
finish, isn't it?

Chao,
could you perhaps test Xen for me at commit
ee7170822f1fc209f33feb47b268bab35541351d?
That is before Anthony's patch series, but after your
c4b1ef0f89aa6a74faa4618ce3efed1de246ec40.

I would expect to see longer destruction times in the case of 8 pass-through
devices there as well.

Unfortunately QEMU doesn't seem to do much verbose logging even when I enable
the debug defines in hw/xen, especially for the destruction side of things
(it mostly logs setting up stuff).

Re: [Xen-devel] [PATCH 2/2] libxl_pci: Fix guest shutdown with PCI PT attached

2019-10-15 Thread Sander Eikelenboom
On 15/10/2019 18:59, Sander Eikelenboom wrote:
> On 14/10/2019 17:03, Chao Gao wrote:
>> On Thu, Oct 10, 2019 at 06:13:43PM +0200, Sander Eikelenboom wrote:
>>> On 01/10/2019 12:35, Anthony PERARD wrote:
>>>> Rewrite of the commit message:
>>>>
>>>> Before the problematic commit, libxl used to ignore error when
>>>> destroying (force == true) a passthrough device, especially error that
>>>> happens when dealing with the DM.
>>>>
>>>> Since fae4880c45fe, if the DM failed to detach the pci device within
>>>> the allowed time, the timed out error raised skip part of
>>>> pci_remove_*, but also raise the error up to the caller of
>>>> libxl__device_pci_destroy_all, libxl__destroy_domid, and thus the
>>>> destruction of the domain fails.
>>>>
>>>> In this patch, if the DM didn't confirmed that the device is removed,
>>>> we will print a warning and keep going if force=true.  The patch
>>>> reorder the functions so that pci_remove_timeout() calls
>>>> pci_remove_detatched() like it's done when DM calls are successful.
>>>>
>>>> We also clean the QMP states and associated timeouts earlier, as soon
>>>> as they are not needed anymore.
>>>>
>>>> Reported-by: Sander Eikelenboom 
>>>> Fixes: fae4880c45fe015e567afa223f78bf17a6d98e1b
>>>> Signed-off-by: Anthony PERARD 
>>>>
>>>
>>> Hi Anthony / Chao,
>>>
>>> I have to come back to this, a bit because perhaps there is an underlying 
>>> issue.
>>> While it earlier occurred to me that the VM to which I passed through most 
>>> pci-devices 
>>> (8 to be exact) became very slow to shutdown, but I  didn't investigate it 
>>> further.
>>>
>>> But after you commit messages from this patch it kept nagging, so today I 
>>> did some testing
>>> and bisecting.
>>>
>>> The difference in tear-down time at least from what the IOMMU code logs is 
>>> quite large:
>>>
>>> xen-4.12.0
>>> Setup:  7.452 s
>>> Tear-down:  7.626 s
>>>
>>> xen-unstable-ee7170822f1fc209f33feb47b268bab35541351d
>>> Setup:  7.468 s
>>> Tear-down: 50.239 s
>>>
>>> Bisection turned up:
>>> commit c4b1ef0f89aa6a74faa4618ce3efed1de246ec40
>>> Author: Chao Gao 
>>> Date:   Fri Jul 19 10:24:08 2019 +0100
>>> libxl_qmp: wait for completion of device removal
>>>
>>> Which makes me wonder if there is something going wrong in Qemu ?
>  
>> Hi Sander,
> Hi Chao,
> 
>>
>> Thanks for your testing and the bisection.
>>
>> I tried on my machine, the destruction time of a guest with 8 pass-thru
>> devices increased from 4s to 12s after applied the commit above.
> 
> To what patch are you referring Anthony's or 
> c4b1ef0f89aa6a74faa4618ce3efed1de246ec40 ?
> 
>> In my understanding, I guess you might get the error message "timed out
>> waiting for DM to remove...". There might be some issues on your assigned
>> devices' drivers. You can first unbind the devices with their drivers in
>> VM and then tear down the VM, and check whether the VM teardown gets
>> much faster.

Sorry, I forgot to answer your question: I tried unbinding the drivers in
the guest prior to shutting it down, but it didn't make any difference.

--
Sander


> I get that error message when I test with Anthony's patch applied, the 
> destruction time with that patch is low.
> 
> How ever my point was if that patch is correct in the sense that there seems 
> to be an underlying issue 
> which causes it to take so long. That issue was uncovered by 
> c4b1ef0f89aa6a74faa4618ce3efed1de246ec40, so I'm not
> saying that commit is wrong in any sense, it just uncovered another issue 
> that was already present,
> but hard to detect as we just didn't wait at destruction time (and thus the 
> same effect as a timeout).
> 
> One or the other way that was just a minor issue until 
> fae4880c45fe015e567afa223f78bf17a6d98e1b, where the long
> destruction time now caused the domain destruction to stall, which was then 
> fixed by Antony's patch, but that uses
> a timeout which kinds of circumvents the issue, instead of finding out where 
> is comes from and solve it there (
> if that is possible of course).
> 
> And I wonder if Anthony's patch doesn't interfere with the case you made 
> c4b1ef0f89aa6a74faa4618ce3efed1de246ec40 for

Re: [Xen-devel] [PATCH 2/2] libxl_pci: Fix guest shutdown with PCI PT attached

2019-10-18 Thread Sander Eikelenboom
On 18/10/2019 18:11, Anthony PERARD wrote:
> On Mon, Oct 14, 2019 at 11:03:43PM +0800, Chao Gao wrote:
>> On Thu, Oct 10, 2019 at 06:13:43PM +0200, Sander Eikelenboom wrote:
>>> Hi Anthony / Chao,
>>>
>>> I have to come back to this, a bit because perhaps there is an underlying 
>>> issue.
>>> While it earlier occurred to me that the VM to which I passed through most 
>>> pci-devices 
>>> (8 to be exact) became very slow to shutdown, but I  didn't investigate it 
>>> further.
>>>
>>> But after you commit messages from this patch it kept nagging, so today I 
>>> did some testing
>>> and bisecting.
>>>
>>> The difference in tear-down time at least from what the IOMMU code logs is 
>>> quite large:
>>>
>>> xen-4.12.0
>>> Setup:  7.452 s
>>> Tear-down:  7.626 s
>>>
>>> xen-unstable-ee7170822f1fc209f33feb47b268bab35541351d
>>> Setup:  7.468 s
>>> Tear-down: 50.239 s
>>>
>>> Bisection turned up:
>>> commit c4b1ef0f89aa6a74faa4618ce3efed1de246ec40
>>> Author: Chao Gao 
>>> Date:   Fri Jul 19 10:24:08 2019 +0100
>>> libxl_qmp: wait for completion of device removal
>>>
>>> Which makes me wonder if there is something going wrong in Qemu ?
>>
>> Hi Sander,
>>
>> Thanks for your testing and the bisection.
>>
>> I tried on my machine, the destruction time of a guest with 8 pass-thru
>> devices increased from 4s to 12s after applied the commit above. In my
>> understanding, I guess you might get the error message "timed out
>> waiting for DM to remove...". There might be some issues on your assigned
>> devices' drivers. You can first unbind the devices with their drivers in
>> VM and then tear down the VM, and check whether the VM teardown gets
>> much faster.
> 
> Hi,
> 
> Chao, I think you've tested `xl destroy`, and Sander, I think your are
> speaking about `xl shutdown` or simply power off of a guest. Well, these
> two operations are a bit different, on destroy the guest kernel is
> still running when the pci devices are been removed, but on shutdown the
> guest kernel is gone.
> 
> I don't think there's anything wrong with QEMU or with the devices, it
> just that when the toolstack ask QEMU to unplug the pci device, QEMU
> will ask the guest kernel first. So the guest may never acknowledge the
> removal and QEMU will not let go of the pci device. There is actually an
> old Xen commit about that:
> 77fea72b068d25afb7e63947aba32b487d7124a2, and a comment in the code:
> /* This depends on guest operating system acknowledging the
>  * SCI, if it doesn't respond in time then we may wish to
>  * force the removal. */

Hi Anthony,

Correct, I was referring to the behavior with "xl shutdown".

The above is interesting; my follow-up question would be:
does QEMU know of / keep any state indicating when the guest kernel is gone?
Because if it does, we could make the need to acknowledge removal
conditional on that.
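
A toy sketch of what such a bounded wait could look like -- purely
illustrative, the names are made up, and libxl really drives this through its
event machinery rather than a sleep loop: give the DM a window to acknowledge
the unplug, then force the removal once the deadline passes:

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* stand-in for "QEMU reported the device gone"; always false here,
 * as if the guest kernel is already gone and will never ack */
static bool dm_acknowledged(void) { return false; }

static void remove_device(bool force, int timeout_s)
{
    time_t deadline = time(NULL) + timeout_s;

    while (time(NULL) < deadline) {
        if (dm_acknowledged()) {
            printf("DM confirmed removal\n");
            return;
        }
        struct timespec ts = { 0, 100 * 1000 * 1000 };  /* 100 ms */
        nanosleep(&ts, NULL);
    }
    if (force)
        printf("warning: DM did not confirm; forcing removal\n");
}

int main(void) { remove_device(true, 1); return 0; }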

--
Sander


>> Anthony & Wei,
>>
>> The commit above basically serializes and synchronizes detaching
>> assigned devices and thus increases VM teardown time significantly if
>> there are multiple assigned devices. The commit aimed to avoid qemu's
>> access to PCI configuration space coinciding with the device reset
>> initiated by xl (which is not desired and is exactly the case which
>> triggers the assertion in Xen [1]). I personally insist that xl should
>> wait for DM's completion of device detaching. Otherwise, besides Xen
>> panic (which can be fixed in another way), in theory, such sudden
>> unawared device reset might cause a disaster (e.g. data loss for a
>> storage device).
>>
>> [1]: 
>> https://lists.xenproject.org/archives/html/xen-devel/2019-09/msg03287.html
>>
>> But considering fast creation and teardown is an important benefit of
>> virtualization, I am not sure how to deal with the situation. Anyway,
>> you can make the decision. To fix the regression on VM teardown, we can
>> revert the commit by removing the timeout logic.
>>
>> What's your opinion?
> 
> It probably a good idea to wait a bit until QEMU has detach the device.
> For cases where QEMU will never detach the device (the guest kernel is
> shutdown), we could reduce the timeout. Following my changes to pci
> passthrough handling in libxl, the timeout is 10s for one device (and
> probably 10s for many; I don't think libxl will even ask qemu to remove
> the other devices if the first one timeout).
> 
> So, maybe we could wait for 5s for QEMU to detach the pci device? As
> past that time, it will probably never happen. This still mean about +5s
> to tear-down compare to previous releases. (Or maybe +5s per device if
> we have to do one device at a time.)
> 
> There are other issues with handling multiple pci passthrough devices,
> so I don't have patches yet.
> 
> Cheers,
> 


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC XEN PATCH for-4.13 0/4] Fix: libxl workaround, multiple connection to single QMP socket

2019-10-26 Thread Sander Eikelenboom
On 25/10/2019 19:05, Anthony PERARD wrote:
> Patch series available in this git branch:
> https://xenbits.xen.org/git-http/people/aperard/xen-unstable.git 
> br.fix-ev_qmp-multi-connect-v1
> 
> Hi,
> 
> QEMU's QMP socket doesn't allow multiple concurrent connections. Also, it
> listen()s on the socket with a `backlog' of only 1. On Linux at least, once
> that backlog is filled, connect() will return EAGAIN if the socket fd is
> non-blocking. libxl may make many concurrent connect() attempts if for
> example a guest is started with several PCI passthrough devices, and a
> connect() failure leads to a failure to start the guest.

Hi Anthony,

Just tested with the patch series and it fixes my issue with starting a
guest with several PCI passthrough devices.
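
For the archives: the backlog behaviour described above is easy to
reproduce with plain sockets. A minimal sketch, assuming Linux semantics
for non-blocking AF_UNIX sockets and a made-up socket path, with most
error handling omitted:

/*
 * listen() with a tiny backlog and never accept(), like a busy QEMU;
 * once the backlog is full, further non-blocking connect()s fail with
 * EAGAIN on Linux instead of EINPROGRESS.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_un sun = { .sun_family = AF_UNIX };
    int srv, i;

    strncpy(sun.sun_path, "/tmp/qmp-demo.sock", sizeof(sun.sun_path) - 1);
    unlink(sun.sun_path);

    srv = socket(AF_UNIX, SOCK_STREAM, 0);
    bind(srv, (struct sockaddr *)&sun, sizeof(sun));
    listen(srv, 1);                       /* QEMU's tiny QMP backlog */

    for ( i = 0; i < 4; i++ )
    {
        int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);

        /* Keep earlier fds open so they stay queued in the backlog. */
        if ( connect(fd, (struct sockaddr *)&sun, sizeof(sun)) == -1 )
            printf("connect %d: %s\n", i, strerror(errno));
        else
            printf("connect %d: ok\n", i);
    }
    return 0;
}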

Thanks,

Sander


> Since we can't change the listen()'s `backlog' that QEMU uses, we need other
> ways to work around the issue. This patch series introduces a lock to acquire
> before attempting to connect() to the QMP socket. Since the lock might be held
> for too long, the series also introduces a way to cancel the acquisition of the
> lock, which means killing the process that tries to get the lock.
> 
> Alternatively to this craziness, it might be possible to increase the
> `backlog' value by having libxl open the QMP socket on behalf of QEMU. But
> this is only possible with a recent version of QEMU (2.12 or newer, released
> in Apr 2018, or qemu-xen-4.12 or newer). It would involve discovering QEMU's
> capabilities before we start the DM, which libxl isn't capable of yet.
> 
> Cheers,
> 
> Anthony PERARD (4):
>   libxl: Introduce libxl__ev_child_kill
>   libxl: Introduce libxl__ev_qmplock
>   libxl: libxl__ev_qmp_send now takes an egc
>   libxl_qmp: Have a lock for QMP socket access
> 
>  tools/libxl/libxl_disk.c|  6 +--
>  tools/libxl/libxl_dm.c  |  8 ++--
>  tools/libxl/libxl_dom_save.c|  2 +-
>  tools/libxl/libxl_dom_suspend.c |  2 +-
>  tools/libxl/libxl_domain.c  |  8 ++--
>  tools/libxl/libxl_event.c   |  3 +-
>  tools/libxl/libxl_fork.c| 55 
>  tools/libxl/libxl_internal.c| 31 +-
>  tools/libxl/libxl_internal.h| 53 +--
>  tools/libxl/libxl_pci.c |  8 ++--
>  tools/libxl/libxl_qmp.c | 75 +++++
>  tools/libxl/libxl_usb.c | 28 ++--
>  12 files changed, 219 insertions(+), 60 deletions(-)
> 


-- 

Met vriendelijke groet,

Sander Eikelenboom
mailto:san...@eikelenboom.it

Eikelenboom IT Services
Kaapseweg 70
5642 HK Eindhoven
M: 06-14387484

PGP public key for san...@eikelenboom.it:
key id: 0xC4B99EEDECF2AE69
fingerprint: 07BB B819 FF93 E54D 5F5C  0BDE C4B9 9EED ECF2 AE69


[Xen-devel] Xen-unstable: AMD-Vi: update_paging_mode Try to access pdev_list without aquiring pcidevs_lock.

2019-10-28 Thread Sander Eikelenboom
Hi Jan / Andrew,

While testing the latest xen-unstable and starting an HVM guest with
pci-passthrough on my AMD machine,
my eye caught the following messages in xl dmesg that I hadn't seen before:

(XEN) [2019-10-28 10:23:16.372] AMD-Vi: update_paging_mode Try to access 
pdev_list without aquiring pcidevs_lock.
(XEN) [2019-10-28 10:24:08.136] AMD-Vi: INVALID_DEV_REQUEST 0800 8a00 
f8000240 00fd

Probably something from the AMD iommu rework that got committed lately ?
If you need a complete debug log from host boot, or want me to run some
debug patches, please let me know.

--
Sander




(XEN) [2019-10-28 10:23:16.372] AMD-Vi: update_paging_mode Try to access 
pdev_list without aquiring pcidevs_lock.
(XEN) [2019-10-28 10:23:16.605] HVM d18v0 save: CPU
(XEN) [2019-10-28 10:23:16.605] HVM d18v1 save: CPU
(XEN) [2019-10-28 10:23:16.605] HVM d18v2 save: CPU
(XEN) [2019-10-28 10:23:16.605] HVM d18v3 save: CPU
(XEN) [2019-10-28 10:23:16.605] HVM d18 save: PIC
(XEN) [2019-10-28 10:23:16.605] HVM d18 save: IOAPIC
(XEN) [2019-10-28 10:23:16.605] HVM d18v0 save: LAPIC
(XEN) [2019-10-28 10:23:16.605] HVM d18v1 save: LAPIC
(XEN) [2019-10-28 10:23:16.605] HVM d18v2 save: LAPIC
(XEN) [2019-10-28 10:23:16.605] HVM d18v3 save: LAPIC
(XEN) [2019-10-28 10:23:16.605] HVM d18v0 save: LAPIC_REGS
(XEN) [2019-10-28 10:23:16.605] HVM d18v1 save: LAPIC_REGS
(XEN) [2019-10-28 10:23:16.605] HVM d18v2 save: LAPIC_REGS
(XEN) [2019-10-28 10:23:16.605] HVM d18v3 save: LAPIC_REGS
(XEN) [2019-10-28 10:23:16.605] HVM d18 save: PCI_IRQ
(XEN) [2019-10-28 10:23:16.605] HVM d18 save: ISA_IRQ
(XEN) [2019-10-28 10:23:16.605] HVM d18 save: PCI_LINK
(XEN) [2019-10-28 10:23:16.605] HVM d18 save: PIT
(XEN) [2019-10-28 10:23:16.605] HVM d18 save: RTC
(XEN) [2019-10-28 10:23:16.605] HVM d18 save: HPET
(XEN) [2019-10-28 10:23:16.605] HVM d18 save: PMTIMER
(XEN) [2019-10-28 10:23:16.605] HVM d18v0 save: MTRR
(XEN) [2019-10-28 10:23:16.605] HVM d18v1 save: MTRR
(XEN) [2019-10-28 10:23:16.605] HVM d18v2 save: MTRR
(XEN) [2019-10-28 10:23:16.605] HVM d18v3 save: MTRR
(XEN) [2019-10-28 10:23:16.605] HVM d18 save: VIRIDIAN_DOMAIN
(XEN) [2019-10-28 10:23:16.605] HVM d18v0 save: CPU_XSAVE
(XEN) [2019-10-28 10:23:16.605] HVM d18v1 save: CPU_XSAVE
(XEN) [2019-10-28 10:23:16.605] HVM d18v2 save: CPU_XSAVE
(XEN) [2019-10-28 10:23:16.605] HVM d18v3 save: CPU_XSAVE
(XEN) [2019-10-28 10:23:16.605] HVM d18v0 save: VIRIDIAN_VCPU
(XEN) [2019-10-28 10:23:16.605] HVM d18v1 save: VIRIDIAN_VCPU
(XEN) [2019-10-28 10:23:16.605] HVM d18v2 save: VIRIDIAN_VCPU
(XEN) [2019-10-28 10:23:16.605] HVM d18v3 save: VIRIDIAN_VCPU
(XEN) [2019-10-28 10:23:16.605] HVM d18v0 save: VMCE_VCPU
(XEN) [2019-10-28 10:23:16.605] HVM d18v1 save: VMCE_VCPU
(XEN) [2019-10-28 10:23:16.605] HVM d18v2 save: VMCE_VCPU
(XEN) [2019-10-28 10:23:16.605] HVM d18v3 save: VMCE_VCPU
(XEN) [2019-10-28 10:23:16.605] HVM d18v0 save: TSC_ADJUST
(XEN) [2019-10-28 10:23:16.605] HVM d18v1 save: TSC_ADJUST
(XEN) [2019-10-28 10:23:16.605] HVM d18v2 save: TSC_ADJUST
(XEN) [2019-10-28 10:23:16.605] HVM d18v3 save: TSC_ADJUST
(XEN) [2019-10-28 10:23:16.605] HVM d18v0 save: CPU_MSR
(XEN) [2019-10-28 10:23:16.605] HVM d18v1 save: CPU_MSR
(XEN) [2019-10-28 10:23:16.605] HVM d18v2 save: CPU_MSR
(XEN) [2019-10-28 10:23:16.605] HVM d18v3 save: CPU_MSR
(XEN) [2019-10-28 10:23:16.605] HVM18 restore: CPU 0
(XEN) [2019-10-28 10:23:21.950] d18: bind: m_gsi=37 g_gsi=36 dev=00.00.5 intx=0
(XEN) [2019-10-28 10:23:21.976] AMD-Vi: Disable: device id = 0x800, domain = 0, 
paging mode = 3
(XEN) [2019-10-28 10:23:21.976] AMD-Vi: Setup I/O page table: device id = 
0x800, type = 0x1, root table = 0x40f3c1000, domain = 18, paging mode = 3
(XEN) [2019-10-28 10:23:21.976] AMD-Vi: Re-assign 0000:08:00.0 from dom0 to 
dom18
(d18) [2019-10-28 10:23:22.030] HVM Loader
(d18) [2019-10-28 10:23:22.030] Detected Xen v4.13.0-rc
(d18) [2019-10-28 10:23:22.030] Xenbus rings @0xfeffc000, event channel 1
(d18) [2019-10-28 10:23:22.030] System requested SeaBIOS
(d18) [2019-10-28 10:23:22.030] CPU speed is 3200 MHz
(d18) [2019-10-28 10:23:22.030] Relocating guest memory for lowmem MMIO space 
disabled
(XEN) [2019-10-28 10:23:22.039] irq.c:374: Dom18 PCI link 0 changed 0 -> 5
(d18) [2019-10-28 10:23:22.039] PCI-ISA link 0 routed to IRQ5
(XEN) [2019-10-28 10:23:22.048] irq.c:374: Dom18 PCI link 1 changed 0 -> 10
(d18) [2019-10-28 10:23:22.048] PCI-ISA link 1 routed to IRQ10
(XEN) [2019-10-28 10:23:22.056] irq.c:374: Dom18 PCI link 2 changed 0 -> 11
(d18) [2019-10-28 10:23:22.056] PCI-ISA link 2 routed to IRQ11
(XEN) [2019-10-28 10:23:22.063] irq.c:374: Dom18 PCI link 3 changed 0 -> 5
(d18) [2019-10-28 10:23:22.063] PCI-ISA link 3 routed to IRQ5
(d18) [2019-10-28 10:23:22.101] pci dev 01:3 INTA->IRQ10
(d18) [2019-10-28 10:23:22.103] pci dev 02:0 INTA->IRQ11
(d18) [2019-10-28 10:23:22.107] pci dev 04:0 INTA->IRQ5
(d18) [2019-10-28 10:23:22.109] pci dev 05:0 INTA->IRQ10
(d18) [2019-10-28 10:23:22.122] RAM in high memory; setting high_mem resource 
base to 10f8

Re: [Xen-devel] Xen-unstable: AMD-Vi: update_paging_mode Try to access pdev_list without aquiring pcidevs_lock.

2019-10-30 Thread Sander Eikelenboom
On 30/10/2019 16:48, Jan Beulich wrote:
> On 28.10.2019 11:32, Sander Eikelenboom wrote:
>> While testing the latest xen-unstable and starting an HVM guest with 
>> pci-passtrough on my AMD machine,
>> my eye catched the following messages in xl dmesg I haven't seen before:
>>
>> (XEN) [2019-10-28 10:23:16.372] AMD-Vi: update_paging_mode Try to access 
>> pdev_list without aquiring pcidevs_lock.
> 
> Unfortunately this sits on the map/unmap path, and hence the
> violator is far up one of the many call chains. Therefore I'd
> like to ask that you rebuild and retry with the debugging
> patch below. In case you observe multiple different call
> trees, post them all please.
> 
> Jan

Hi Jan,

Call trace seems to be the same in all cases.

--
Sander


(XEN) [2019-10-30 22:07:05.748] AMD-Vi: update_paging_mode Try to access 
pdev_list without aquiring pcidevs_lock.
(XEN) [2019-10-30 22:07:05.748] [ Xen-4.13.0-rc  x86_64  debug=y   Not 
tainted ]
(XEN) [2019-10-30 22:07:05.748] CPU:1
(XEN) [2019-10-30 22:07:05.748] RIP:e008:[] 
iommu_map.c#update_paging_mode+0x1f2/0x3eb
(XEN) [2019-10-30 22:07:05.748] RFLAGS: 00010286   CONTEXT: hypervisor 
(d0v2)
(XEN) [2019-10-30 22:07:05.748] rax: 830523f9   rbx: 82e004905f00   
rcx: 
(XEN) [2019-10-30 22:07:05.748] rdx: 0001   rsi: 000a   
rdi: 82d0804a0698
(XEN) [2019-10-30 22:07:05.748] rbp: 830523f9f848   rsp: 830523f9f808   
r8:  8305320a
(XEN) [2019-10-30 22:07:05.748] r9:  0038   r10: 0002   
r11: 000a
(XEN) [2019-10-30 22:07:05.748] r12: 82e004905f00   r13: 0003   
r14: 0003
(XEN) [2019-10-30 22:07:05.748] r15: 83041fb83000   cr0: 80050033   
cr4: 06e0
(XEN) [2019-10-30 22:07:05.748] cr3: 00040a58d000   cr2: 8880604835a0
(XEN) [2019-10-30 22:07:05.748] fsb: 7f4b8f899bc0   gsb: 88807d48   
gss: 
(XEN) [2019-10-30 22:07:05.748] ds:    es:    fs:    gs:    ss: 
e010   cs: e008
(XEN) [2019-10-30 22:07:05.748] Xen code around  
(iommu_map.c#update_paging_mode+0x1f2/0x3eb):
(XEN) [2019-10-30 22:07:05.748]  3d 3b 7b 22 00 00 75 07 <0f> 0b e9 c2 01 00 00 
48 8d 35 1a ce 13 00 48 8d
(XEN) [2019-10-30 22:07:05.748] Xen stack trace from rsp=830523f9f808:
(XEN) [2019-10-30 22:07:05.748]82e00a6bc6e0 82e00a6bc6e0 
83041fb83000 83041fb83000
(XEN) [2019-10-30 22:07:05.748]83041fb83148 000feff8 
83041fb83150 830523f9f93c
(XEN) [2019-10-30 22:07:05.748]830523f9f8c8 82d080265ded 
000380240580 002482f9
(XEN) [2019-10-30 22:07:05.748]  
 
(XEN) [2019-10-30 22:07:05.748]  
 
(XEN) [2019-10-30 22:07:05.748]83041fb83000 000feff8 
000feff8 002482f9
(XEN) [2019-10-30 22:07:05.748]830523f9f928 82d0802583b6 
000feff8 0001
(XEN) [2019-10-30 22:07:05.748]0003802405da 002482f9 
830523f9f93c 0003
(XEN) [2019-10-30 22:07:05.748]83041fb83000 000feff8 
 
(XEN) [2019-10-30 22:07:05.748]830523f9f960 82d0802586fb 
48834780 0003
(XEN) [2019-10-30 22:07:05.748] 830248834780 
000feff8 830523f9f9f8
(XEN) [2019-10-30 22:07:05.748]82d08034a4a6 8038a845 
 820040024fc0
(XEN) [2019-10-30 22:07:05.748]83041fb83000 002482f9 
 8002484b5367
(XEN) [2019-10-30 22:07:05.748]82004002d5a0 8002484b5367 
 
(XEN) [2019-10-30 22:07:05.748]820040024000  
002482f9 000feff8
(XEN) [2019-10-30 22:07:05.748]0001 830248834780 
830523f9fa50 82d080342e13
(XEN) [2019-10-30 22:07:05.748] 0007 
83041fb83000 0023
(XEN) [2019-10-30 22:07:05.748]002482fa 830248834780 
 002482f9
(XEN) [2019-10-30 22:07:05.748]000feff8 830523f9fac8 
82d080343c52 23f9fa78
(XEN) [2019-10-30 22:07:05.748]0001 002482f9 
000feff8 830523f9fa98
(XEN) [2019-10-30 22:07:05.748] Xen call trace:
(XEN) [2019-10-30 22:07:05.748][] R 
iommu_map.c#update_paging_mode+0x1f2/0x3eb
(XEN) [2019-10-30 22:07:05.748][] F 
amd_iommu_map_page+0x72/0x1c2
(XEN) [2019-10-30 22:07:05.748][] F iommu_map+0x98/0x17e
(XEN) [2019-10-30 22:07:05.748][] F 
iommu_legacy_map+0x28/0x73
(XEN) [2019-10-30 22:07:05.748][] F 
p2m-pt.c#p2m_pt_set_entry+0x4d3/0x844
(XEN) [2019-10-30 22:07:05.748][] F 
p2m_set_entry+0x91/0x128
(X

Re: [Xen-devel] [XEN PATCH for-4.13 v2 0/6] Fix: libxl workaround, multiple connection to single QMP socket

2019-10-30 Thread Sander Eikelenboom
On 30/10/2019 19:06, Anthony PERARD wrote:
> Patch series available in this git branch:
> https://xenbits.xen.org/git-http/people/aperard/xen-unstable.git 
> br.fix-ev_qmp-multi-connect-v2
> 
> Hi,
> 
> QEMU's QMP socket doesn't allow multiple concurrent connections. Also, it
> listen()s on the socket with a `backlog' of only 1. On Linux at least, once
> that backlog is filled, connect() will return EAGAIN if the socket fd is
> non-blocking. libxl may make many concurrent connect() attempts if for
> example a guest is started with several PCI passthrough devices, and a
> connect() failure leads to a failure to start the guest.
> 
> Since we can't change the listen()'s `backlog' that QEMU uses, we need other
> ways to work around the issue. This patch series introduces a lock to acquire
> before attempting to connect() to the QMP socket. Since the lock might be held
> for too long, the series also introduces a way to cancel the acquisition of the
> lock, which means killing the process that tries to get the lock.
> 
> See thread[1] for discussed alternative.
> [1] https://lists.xenproject.org/archives/html/xen-devel/2019-10/msg01815.html
> 
> Cheers,
> 
> Anthony PERARD (6):

Hi Anthony,

Re-tested, especially the pci-pt part, still works for me.
Thanks again (and thanks for providing a git branch)
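
As an aside, for anyone wondering what "killing the process that tries
to get the lock" can look like: a rough sketch of the general pattern
with plain POSIX primitives (made-up lock file path; this is not
libxl's actual implementation):

/*
 * A child blocks in fcntl(F_SETLKW); the parent can SIGKILL it to
 * abandon the acquisition.  Killing the helper both aborts a pending
 * F_SETLKW and, if the lock was already held, releases it again.
 */
#include <fcntl.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if ( pid == 0 )
    {
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open("/tmp/qmp-demo.lock", O_RDWR | O_CREAT, 0600);

        if ( fd < 0 || fcntl(fd, F_SETLKW, &fl) == -1 )
            _exit(1);                     /* failed to get the lock */
        pause();                          /* hold it until killed */
        _exit(0);
    }

    sleep(1);                             /* pretend we waited too long */
    kill(pid, SIGKILL);                   /* cancel the acquisition */
    waitpid(pid, NULL, 0);
    return 0;
}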

--
Sander


Re: [Xen-devel] Xen-unstable: AMD-Vi: update_paging_mode Try to access pdev_list without aquiring pcidevs_lock.

2019-10-31 Thread Sander Eikelenboom
On 31/10/2019 10:18, Jan Beulich wrote:
> On 31.10.2019 09:35, Sander Eikelenboom wrote:
>> Platform is perhaps somewhat specific (older AMD 890FX chipset) and I need
>> the bios workaround:
>> ivrs_ioapic[6]=00:14.0 iommu=on.
> 
> Shouldn't matter here.
> 
>> On the other hand, this has ran like this for quite some time.
>>
>> I have 3 guests (HVM) for which i use PCI passthrough and 
>> for each of those 3 guests I get this message *once* on start of the guest.
>>  One guest has a soundcard passed through,
>>  One guest has a USB2 card passed through,
>>  One guest has a USB3 card passed through.
>>
>> Another observation is that both the soundcard and USB2 card
>> still seem to function despite the message.
> 
> Reality is - this message is benign as long as you don't do PCI
> hot (un)plug.

I don't use any of:
 pci-attach
 pci-detach
 pci-list
 pci-assignable-add
 pci-assignable-remove
 pci-assignable-list

Only shutting down and (re)starting VMs with the devices specified in
the vm cfg file.

>> The USB3 controller goes haywire though (a lot of driver messages in the 
>> guest during init).
> 
> As a consequence I don't think there's a connection between this
> and the observed message.

Ok, although it functions fine (with the same kernel etc.) when reverting to
the commit I referenced below; if so, that would be another issue then.

CC'ed Juergen as release manager so he is aware.

>> I could try to bisect, but that would be somewhere next week before I can 
>> get to that.
>>
>> At present I run a tree whose latest commit is
>> ee7170822f1fc209f33feb47b268bab35541351d,
>> which is stable for me. This predates some of the IOMMU changes and 
>> Anthony's QMP work that had
>> some issues, but that would be the last known real good point for me to 
>> start a bisect from.
> 
> I.e. at that point you didn't observe this message, yet?

With ee7170822f1fc209f33feb47b268bab35541351d I see neither this message nor
the "INVALID_DEV_REQUEST", even with longer uptimes.

--
Sander

> Jan
> 



Re: [Xen-devel] Xen-unstable: AMD-Vi: update_paging_mode Try to access pdev_list without aquiring pcidevs_lock.

2019-10-31 Thread Sander Eikelenboom
On 31/10/2019 11:15, Jan Beulich wrote:
> On 30.10.2019 23:21, Sander Eikelenboom wrote:
>> Call trace seems to be the same in all cases.
>>
>> --
>> Sander
>>
>>
>> (XEN) [2019-10-30 22:07:05.748] AMD-Vi: update_paging_mode Try to access 
>> pdev_list without aquiring pcidevs_lock.
>> (XEN) [2019-10-30 22:07:05.748] [ Xen-4.13.0-rc  x86_64  debug=y   Not 
>> tainted ]
>> (XEN) [2019-10-30 22:07:05.748] CPU:1
>> (XEN) [2019-10-30 22:07:05.748] RIP:e008:[] 
>> iommu_map.c#update_paging_mode+0x1f2/0x3eb
>> (XEN) [2019-10-30 22:07:05.748] RFLAGS: 00010286   CONTEXT: 
>> hypervisor (d0v2)
> 
> I didn't pay attention to this when writing my earlier reply: The
> likely culprit looks to be f89f555827 ("remove late (on-demand)
> construction of IOMMU page tables"). Prior to this I assume IOMMU
> page tables got constructed only after ...

OK, I tested f89f555827 and f89f555827~1, my observations:

with f89f555827~1:
- I'm NOT seeing the aquiring pcidevs_lock message
- the usb3 controller is also working.

with f89f555827:
- I'm now seeing the aquiring pcidevs_lock messages.
- but I'm NOT seeing them just *once* per booting guest, but multiple times.
- the usb3 controller is still working.

with staging:
- Seeing the aquiring pcidevs_lock messages, but only *once* per guest 
boot.
- the usb3 controller goes haywire in the guest.

So you seem to be right about both things:
- f89f555827 is the culprit for the aquiring pcidevs_lock messages.
  Although I get fewer of them with current staging, so some other later
  patch must have had some influence in reducing the amount.

- The usb3 controller malfunctioning seems indeed to be a separate issue 
(which seems unfortunate, 
  because a bisect seems to become even nastier with all the intertwined 
pci-passthrough issues).
  
  Perhaps this one is then related to the only *once*-occurring message: 
  (XEN) [2019-10-31 20:39:30.746] AMD-Vi: INVALID_DEV_REQUEST 0800 
8a00 f8000840 00fd
 
  While in the guest it is endlessly repeating:
  [  231.385566] xhci_hcd 0000:00:05.0: Max number of devices this xHCI 
host supports is 32.
  [  231.407351] usb usb1-port2: couldn't allocate usb_device

  Hopefully this also gives you a hunch as to which commits to look at.

--
Sander

>> (XEN) [2019-10-30 22:07:05.748] Xen call trace:
>> (XEN) [2019-10-30 22:07:05.748][] R 
>> iommu_map.c#update_paging_mode+0x1f2/0x3eb
>> (XEN) [2019-10-30 22:07:05.748][] F 
>> amd_iommu_map_page+0x72/0x1c2
>> (XEN) [2019-10-30 22:07:05.748][] F 
>> iommu_map+0x98/0x17e
>> (XEN) [2019-10-30 22:07:05.748][] F 
>> iommu_legacy_map+0x28/0x73
>> (XEN) [2019-10-30 22:07:05.748][] F 
>> p2m-pt.c#p2m_pt_set_entry+0x4d3/0x844
>> (XEN) [2019-10-30 22:07:05.748][] F 
>> p2m_set_entry+0x91/0x128
>> (XEN) [2019-10-30 22:07:05.748][] F 
>> guest_physmap_add_entry+0x39f/0x5a3
>> (XEN) [2019-10-30 22:07:05.748][] F 
>> guest_physmap_add_page+0x12f/0x138
>> (XEN) [2019-10-30 22:07:05.748][] F 
>> memory.c#populate_physmap+0x2e3/0x505
> 
> ... Dom0 had populated the new guest's physmap.
> 
> Anyway, as odd as it may seem I guess there's little choice
> besides making populate_physmap() (and likely a few others)
> acquire the lock.
> 
> Jan
> 



Re: [Xen-devel] [PATCH 0/3] AMD/IOMMU: re-work mode updating

2019-11-06 Thread Sander Eikelenboom
On 06/11/2019 16:16, Jan Beulich wrote:
> update_paging_mode() in the AMD IOMMU code expects to be invoked with
> the PCI devices lock held. The check occurring only when the mode
> actually needs updating, the violation of this rule by the majority
> of callers did go unnoticed until per-domain IOMMU setup was changed
> to do away with on-demand creation of IOMMU page tables.
> 
> Unfortunately the only half way reasonable fix to this that I could
> come up with requires more re-work than would seem desirable at this
> time of the release process, but addressing the issue seems
> unavoidable to me as its manifestation is a regression from the
> IOMMU page table setup re-work. The change also isn't without risk
> of further regressions - if in patch 2 I've missed a code path that
> would also need to invoke the new hook, then this might mean non-
> working guests (with passed-through devices on AMD hardware).
> 
> 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page
> 2: introduce GFN notification for translated domains
> 3: AMD/IOMMU: use notify_dfn() hook to update paging mode
> 
> Jan
> 

Hi Jan,

I just tested and I don't get the  "pcidevs" message any more.

I assume this only was a fix for that issue, so it's probably expected
that the other issue:
   AMD-Vi: INVALID_DEV_REQUEST 0800 8a00 f8000840 00fd
   and malfunctioning device in one of the guests.
is still around.

--
Sander


Re: [Xen-devel] OSStest priorities

2019-11-07 Thread Sander Eikelenboom
On 07/11/2019 17:44, Jürgen Groß wrote:
> Hi Ian,
> 
> in the Xen community call we agreed to try to speed up OSStest for
> xen-unstable in order to make better progress with the 4.13 release.
> 
> Could you please suspend testing for Xen 4.10 and older (Jan agreed on
> that), and disable the Linux kernel tests which are currently failing
> (including the bisecting)?
> 
> This should free lots of resources in OSStest reducing xen-unstable
> test latencies.
> 
> 
> Juergen
> 
> 

The following tests have quite a long timeout and will always fail
(although last time I manually checked, Windows 10 actually installed fine
in an HVM on my machine).

 test-amd64-amd64-xl-qemuu-win10-i386 10 windows-install fail never pass
 test-amd64-i386-xl-qemut-win10-i386 10 windows-install fail never pass
 test-amd64-i386-xl-qemuu-win10-i386 10 windows-install fail never pass
 test-amd64-amd64-xl-qemut-win10-i386 10 windows-install fail never pass

perhaps suspending these in all trees until the underlying install image is
fixed or replaced would also free some more resources, also after the release.

--
Sander


Re: [Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history

2019-11-08 Thread Sander Eikelenboom
On 08/11/2019 19:49, Ian Jackson wrote:
> Earlier this week we discovered that sg-report-host-history was running
> extremely slowly.  We applied an emergency fix 0fa72b13f5af
>   sg-report-host-history: Reduce limit from 2000 to 200
> 
> The main problem is that sg-report-host-history runs once for each
> flight, and must generate a relevant history view of the recent
> history for each host - including much history that is already in the
> old version of the html file.
> 
> The slow part is asking the database about information about each job,
> including its final step, allocation step, etc.  (The main query which
> digs out relevant jobs is also rather time consuming; it runs all in
> one go and takes only a minute or two.)
> 
> In this series we introduce a mechanism which caches much of the
> historical analysis.
> 
> It is not straightforward to reuse old html data as-is because we
> would have to do a merge sort with the new data and that would involve
> rewriting the alternating background colour (!)
> 
> So instead, we stuff the information we got from the database into
> comments in the HTML, which we can then scan on future runs.

Not meant to bikeshed, so just for consideration:
- Have you considered (inline) CSS for the background colouring, or does
  it have to be HTML only?
- And for caching, perhaps a materialized view with aggregated data, only
  refreshed at a more convenient time, could help at the database
  level?

--
Sander


Re: [Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history

2019-11-11 Thread Sander Eikelenboom
On 11/11/2019 12:00, Ian Jackson wrote:
> Sander Eikelenboom writes ("Re: [Xen-devel] [OSSTEST PATCH 00/13] Speed up 
> and restore host history"):
>> Not meant to bikeshed, so just for consideration:
> 
> Suggestions are very welcome.  Be careful, I'm still looking for a
> co-maintainer :-).
/me is ducking under the table ;)
Seems to be quite a lot of intricate Perl; I never was a prince of Perl,
and that hasn't got any better by not using it actively these past years.

>> - Have you considered (inline) CSS for the background colouring, or does
>>   it have to be HTML only?
> 
> There is no particular reason why it shouldn't be CSS.  Is there a
> reason why doing it in html causes problems for you ?

Not really, but especially applying style to alternating rows is now
quite simple with pseudo-classes:

 tr:nth-child(even){
   background-color: grey;
 }

 tr:nth-child(odd){
   background-color: white;
 }

You could stuff this in a <style> ... </style> block,
so you don't have to repeat it on every row for the common case.
For any special cases you could override it based on a class.
I happen to find it one of the most useful CSS features.

https://www.w3.org/wiki/CSS/Selectors/pseudo-classes/:nth-child

> The background colours for the cells are made with
>   report_altcolour
>   report_altchangecolour
> in Osstest/Executive.pm.
> 
> report_altcolour returns something that can be put into an element
> open tag, given a definite indication of whether the colour should be
> paler or darker.
> 
> report_altchangecolour is used to produce background colours which
> change when the value in the cell changes.
> 
> I think it would be easy to replace bgcolour= with some appropriate
> style= and some CSS.  Patches - even very rough ones - welcome.
> 
>> - And for caching perhaps a materialized view with aggregated data only
>>   refreshed at a more convient time could perhaps help at the database
>>   level ?
> 
> Maybe, but currently the archaeology algorithm is not expressed
> entirely in SQL so it couldn't be a materialised view.  And converting
> it to SQL would be annoying because SQL is a rather poor programming
> language.

It is a poor programming language, but it is very good at retrieving and
modifying data. Sometimes it takes some effort to wrap your head around
the way you have to specify what data you want and in what form, without
being too explicit about how it is supposed to be retrieved.

> It might be possible to, instead, have table(s) containing archaeology
> results.  I hadn't really properly considered that possibility.  That
> might well have been a better approach.  So thank you for your helpful
> prompt.  I will definitely bear this in mind for the future.

If I remember correctly Postgres is being used; perhaps there is still
some relatively low-hanging fruit when analyzing the performance of the
queries you run against the actual data.

> I'm not sure I feel like reengineering this particular series at this
> time, though.  One reason (apart from that I've done it like this now)
> is that the current approach has the advantage that it doesn't need a
> DB schema change.  I have a system for doing schema changes but they
> add risk and I don't want to do that in the Xen release freeze.

I understand, and I concur that that is probably the best approach at the moment.

I will take a look at the code sometime this week or next and see if I
can get some familiarity with it and perhaps end up with some contributions.

--
Sander

> Regards,
> Ian.


Re: [Xen-devel] Xen-unstable: AMD-Vi: update_paging_mode Try to access pdev_list without aquiring pcidevs_lock.

2019-11-11 Thread Sander Eikelenboom
On 11/11/2019 16:35, Jan Beulich wrote:
> On 31.10.2019 21:48, Sander Eikelenboom wrote:
>> - The usb3 controller malfunctioning seems indeed to be a separate issue 
>> (which seems unfortunate, 
>>   because a bisect seems to become even nastier with all the intertwined 
>> pci-passthrough issues).
>>   
>>   Perhaps this one is then related to the only *once* occuring message: 
>>   (XEN) [2019-10-31 20:39:30.746] AMD-Vi: INVALID_DEV_REQUEST 
>> 0800 8a00 f8000840 00fd
>>  
>>   While in the guest it is endlessly repeating:
>>   [  231.385566] xhci_hcd :00:05.0: Max number of devices this 
>> xHCI host supports is 32.
>>   [  231.407351] usb usb1-port2: couldn't allocate usb_device
> 
> I'm uncertain whether there's a correlation: The device the Xen
> message is about is 08:00.0; please let us know what kind of device
> that is (the hypervisor log alone doesn't allow me to guess).
> 
> The specific type is described as "Posted write to the Interrupt/EOI
> range from an I/O device that has IntCtl=00b in the device’s DTE."
> This would make me guess 1b00c16bdf ("AMD/IOMMU: pre-fill all DTEs
> right after table allocation") is the culprit here, and I may need
> to hand you a debugging patch to gain some insight. But let me first
> take a look at sufficiently verbose lspci output from that system.
> 
> Jan
> 

Hi Jan,

When supplying "pci=nomsi" to the guest kernel, the device works fine,
and I don't get the "INVALID_DEV_REQUEST".

After reverting 1b00c16bdf, the device works fine 
and I don't get the INVALID_DEV_REQUEST, 

Below is the output of lspci -vvvknn from dom0 for 08:00.0:
- just after boot (device owned by pciback / dom0, not active yet)
- after the guests have started (owned by guest with a working device)

So it is enabling MSI-X interrupts, which is indeed different from the other 
devices I pass through which
seem to use legacy interrupts.
This also shows in the guest with a working device in /proc/interrupts:
 98:  17846  0  0  0  xen-pirq-msi-x 
xhci_hcd
 99:  0  0  0  0  xen-pirq-msi-x 
xhci_hcd
100:  0  0  0  0  xen-pirq-msi-x 
xhci_hcd
101:  0  0  0  0  xen-pirq-msi-x 
xhci_hcd
102:  0  0  0  0  xen-pirq-msi-x 
xhci_hcd

I forgot to take a snapshot of /proc/interrupts in the guest in the 
malfunctioning state.

--
Sander


just after boot (device owned by pciback / dom0, not active yet):
08:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host 
Controller [1033:0194] (rev 03) (prog-if 30 [XHCI])
Subsystem: ASUSTeK Computer Inc. P8P67 Deluxe Motherboard [1043:8413]
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- ...
[remainder of the lspci output lost in the list archive]

Re: [Xen-devel] Xen-unstable: AMD-Vi: update_paging_mode Try to access pdev_list without aquiring pcidevs_lock.

2019-11-12 Thread Sander Eikelenboom
On 12/11/2019 12:05, Jan Beulich wrote:
> On 11.11.2019 22:38, Sander Eikelenboom wrote:
>> When supplying "pci=nomsi" to the guest kernel, the device works fine,
>> and I don't get the "INVALID_DEV_REQUEST".
>>
>> After reverting 1b00c16bdf, the device works fine 
>> and I don't get the INVALID_DEV_REQUEST, 
> 
> Could you give the patch below a try? That commit took care of only
> securing ourselves, but not of relaxing things again when a device
> gets handed to a guest for actual use.
> 
> Jan

Hi Jan,

CC'ed Juergen, as he seems to have been dropped off the CC list at some point.

Just tested this patch: 
the device works fine and I don't get the INVALID_DEV_REQUEST.

This was the last remaining issue around pci passthrough I encountered;
with all patches applied (yours and Anthony's), pci passthrough for me
seems to work again as I was used to.

Thanks again for fixing the issues and providing the right educated guesses!

--
Sander


> AMD/IOMMU: restore DTE fields in amd_iommu_setup_domain_device()
> 
> Commit 1b00c16bdf ("AMD/IOMMU: pre-fill all DTEs right after table
> allocation") moved ourselves into a more secure default state, but
> didn't take sufficient care to also undo the effects when handing a
> previously disabled device back to a(nother) domain. Put the fields
> that may have been changed elsewhere back to their intended values
> (some fields amd_iommu_disable_domain_device() touches don't
> currently get written anywhere else, and hence don't need modifying
> here).
> 
> Reported-by: Sander Eikelenboom 
> Signed-off-by: Jan Beulich 
> 
> --- unstable.orig/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ unstable/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -114,11 +114,21 @@ static void amd_iommu_setup_domain_devic
>  
>      if ( !dte->v || !dte->tv )
>      {
> +        const struct ivrs_mappings *ivrs_dev;
> +
>          /* bind DTE to domain page-tables */
>          amd_iommu_set_root_page_table(
>              dte, page_to_maddr(hd->arch.root_table), domain->domain_id,
>              hd->arch.paging_mode, valid);
>  
> +        /* Undo what amd_iommu_disable_domain_device() may have done. */
> +        ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
> +        if ( dte->it_root )
> +            dte->int_ctl = IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED;
> +        dte->iv = iommu_intremap;
> +        dte->ex = ivrs_dev->dte_allow_exclusion;
> +        dte->sys_mgt = MASK_EXTR(ivrs_dev->device_flags, ACPI_IVHD_SYSTEM_MGMT);
> +
>          if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
>               iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) )
>              dte->i = ats_enabled;
> 



Re: [Xen-devel] [XEN PATCH for-4.13 v2 0/6] Fix: libxl workaround, multiple connection to single QMP socket

2019-11-15 Thread Sander Eikelenboom
On 08/11/2019 07:06, Jürgen Groß wrote:
> On 30.10.19 19:06, Anthony PERARD wrote:
>> Patch series available in this git branch:
>> https://xenbits.xen.org/git-http/people/aperard/xen-unstable.git 
>> br.fix-ev_qmp-multi-connect-v2
>>
>> Hi,
>>
>> QEMU's QMP socket doesn't allow multiple concurrent connections. Also, it
>> listen()s on the socket with a `backlog' of only 1. On Linux at least, once
>> that backlog is filled, connect() will return EAGAIN if the socket fd is
>> non-blocking. libxl may make many concurrent connect() attempts if for
>> example a guest is started with several PCI passthrough devices, and a
>> connect() failure leads to a failure to start the guest.
>>
>> Since we can't change the listen()'s `backlog' that QEMU uses, we need other
>> ways to work around the issue. This patch series introduces a lock to acquire
>> before attempting to connect() to the QMP socket. Since the lock might be
>> held for too long, the series also introduces a way to cancel the acquisition
>> of the lock, which means killing the process that tries to get the lock.
>>
>> See thread[1] for discussed alternative.
>> [1] 
>> https://lists.xenproject.org/archives/html/xen-devel/2019-10/msg01815.html
>>
>> Cheers,
>>
>> Anthony PERARD (6):
>>libxl: Introduce libxl__ev_child_kill_deregister
>>libxl: Move libxl__ev_devlock declaration
>>libxl: Rename ev_devlock to ev_slowlock
>>libxl: Introduce libxl__ev_slowlock_dispose
>>libxl: libxl__ev_qmp_send now takes an egc
>>libxl_qmp: Have a lock for QMP socket access
>>
>>   tools/libxl/libxl_disk.c|  16 ++--
>>   tools/libxl/libxl_dm.c  |   8 +-
>>   tools/libxl/libxl_dom_save.c|   2 +-
>>   tools/libxl/libxl_dom_suspend.c |   2 +-
>>   tools/libxl/libxl_domain.c  |  18 ++---
>>   tools/libxl/libxl_event.c   |   6 +-
>>   tools/libxl/libxl_fork.c|  48 
>>   tools/libxl/libxl_internal.c|  41 +++---
>>   tools/libxl/libxl_internal.h| 130 +++-
>>   tools/libxl/libxl_pci.c |   8 +-
>>   tools/libxl/libxl_qmp.c | 119 -
>>   tools/libxl/libxl_usb.c |  28 ---
>>   12 files changed, 301 insertions(+), 125 deletions(-)
>>
> 
> For the series:
> 
> Release-acked-by: Juergen Gross 
> 
> 
> Juergen

Hi Juergen,

A lot of more recent patches have been committed, but these don't seem to
have been. I was wondering if they fell through the cracks.

--
Sander




Re: [Xen-devel] [XEN PATCH for-4.13] libxl_pci: Don't hold QMP connection while waiting

2019-11-15 Thread Sander Eikelenboom
On 08/11/2019 07:07, Jürgen Groß wrote:
> On 31.10.19 13:17, Anthony PERARD wrote:
>> After sending the 'device_del' command for a PCI passthrough device,
>> we wait until QEMU has effectively deleted the device; this involves
>> executing more QMP commands. While waiting, libxl holds the connection.
>>
>> It isn't necessary to hold the connection and it prevents others from
>> making progress, so this patch releases the QMP connection.
>>
>> For background:
>>  e.g., when a guest is created with several pci passthrough devices
>>  attached, on `xl destroy` all the devices need to be detached, and
>>  this is usually what happens:
>>  - 'device_del' called for the 1st pci device
>>  - 'query-pci' checking if pci still there, it is
>>  - wait 1s
>>  - 'query-pci' checking again, and it's gone
>>  -> now the same can be done for the second pci device, so
>>  plenty of waiting on others when pci detach can be done in
>>  parallel.
>>
>>  On shutdown, libxl usually keeps waiting because QEMU never
>>  releases the device, because the guest kernel never responds to QEMU's
>>  unplug queries. So detaching of the 1st device waits until a
>>  timeout stops it, and since the same timeout is setup at the same
>>  time for the other devices to detach, the 'device_del' command is
>>  never sent for those.
>>
>> Signed-off-by: Anthony PERARD 
> 
> Release-acked-by: Juergen Gross 
> 
> 
> Juergen


Hi Juergen,

A lot of more recent patches have been committed, but this one doesn't
seem to have been.
I was wondering if it fell through the cracks.

--
Sander


Re: [Xen-devel] [linux-linus bisection] complete test-amd64-amd64-xl-pvhv2-amd

2018-04-03 Thread Sander Eikelenboom
On 03/04/18 12:29, Juergen Gross wrote:
> On 03/04/18 12:19, osstest service owner wrote:
>> branch xen-unstable
>> xenbranch xen-unstable
>> job test-amd64-amd64-xl-pvhv2-amd
>> testid guest-start
>>
>> Tree: linux 
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
>> Tree: linuxfirmware git://xenbits.xen.org/osstest/linux-firmware.git
>> Tree: qemu git://xenbits.xen.org/qemu-xen-traditional.git
>> Tree: qemuu git://xenbits.xen.org/qemu-xen.git
>> Tree: xen git://xenbits.xen.org/xen.git
>>
>> *** Found and reproduced problem changeset ***
>>
>>   Bug is in tree:  xen git://xenbits.xen.org/xen.git
>>   Bug introduced:  4a5733771e6f33918eba07b584e564a67ac1
>>   Bug not present: 1c2e0f9e4f263714db917eb54f8d1c2d1463ed4c
>>   Last fail repro: http://logs.test-lab.xenproject.org/osstest/logs/118498/
>>
>>
>>   commit 4a5733771e6f33918eba07b584e564a67ac1
>>   Author: Juergen Gross 
>>   Date:   Fri Dec 1 15:14:07 2017 +0100
>>   
>>   libxl: put RSDP for PVH guest near 4GB
>>   
>>   Instead of locating the RSDP table below 1MB put it just below 4GB
>>   like the rest of the ACPI tables in case of PVH guests. This will
>>   avoid punching more holes than necessary into the memory map.
>>   
>>   Signed-off-by: Juergen Gross 
>>   Acked-by: Wei Liu 
>>   Reviewed-by: Roger Pau Monné 
> 
> The corresponding Linux kernel patch just made it upstream.
> 
> 
> Juergen

Hi Juergen,

Are those kernel patches heading for linux-stable as well ?

I ask this because it would be nice to be able to use PVH on the Xen 4.11
release with a distro kernel
(4.9 or 4.14 stable, for instance, for Debian).
PVH worked fine with xen-4.11-to-be up until this commit, so the kernel patches
fix this (non-kernel) regression.

--
Sander

 



Re: [Xen-devel] [linux-linus bisection] complete test-amd64-amd64-xl-pvhv2-amd

2018-04-05 Thread Sander Eikelenboom
On 05/04/18 09:11, Juergen Gross wrote:
> On 03/04/18 18:55, Sander Eikelenboom wrote:
>> On 03/04/18 12:29, Juergen Gross wrote:
>>> On 03/04/18 12:19, osstest service owner wrote:
>>>> branch xen-unstable
>>>> xenbranch xen-unstable
>>>> job test-amd64-amd64-xl-pvhv2-amd
>>>> testid guest-start
>>>>
>>>> Tree: linux 
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
>>>> Tree: linuxfirmware git://xenbits.xen.org/osstest/linux-firmware.git
>>>> Tree: qemu git://xenbits.xen.org/qemu-xen-traditional.git
>>>> Tree: qemuu git://xenbits.xen.org/qemu-xen.git
>>>> Tree: xen git://xenbits.xen.org/xen.git
>>>>
>>>> *** Found and reproduced problem changeset ***
>>>>
>>>>   Bug is in tree:  xen git://xenbits.xen.org/xen.git
>>>>   Bug introduced:  4a5733771e6f33918eba07b584e564a67ac1
>>>>   Bug not present: 1c2e0f9e4f263714db917eb54f8d1c2d1463ed4c
>>>>   Last fail repro: http://logs.test-lab.xenproject.org/osstest/logs/118498/
>>>>
>>>>
>>>>   commit 4a5733771e6f33918eba07b584e564a67ac1
>>>>   Author: Juergen Gross 
>>>>   Date:   Fri Dec 1 15:14:07 2017 +0100
>>>>   
>>>>   libxl: put RSDP for PVH guest near 4GB
>>>>   
>>>>   Instead of locating the RSDP table below 1MB put it just below 4GB
>>>>   like the rest of the ACPI tables in case of PVH guests. This will
>>>>   avoid punching more holes than necessary into the memory map.
>>>>   
>>>>   Signed-off-by: Juergen Gross 
>>>>   Acked-by: Wei Liu 
>>>>   Reviewed-by: Roger Pau Monné 
>>>
>>> The corresponding Linux kernel patch just made it upstream.
>>>
>>>
>>> Juergen
>>
>> Hi Juergen,
>>
>> Are those kernel patches heading for linux-stable as well ?
> 
> Greg refuses to take them, sorry.

I noticed the conversation; it's unfortunate, but reality. Thanks for the
effort anyhow.
(Perhaps a lesson for the future: if one intends to get a patch into stable
as well,
make the commit message indicate as clearly as possible that it is a kernel
regression,
to hopefully prevent such a discussion from happening in the first place.)

Perhaps it is something for the Xen 4.11 release notes / wiki
that you require a 4.17+ (or distro-patched) kernel for PVH?

--
Sander

>> I ask this, because it would be nice to be able to use PVH on Xen 4.11 
>> release with a distro kernel
>> (4.9 or 4.14 stable for instance for Debian).
> 
> I think you have to request the distributor to take them (I'll add them
> to the SUSE kernel for SLE15 / openSUSE 15).
> 
>> PVH worked fine with xen-4.11-to-be up until this commit, so the kernel 
>> patches fix this (non-kernel) regression.
> 
> It _is_ a kernel regression, as the kernel wasn't using the PVH
> interface correctly.
> 
> 
> Juergen
> 



[Xen-devel] tools/libacpi printf output to logging instead of console/stdout ?

2018-03-07 Thread Sander Eikelenboom
L.S.,

When starting a guest with the 'xl create' command (non-verbose) I get
this extra output on PVH guest types only:

S3 disabled
S4 disabled
CONV disabled


It seems libacpi/* only contains normal printf's, so for the other guest
types I probably just never triggered one of them.

Shouldn't these printf's go to logging instead of console/stdout ?

--
Sander


Re: [Xen-devel] tools/libacpi printf output to logging instead of console/stdout ?

2018-03-08 Thread Sander Eikelenboom
On 08/03/18 10:09, Jan Beulich wrote:
> On 07.03.18 at 21:52, Sander Eikelenboom wrote:
>> When starting a guest with the 'xl create' command (non-verbose) i get
>> this extra output on PVH guest types only:
>>
>> S3 disabled
>> S4 disabled
>> CONV disabled
>>
>>
>> It seems libacpi/* only contains normal printf's, so for the other guest
>> types i probably just never triggered one of them.
>>
>> Shouldn't these printf's go to logging instead of console/stdout ?
> 
> I think it's the responsibility of the executable linking to that library
> to suitably set up / redirect stdout. There not being anything like
> "stdlog", I'm also not sure where you would think libacpi should
> send them (if it was to control this itself) - surely not stderr.

The extra output seems only informational, not even a warning, so stderr
seems wrong indeed.

With my novice C skills it seems that:
The difference between HVM and PVH is that
in the HVM case the code of libacpi is linked into and gets run by
the hvmloader binary, which libxl captures all the output from
and only prints on verbose invocation of the xl tool and/or in the xen log
(on debug builds).

In the PVH case the ACPI tables are generated by libxl
itself (libxl_x86_acpi.c: libxl__dom_load_acpi()), which is linked to libacpi
directly,
hence the output can't be captured separately since it is not a separate binary.

It would probably be hard to align the logging with those two different ways
of using libacpi?
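
To illustrate why the in-process case is awkward: capturing the
printf()s without a separate binary would mean temporarily redirecting
stdout around the library call. A rough sketch with plain POSIX fds;
not something libxl actually does, and noisy_library_call() is just a
made-up stand-in for libacpi:

/*
 * Point stdout at a pipe for the duration of the call, then restore
 * it and read back whatever the library printed.
 */
#include <stdio.h>
#include <unistd.h>

static void noisy_library_call(void)
{
    printf("S3 disabled\n");              /* like libacpi's printf()s */
}

int main(void)
{
    char buf[256];
    int pipefd[2], saved;
    ssize_t n;

    if ( pipe(pipefd) )
        return 1;

    fflush(stdout);
    saved = dup(STDOUT_FILENO);           /* remember the real stdout */
    dup2(pipefd[1], STDOUT_FILENO);       /* divert stdout to the pipe */

    noisy_library_call();

    fflush(stdout);
    dup2(saved, STDOUT_FILENO);           /* restore stdout */
    close(saved);
    close(pipefd[1]);

    n = read(pipefd[0], buf, sizeof(buf) - 1);
    close(pipefd[0]);
    if ( n > 0 )
    {
        buf[n] = '\0';
        fprintf(stderr, "captured: %s", buf);   /* hand off to logging */
    }
    return 0;
}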

--
Sander
 
> Jan
> 



[Xen-devel] Regression with commit "x86/pv: Drop int80_bounce from struct pv_vcpu" f75b1a5247b3b311d3aa50de4c0e5f2d68085cb1

2018-03-10 Thread Sander Eikelenboom
Hi Andrew,

It seems commit "x86/pv: Drop int80_bounce from struct pv_vcpu" 
(f75b1a5247b3b311d3aa50de4c0e5f2d68085cb1) causes an issue on my machine, 
an AMD phenom X6.

When trying to install a new kernel package, which runs the Debian
update-initramfs tool, with xen-unstable (which happened to be at commit
c9bd8a73656d7435b1055ee8825823aee995993e as the last commit) the tool stalls
and I get this kernel splat:

[  284.910674] BUG: unable to handle kernel NULL pointer dereference at 

[  284.919696] IP:   (null)
[  284.928315] PGD 0 P4D 0 
[  284.943343] Oops: 0010 [#1] SMP NOPTI
[  284.957008] Modules linked in:
[  284.965521] CPU: 5 PID: 24729 Comm: ld-linux.so.2 Not tainted 
4.16.0-rc4-20180305-linus-pvhpatches-doflr+ #1
[  284.974154] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 
09/13/2010
[  284.983198] RIP: e030:  (null)
[  284.992006] RSP: e02b:c90001497ed8 EFLAGS: 00010286
[  285.000612] RAX:  RBX: 880074c64500 RCX: 82f8d1c0
[  285.009122] RDX: 82f8d1c0 RSI: 20020002 RDI: 82f8d1c0
[  285.017598] RBP: 880074c64b7c R08:  R09: 
[  285.025999] R10:  R11:  R12: 82f8d1c0
[  285.034400] R13:  R14:  R15: 880074c64b50
[  285.042718] FS:  7f919fe2eb40() GS:88007d14() 
knlGS:
[  285.051001] CS:  e033 DS: 002b ES: 002b CR0: 80050033
[  285.059458] CR2:  CR3: 02824000 CR4: 0660
[  285.067813] Call Trace:
[  285.075947]  ? task_work_run+0x85/0xa0
[  285.084025]  ? exit_to_usermode_loop+0x72/0x80
[  285.091980]  ? do_int80_syscall_32+0xfe/0x120
[  285.099896]  ? entry_INT80_compat+0x7f/0x90
[  285.107688]  ? fpu__drop+0x23/0x40
[  285.115362] Code:  Bad RIP value.
[  285.123072] RIP:   (null) RSP: c90001497ed8
[  285.130714] CR2: 
[  285.138219] ---[ end trace 4d3317497f4ba022 ]---
[  285.145671] Fixing recursive fault but reboot is needed!

After updating xen-unstable to the latest available commit
185413355fe331cbc926d48568838227234c9a20,
the tool doesn't stall anymore but I still get a kernel splat:

[  198.594638] [ cut here ]
[  198.594641] Invalid address limit on user-mode return
[  198.594651] WARNING: CPU: 1 PID: 75 at ./include/linux/syscalls.h:236 
do_int80_syscall_32+0xe5/0x120
[  198.594652] Modules linked in:
[  198.594655] CPU: 1 PID: 75 Comm: kworker/1:1 Not tainted 
4.16.0-rc4-20180305-linus-pvhpatches-doflr+ #1
[  198.594656] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 
09/13/2010
[  198.594658] Workqueue: events free_work
[  198.594660] RIP: e030:do_int80_syscall_32+0xe5/0x120
[  198.594661] RSP: e02b:c9b8ff40 EFLAGS: 00010086
[  198.594662] RAX: 0029 RBX: c9b8ff58 RCX: 82868e38
[  198.594663] RDX: 0001 RSI: 0001 RDI: 0001
[  198.594664] RBP: 880078623980 R08: 0dfa R09: 063b
[  198.594664] R10:  R11: 063b R12: 
[  198.594665] R13:  R14:  R15: 
[  198.594672] FS:  7fa252372b40() GS:88007d04() 
knlGS:
[  198.594673] CS:  e033 DS:  ES:  CR0: 80050033
[  198.594674] CR2: f7f303e4 CR3: 02824000 CR4: 0660
[  198.594676] Call Trace:
[  198.594683]  entry_INT80_compat+0x7f/0x90
[  198.594685]  ? vunmap_page_range+0x2a0/0x340
[  198.594686] Code: 03 7f 48 8b 75 00 f7 c6 0e 38 00 00 75 2e 83 65 08 f9 5b 
5d c3 e8 0c fb ff ff e9 53 ff ff ff 48 c7 c7 58 35 57 82 e8 ab 3e 0c 00 <0f> 0b 
bf 09 00 00 00 48 89 ee e8 8c 00 0d 00 eb b8 48 89 df e8 
[  198.594706] ---[ end trace 90bcd2147bc825ef ]---

After reverting commit f75b1a5247b3b311d3aa50de4c0e5f2d68085cb1 the issue is 
gone.

--
Sander


Re: [Xen-devel] [PATCH v2] libxl: put RSDP for PVH guest near 4GB

2018-03-12 Thread Sander Eikelenboom
On 19/02/18 22:13, Sander Eikelenboom wrote:
> On 19/02/18 11:16, Juergen Gross wrote:
>> On 19/02/18 10:47, Sander Eikelenboom wrote:
>>> On 24/01/18 16:26, George Dunlap wrote:
>>>> On Wed, Jan 24, 2018 at 3:20 PM, Juergen Gross  wrote:
>>>>> On 24/01/18 16:07, George Dunlap wrote:
>>>>>> On Wed, Jan 24, 2018 at 2:10 PM, Boris Ostrovsky
>>>>>>  wrote:
>>>>>>> On 01/24/2018 07:06 AM, Juergen Gross wrote:
>>>>>>>> On 24/01/18 11:54, Roger Pau Monné wrote:
>>>>>>>>> On Wed, Jan 24, 2018 at 10:42:39AM +, George Dunlap wrote:
>>>>>>>>>> On Wed, Jan 24, 2018 at 2:41 AM, Boris Ostrovsky
>>>>>>>>>>  wrote:
>>>>>>>>>>> On 01/18/2018 05:33 AM, Wei Liu wrote:
>>>>>>>>>>>> On Thu, Jan 18, 2018 at 11:31:32AM +0100, Juergen Gross wrote:
>>>>>>>>>>>>> Wei,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 01/12/17 15:14, Juergen Gross wrote:
>>>>>>>>>>>>>> Instead of locating the RSDP table below 1MB put it just below 
>>>>>>>>>>>>>> 4GB
>>>>>>>>>>>>>> like the rest of the ACPI tables in case of PVH guests. This will
>>>>>>>>>>>>>> avoid punching more holes than necessary into the memory map.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Juergen Gross 
>>>>>>>>>>>>>> Acked-by: Wei Liu 
>>>>>>>>>>>>> Mind applying this one?
>>>>>>>>>>>> Don't worry, it is in my queue.
>>>>>>>>>>>>
>>>>>>>>>>>> Will come to this and other patches I accumulated soon.
>>>>>>>>>>>>
>>>>>>>>>>>> Wei.
>>>>>>>>>>> This requires kernel changes, doesn't it?
>>>>>>>>>>>
>>>>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2017-12/msg00714.html
>>>>>>>>>>>
>>>>>>>>>>> And this series apparently never made it to the tree.
>>>>>>>>>>>
>>>>>>>>>>> PVH guests are broken now on staging.
>>>>>>>>>> And the Linux side of PVH is officially supported now, right?
>>>>>>>
>>>>>>>
>>>>>>> AFAIK PVH is still considered a tech preview --- Linux or Xen.
>>>>>>
>>>>>> From SUPPORT.md:
>>>>>>
>>>>>> ### x86/PVH guest
>>>>>>
>>>>>> Status: Supported
>>>>>>
>>>>>> I was under the impression that PVH guest in Linux was complete and
>>>>>> stable as of Linux 4.11.  If that's not true it should have been
>>>>>> brought up during the 4.10 development cycle, where we declared PVH
>>>>>> domUs as "supported".
>>>>>
>>>>> So what is the problem here?
>>>>>
>>>>> - current Linux can't be booted as PVH guest with xen-unstable due to
>>>>>   a bug in Linux, patches for Linux are being worked on
>>>>> - booting Linux as PVH guest with xen 4.10 is working
>>>>
>>>> I was responding to Boris's claim that PVH is considered tech preview.
>>>> I can't say anything one way or the other about PVH in Linux, but PVH
>>>> in Xen is definitely now considered supported.
>>>>
>>>> My subsequent response to Roger ("FWIW I can buy this argument") was
>>>> meant to indicate I didn't have any more objection to the approach you
>>>> guys were planning on taking.
>>>>
>>>>  -George
>>>
>>> L.S.,
>>>
>>> Seems I lost track, is there any progress on this issue ?
>>> (doesn't seem a fix has landed in 4.16-rc2 yet).
>>
>> Just sent a new patch series.
> 
> Just tested and it works fine here.

Hi Juergen,

I don't know by which tree those patches should arrive at Linus,
so i can't check if they fell through the cracks somewhere, but 4.16-rc5
hasn't got them yet.

--
Sander


> 
> --
> Sander
> 
>>
>> Juergen
>>
> 



Re: [Xen-devel] Regression with commit "x86/pv: Drop int80_bounce from struct pv_vcpu" f75b1a5247b3b311d3aa50de4c0e5f2d68085cb1

2018-03-12 Thread Sander Eikelenboom
On 12/03/18 20:05, Andrew Cooper wrote:
> On 10/03/18 16:27, Andrew Cooper wrote:
>> On 10/03/2018 16:14, Sander Eikelenboom wrote:
>>> Hi Andrew,
>>>
>>> It seems commit "x86/pv: Drop int80_bounce from struct pv_vcpu" 
>>> (f75b1a5247b3b311d3aa50de4c0e5f2d68085cb1) causes an issue on my machine, 
>>> an AMD phenom X6.
>>>
>>> When trying to installing a new kernel package which runs the Debian
>>> update-initramfs tools with xen-unstable which happened to be at commit 
>>> c9bd8a73656d7435b1055ee8825823aee995993e as last commit the tool stalls
>>> and i get this kernel splat:
>>>
>>> [  284.910674] BUG: unable to handle kernel NULL pointer dereference at 
>>> 
>>> [  284.919696] IP:   (null)
>>> [  284.928315] PGD 0 P4D 0 
>>> [  284.943343] Oops: 0010 [#1] SMP NOPTI
>>> [  284.957008] Modules linked in:
>>> [  284.965521] CPU: 5 PID: 24729 Comm: ld-linux.so.2 Not tainted 
>>> 4.16.0-rc4-20180305-linus-pvhpatches-doflr+ #1
>>> [  284.974154] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
>>> V1.8B1 09/13/2010
>>> [  284.983198] RIP: e030:  (null)
>>> [  284.992006] RSP: e02b:c90001497ed8 EFLAGS: 00010286
>>> [  285.000612] RAX:  RBX: 880074c64500 RCX: 
>>> 82f8d1c0
>>> [  285.009122] RDX: 82f8d1c0 RSI: 20020002 RDI: 
>>> 82f8d1c0
>>> [  285.017598] RBP: 880074c64b7c R08:  R09: 
>>> 
>>> [  285.025999] R10:  R11:  R12: 
>>> 82f8d1c0
>>> [  285.034400] R13:  R14:  R15: 
>>> 880074c64b50
>>> [  285.042718] FS:  7f919fe2eb40() GS:88007d14() 
>>> knlGS:
>>> [  285.051001] CS:  e033 DS: 002b ES: 002b CR0: 80050033
>>> [  285.059458] CR2:  CR3: 02824000 CR4: 
>>> 0660
>>> [  285.067813] Call Trace:
>>> [  285.075947]  ? task_work_run+0x85/0xa0
>>> [  285.084025]  ? exit_to_usermode_loop+0x72/0x80
>>> [  285.091980]  ? do_int80_syscall_32+0xfe/0x120
>>> [  285.099896]  ? entry_INT80_compat+0x7f/0x90
>>> [  285.107688]  ? fpu__drop+0x23/0x40
>>> [  285.115362] Code:  Bad RIP value.
>>> [  285.123072] RIP:   (null) RSP: c90001497ed8
>>> [  285.130714] CR2: 
>>> [  285.138219] ---[ end trace 4d3317497f4ba022 ]---
>>> [  285.145671] Fixing recursive fault but reboot is needed!
>>>
>>> After updating xen-unstable to the latest available commit 
>>> 185413355fe331cbc926d48568838227234c9a20,
>>> the tool doesn't stall anymore but i still get a kernel splat:
>>>
>>> [  198.594638] [ cut here ]
>>> [  198.594641] Invalid address limit on user-mode return
>>> [  198.594651] WARNING: CPU: 1 PID: 75 at ./include/linux/syscalls.h:236 
>>> do_int80_syscall_32+0xe5/0x120
>>> [  198.594652] Modules linked in:
>>> [  198.594655] CPU: 1 PID: 75 Comm: kworker/1:1 Not tainted 
>>> 4.16.0-rc4-20180305-linus-pvhpatches-doflr+ #1
>>> [  198.594656] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
>>> V1.8B1 09/13/2010
>>> [  198.594658] Workqueue: events free_work
>>> [  198.594660] RIP: e030:do_int80_syscall_32+0xe5/0x120
>>> [  198.594661] RSP: e02b:c9b8ff40 EFLAGS: 00010086
>>> [  198.594662] RAX: 0029 RBX: c9b8ff58 RCX: 
>>> 82868e38
>>> [  198.594663] RDX: 0001 RSI: 0001 RDI: 
>>> 0001
>>> [  198.594664] RBP: 880078623980 R08: 0dfa R09: 
>>> 063b
>>> [  198.594664] R10:  R11: 063b R12: 
>>> 
>>> [  198.594665] R13:  R14:  R15: 
>>> 
>>> [  198.594672] FS:  7fa252372b40() GS:88007d04() 
>>> knlGS:
>>> [  198.594673] CS:  e033 DS:  ES:  CR0: 80050033
>>> [  198.594674] CR2: f7f303e4 CR3: 02824000 CR4: 
>>> 0660
>>> [  198.594676] Call Trace:
>>> [  198.594683]  entry_INT80_compat+0x7f/0x90
>>> [  198.594685]  ? vunmap_page_range+0x2a0/0x340
>>> [  198.594686] Code: 03 7f 48 8b 75 00 f7 c6 0e 38 00 00 75 2e 83 65 08 f9 
>>> 5b 5d c3 e8

Re: [Xen-devel] Regression with commit "x86/pv: Drop int80_bounce from struct pv_vcpu" f75b1a5247b3b311d3aa50de4c0e5f2d68085cb1

2018-03-12 Thread Sander Eikelenboom
On 12/03/18 21:04, Boris Ostrovsky wrote:
> On 03/12/2018 03:05 PM, Andrew Cooper wrote:
>> On 10/03/18 16:27, Andrew Cooper wrote:
>>> On 10/03/2018 16:14, Sander Eikelenboom wrote:
>>>> Hi Andrew,
>>>>
>>>> It seems commit "x86/pv: Drop int80_bounce from struct pv_vcpu" 
>>>> (f75b1a5247b3b311d3aa50de4c0e5f2d68085cb1) causes an issue on my machine, 
>>>> an AMD phenom X6.
>>>>
>>>> When trying to installing a new kernel package which runs the Debian
>>>> update-initramfs tools with xen-unstable which happened to be at commit 
>>>> c9bd8a73656d7435b1055ee8825823aee995993e as last commit the tool stalls
>>>> and i get this kernel splat:
>>>>
>>>> [  284.910674] BUG: unable to handle kernel NULL pointer dereference at 
>>>> 
>>>> [  284.919696] IP:   (null)
>>>> [  284.928315] PGD 0 P4D 0 
>>>> [  284.943343] Oops: 0010 [#1] SMP NOPTI
>>>> [  284.957008] Modules linked in:
>>>> [  284.965521] CPU: 5 PID: 24729 Comm: ld-linux.so.2 Not tainted 
>>>> 4.16.0-rc4-20180305-linus-pvhpatches-doflr+ #1
>>>> [  284.974154] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
>>>> V1.8B1 09/13/2010
>>>> [  284.983198] RIP: e030:  (null)
>>>> [  284.992006] RSP: e02b:c90001497ed8 EFLAGS: 00010286
>>>> [  285.000612] RAX:  RBX: 880074c64500 RCX: 
>>>> 82f8d1c0
>>>> [  285.009122] RDX: 82f8d1c0 RSI: 20020002 RDI: 
>>>> 82f8d1c0
>>>> [  285.017598] RBP: 880074c64b7c R08:  R09: 
>>>> 
>>>> [  285.025999] R10:  R11:  R12: 
>>>> 82f8d1c0
>>>> [  285.034400] R13:  R14:  R15: 
>>>> 880074c64b50
>>>> [  285.042718] FS:  7f919fe2eb40() GS:88007d14() 
>>>> knlGS:
>>>> [  285.051001] CS:  e033 DS: 002b ES: 002b CR0: 80050033
>>>> [  285.059458] CR2:  CR3: 02824000 CR4: 
>>>> 0660
>>>> [  285.067813] Call Trace:
>>>> [  285.075947]  ? task_work_run+0x85/0xa0
>>>> [  285.084025]  ? exit_to_usermode_loop+0x72/0x80
>>>> [  285.091980]  ? do_int80_syscall_32+0xfe/0x120
>>>> [  285.099896]  ? entry_INT80_compat+0x7f/0x90
>>>> [  285.107688]  ? fpu__drop+0x23/0x40
>>>> [  285.115362] Code:  Bad RIP value.
>>>> [  285.123072] RIP:   (null) RSP: c90001497ed8
>>>> [  285.130714] CR2: 
>>>> [  285.138219] ---[ end trace 4d3317497f4ba022 ]---
>>>> [  285.145671] Fixing recursive fault but reboot is needed!
>>>>
>>>> After updating xen-unstable to the latest available commit 
>>>> 185413355fe331cbc926d48568838227234c9a20,
>>>> the tool doesn't stall anymore but i still get a kernel splat:
>>>>
>>>> [  198.594638] [ cut here ]
>>>> [  198.594641] Invalid address limit on user-mode return
>>>> [  198.594651] WARNING: CPU: 1 PID: 75 at ./include/linux/syscalls.h:236 
>>>> do_int80_syscall_32+0xe5/0x120
>>>> [  198.594652] Modules linked in:
>>>> [  198.594655] CPU: 1 PID: 75 Comm: kworker/1:1 Not tainted 
>>>> 4.16.0-rc4-20180305-linus-pvhpatches-doflr+ #1
>>>> [  198.594656] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
>>>> V1.8B1 09/13/2010
>>>> [  198.594658] Workqueue: events free_work
>>>> [  198.594660] RIP: e030:do_int80_syscall_32+0xe5/0x120
>>>> [  198.594661] RSP: e02b:c9b8ff40 EFLAGS: 00010086
>>>> [  198.594662] RAX: 0029 RBX: c9b8ff58 RCX: 
>>>> 82868e38
>>>> [  198.594663] RDX: 0001 RSI: 0001 RDI: 
>>>> 0001
>>>> [  198.594664] RBP: 880078623980 R08: 0dfa R09: 
>>>> 063b
>>>> [  198.594664] R10:  R11: 063b R12: 
>>>> 
>>>> [  198.594665] R13:  R14:  R15: 
>>>> 
>>>> [  198.594672] FS:  7fa252372b40() GS:88007d04() 
>>>> knlGS:
>>>> [  198.594673] CS:  e033 DS:  ES:  CR0: 80050033
>

Re: [Xen-devel] Regression with commit "x86/pv: Drop int80_bounce from struct pv_vcpu" f75b1a5247b3b311d3aa50de4c0e5f2d68085cb1

2018-03-13 Thread Sander Eikelenboom
On 13/03/18 23:01, Andrew Cooper wrote:
> On 10/03/18 16:14, Sander Eikelenboom wrote:
>> Hi Andrew,
>>
>> It seems commit "x86/pv: Drop int80_bounce from struct pv_vcpu" 
>> (f75b1a5247b3b311d3aa50de4c0e5f2d68085cb1) causes an issue on my machine, 
>> an AMD phenom X6.
>>
>> When trying to installing a new kernel package which runs the Debian
>> update-initramfs tools with xen-unstable which happened to be at commit 
>> c9bd8a73656d7435b1055ee8825823aee995993e as last commit the tool stalls
>> and i get this kernel splat:
>>
>> [  284.910674] BUG: unable to handle kernel NULL pointer dereference at 
>> 
>> [  284.919696] IP:   (null)
>> [  284.928315] PGD 0 P4D 0 
>> [  284.943343] Oops: 0010 [#1] SMP NOPTI
>> [  284.957008] Modules linked in:
>> [  284.965521] CPU: 5 PID: 24729 Comm: ld-linux.so.2 Not tainted 
>> 4.16.0-rc4-20180305-linus-pvhpatches-doflr+ #1
>> [  284.974154] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
>> V1.8B1 09/13/2010
>> [  284.983198] RIP: e030:  (null)
>> [  284.992006] RSP: e02b:c90001497ed8 EFLAGS: 00010286
>> [  285.000612] RAX:  RBX: 880074c64500 RCX: 
>> 82f8d1c0
>> [  285.009122] RDX: 82f8d1c0 RSI: 20020002 RDI: 
>> 82f8d1c0
>> [  285.017598] RBP: 880074c64b7c R08:  R09: 
>> 
>> [  285.025999] R10:  R11:  R12: 
>> 82f8d1c0
>> [  285.034400] R13:  R14:  R15: 
>> 880074c64b50
>> [  285.042718] FS:  7f919fe2eb40() GS:88007d14() 
>> knlGS:
>> [  285.051001] CS:  e033 DS: 002b ES: 002b CR0: 80050033
>> [  285.059458] CR2:  CR3: 02824000 CR4: 
>> 0660
>> [  285.067813] Call Trace:
>> [  285.075947]  ? task_work_run+0x85/0xa0
>> [  285.084025]  ? exit_to_usermode_loop+0x72/0x80
>> [  285.091980]  ? do_int80_syscall_32+0xfe/0x120
>> [  285.099896]  ? entry_INT80_compat+0x7f/0x90
>> [  285.107688]  ? fpu__drop+0x23/0x40
>> [  285.115362] Code:  Bad RIP value.
>> [  285.123072] RIP:   (null) RSP: c90001497ed8
>> [  285.130714] CR2: 
>> [  285.138219] ---[ end trace 4d3317497f4ba022 ]---
>> [  285.145671] Fixing recursive fault but reboot is needed!
>>
>> After updating xen-unstable to the latest available commit 
>> 185413355fe331cbc926d48568838227234c9a20,
>> the tool doesn't stall anymore but i still get a kernel splat:
>>
>> [  198.594638] [ cut here ]
>> [  198.594641] Invalid address limit on user-mode return
>> [  198.594651] WARNING: CPU: 1 PID: 75 at ./include/linux/syscalls.h:236 
>> do_int80_syscall_32+0xe5/0x120
>> [  198.594652] Modules linked in:
>> [  198.594655] CPU: 1 PID: 75 Comm: kworker/1:1 Not tainted 
>> 4.16.0-rc4-20180305-linus-pvhpatches-doflr+ #1
>> [  198.594656] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
>> V1.8B1 09/13/2010
>> [  198.594658] Workqueue: events free_work
>> [  198.594660] RIP: e030:do_int80_syscall_32+0xe5/0x120
>> [  198.594661] RSP: e02b:c9b8ff40 EFLAGS: 00010086
>> [  198.594662] RAX: 0029 RBX: c9b8ff58 RCX: 
>> 82868e38
>> [  198.594663] RDX: 0001 RSI: 0001 RDI: 
>> 0001
>> [  198.594664] RBP: 880078623980 R08: 0dfa R09: 
>> 063b
>> [  198.594664] R10:  R11: 063b R12: 
>> 
>> [  198.594665] R13:  R14:  R15: 
>> 
>> [  198.594672] FS:  7fa252372b40() GS:88007d04() 
>> knlGS:
>> [  198.594673] CS:  e033 DS:  ES:  CR0: 80050033
>> [  198.594674] CR2: f7f303e4 CR3: 02824000 CR4: 
>> 0660
>> [  198.594676] Call Trace:
>> [  198.594683]  entry_INT80_compat+0x7f/0x90
>> [  198.594685]  ? vunmap_page_range+0x2a0/0x340
>> [  198.594686] Code: 03 7f 48 8b 75 00 f7 c6 0e 38 00 00 75 2e 83 65 08 f9 
>> 5b 5d c3 e8 0c fb ff ff e9 53 ff ff ff 48 c7 c7 58 35 57 82 e8 ab 3e 0c 00 
>> <0f> 0b bf 09 00 00 00 48 89 ee e8 8c 00 0d 00 eb b8 48 89 df e8 
>> [  198.594706] ---[ end trace 90bcd2147bc825ef ]---
>>
>> After reverting commit f75b1a5247b3b311d3aa50de4c0e5f2d68085cb1 the issue is 
>> gone.
> 
> Can you try this patch?

Hi Andrew,

Testing with: ldd -v /lib/x86_6

Re: [Xen-devel] [Notes for xen summit 2018 design session] Process changes: is the 6 monthly release Cadence too short, Security Process, ...

2018-07-05 Thread Sander Eikelenboom
On 05/07/18 10:43, Roger Pau Monné wrote:
> On Thu, Jul 05, 2018 at 09:19:10AM +0100, Wei Liu wrote:
>> On Thu, Jul 05, 2018 at 10:06:52AM +0200, Roger Pau Monné wrote:
>>> On Thu, Jul 05, 2018 at 08:53:51AM +0100, Wei Liu wrote:
 On Wed, Jul 04, 2018 at 03:26:16PM +, George Dunlap wrote:
> So a fair amount of the discussion was about what it would look like,
> and what it would take, to make it such that almost any push from
> osstest (or whatever testing infrasctructure we went with) could
> reasonably be released, and would have a very low expectation of
> having extraneous bugs.

 I would also like to advocate changing the mentality a bit. The current
 mentality is that "we want to be reasonably sure there is low
 expectation of bugs before we can release". Why not change to "we
 release when we're sure there is definitely improvement in the tree
 compared to last release"?
>>>
>>> The current guideline is quite objective, if there are no reported
>>> bugs and osstest flight doesn't show any regressions we are ready to
>>> release. OTOH how should the improvements to the tree be quantized and
>>> measured?
>>
>> Say, a security bug is fixed? A major bug is closed?
> 
> I think this is still quite subjective, whereas the previous criteria
> was objective.
> 
> Who will take the decision of whether a bug is major or not?
> 
>>>
>>> At any point during the development or the release process the tree
>>> will contain improvements in some areas compared to the last
>>> release.
>>
>> Yes, that is right. That's what CD does, right?
> 
> I thin so, but I'm not an expert on development techniques TBH :).
> 
> IMO one of the problems with Xen is that users don't tend to test
> master often, I assume this is because Xen is a critical piece of
> their infra, and they require it to be completely stable. Not everyone
> can effort an extra box just for testing Xen master. I'm not sure this
> is going to change a lot even if nightly builds are provided.

Since I actually do (on my homeserver), just to share the experience:
- Master/xen-unstable is actually quite stable!
- Most issues I encounter are boot issues, not uncommonly due to upstream
  Linux kernel changes (by other developers) which have an unforeseen
  impact on Xen.
- So one of the issues with testing is other projects that Xen depends
  upon, and which are out of your control.
- But that's n=1 and on older hardware, so I don't run into issues with
  newer hardware features.

One thing I haven't seen mentioned regarding OSSTEST and testing in general
is the somewhat unwieldy test matrix. For a lot of Xen functionality there
are at least 2 options.
I do understand that feature deprecation isn't an easy thing to do, that
there are always reasons to keep stuff around, and that even if you do it,
it still takes time to reap the benefits, but the benefits could be:
- (a lot of) code cleanup, which eases development in general.
- much less building and testing that has to be done.
- easier documentation wise.
- in the end, easier for users as well.

But perhaps it won't hurt to have some discussion about whether, *why* and
until when certain features/sub-systems are still worth keeping (for
example: qemu-trad <-> qemu-xen, PV <-> PVH, seabios <-> rombios). Since
deprecation generally takes a few releases, I think it could be wise to
have a discussion about it ahead of every release, so a deprecation warning
can be incorporated in that release and the release notes if need be.

--
Sander

> This is different from say an email or IRC clients, where people don't
> mind that much using unstable versions, and so the development branch
> gets more testing even before the release process starts.
> 
> Roger.
> 
> 



Re: [Xen-devel] [Notes for xen summit 2018 design session] Process changes: is the 6 monthly release Cadence too short, Security Process, ...

2018-07-05 Thread Sander Eikelenboom

Thursday, July 5, 2018, 5:14:39 PM, you wrote:

> Juergen Gross writes ("Re: [Xen-devel] [Notes for xen summit 2018 design 
> session] Process changes: is the 6 monthly release Cadence too short, 
> Security Process, ..."):
>> Same applies to the host: the base system (without the to be tested
>> component like qemu, xen, or whatever) could be installed just by
>> cloning a disk/partition/logical volume.

> Certainly it would be a bad idea to use anything *on the test host
> itself* as a basis for a subsequent test.  The previous test might
> have corrupted it.

> So that means that often, and at least from one test flight to the
> next, all of the base dom0 OS needs to be copied from somewhere else
> to the test host.  This is not currently as fast as it could be, but
> running d-i is not massively slower than something like FAI.

How about using (LVM) snapshotting (which does COW) and dropping the
snapshots after a test?
Only do a new OS install once a day/week (or per point release), and only
after having an OSStest pass?
That should have fairly little overhead. A sketch of what I mean follows
below.
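To make it concrete, a minimal sketch of the flow I have in mind (the VG/LV
names and sizes here are made up, and osstest would of course wrap this in
its own machinery):

  # Once per day/week/point release, after an OSStest pass:
  # install the base dom0 image into a regular LV.
  lvcreate -L 20G -n dom0-base vg0
  # ... run d-i (or copy an image) onto /dev/vg0/dom0-base ...

  # Per test job: boot from a throw-away COW snapshot of the base image.
  lvcreate -s -L 5G -n dom0-test vg0/dom0-base
  # ... run the test with /dev/vg0/dom0-test as the root device ...

  # Afterwards: drop the snapshot; the base LV itself is never written to.
  lvremove -f vg0/dom0-test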

--
Sander

> To a fairly large extent, similar considerations apply to guest
> images.

>> Each image would run through the stages new->staging->stable:
>> 
>> - Each time a component is released an image is based on (e.g. a new
>>   mainline kernel) a new image is created by installing it. In case this
>>   succeeds, the image is moved to the staging area.

> This would happen a lot more often than you seem to image.  "Releaed"
> here really means "is updated in its appropriate git branch".

> Unless you think we should do our testing of Xen mainly with released
> versions of Linux stable branches (in which case, given how Linux
> stable branches are often broken, we might be long out of date), or
> our testing of Linux only with point releases of Xen, etc.

> The current approach is mostly to take the most recent
> tested-and-working git commit from each of the inputs.  This aspect of
> osstest generally works well, I think.

> Ian.





Re: [Xen-devel] [Notes for xen summit 2018 design session] Process changes: is the 6 monthly release Cadence too short, Security Process, ...

2018-07-05 Thread Sander Eikelenboom
On 05/07/18 19:02, Ian Jackson wrote:
> Andrew Cooper writes ("Re: [Xen-devel] [Notes for xen summit 2018 design 
> session] Process changes: is the 6 monthly release Cadence too short, 
> Security Process, ..."):
>> XenRT, which is XenServers provisioning and testing system and install,
>> can deploy arbitrary builds of XenServer, or arbitrary builds of various
>> Linux distros in 10 minutes (although for distros, we limit our install
>> media to published point releases).  Google "10 minutes to Xen" for some
>> PR on this subject done back in the day!
> 
> osstest's d-i runs take more like 15 minutes.  As I say, this could be
> improved by using something like FAI, but by a factor of at most 2 I
> think.  Instead of working on that, I have been working on reusing an
> install when it is feasible to do so: specifically, after a passing
> job and when the host is to be reused by the same flight, with an
> identical configuration.  In my tests that saves about 50% of the host
> installs.  I haven't yet completed and deployed this.
> 
> Ian.
> 
Just wondering, are there any timing statistics kept for the OSStest
flights (and separately for building the various components and running the
individual tests)? Or should they be parseable from the logs that are kept?

That could perhaps give some better insight into the average and variation
in time spent in all the components.
--
Sander


Re: [Xen-devel] [Notes for xen summit 2018 design session] Process changes: is the 6 monthly release Cadence too short, Security Process, ...

2018-07-05 Thread Sander Eikelenboom
On 05/07/18 19:11, Ian Jackson wrote:
> Sander Eikelenboom writes ("Re: [Xen-devel] [Notes for xen summit 2018 design 
> session] Process changes: is the 6 monthly release Cadence too short, 
> Security Process, ..."):
>> Just wondering, are there any timing statistics kept for the OSStest
>> flights (and separate for building the various components and running
>> the individual tests ?). Or should they be parse-able from the logs kept ?
> 
> Yes.  The database has a started and stopped time_t for each test
> step.  That's where I got the ~15 mins number from.
> 
> Ian.
> 

And how much time would a complete flight for a push require at minimum?
And is there much variation between flights (does a non-push with some
failing test require more or less time than a successful one)?

--
Sander

BTW:
The following link: http://osstest.xs.citrite.net/~osstest/testlogs/logs
from the osstest mail with subject "[Xen-devel] [xen-4.10-testing
baseline-only test] 74937: tolerable FAIL" doesn't seem to work.

(Is that server not publicly accessible?)


Re: [Xen-devel] [Notes for xen summit 2018 design session] Process changes: is the 6 monthly release Cadence too short, Security Process, ...

2018-07-05 Thread Sander Eikelenboom
On 05/07/18 19:25, Ian Jackson wrote:
> George Dunlap writes ("Re: [Xen-devel] [Notes for xen summit 2018 design 
> session] Process changes: is the 6 monthly release Cadence too short, 
> Security Process, ..."):
>> I don’t really understand why you’re more worried about a test
>> corrupting a backup partition or LVM snapshot, than of a test
>> corrupting a filesystem even when the test actually passed.  I don’t
>> have the same experience you do, but it seems like random stuff left
>> over from a previous test — even if the test passes — would have
>> more of a chance of screwing up a future test than some sort of
>> corruption of an LVM snapshot, and even less so a backup partition.
> 
> The difference is that these are tests *in the same flight*.  That
> means they're testing the same software.
> 
> If test A passes, but corrupts the disk which is detected by test B
> because the host wasn't wiped in between, causing test B to fail, then
> that is a genuine test failure - albeit one whose repro conditions are
> complicated.  I'm betting that this will be rare enough not to matter.
> 
> Ian.
> 

I know assumption happens to be the mother of some children with a certain
"attitude", but with LVM I think in practice the chances of corruption
would be pretty minimal.
The most prominent test which could cause issues would be one with a
linux-linus (unstable) kernel.
Otherwise there would have to be a very grave bug in either a stable Linux
kernel, the hardware, or the Xen used in the test flight which nukes
something quite specific.
That's because you don't use the LVM LV with the base image itself but only
the snapshot, and you recycle the snapshot every time.

The other points you mentioned about EFI etc. could be interesting though.

On the other hand, using a setup with LVM wouldn't prohibit you from
reinstalling the base image to the LV on every flight, just as you seem to
suggest above. It does have the benefit of being able to keep it across
flights with some seemingly simple adjustments, when in practice that would
just seem to work (with always being able to revert to doing a new base
image every flight).

But I will leave it at this for now, it was merely a suggestion :).

--
Sander



Re: [Xen-devel] [Notes for xen summit 2018 design session] Process changes: is the 6 monthly release Cadence too short, Security Process, ...

2018-07-05 Thread Sander Eikelenboom
On 05/07/18 19:11, Ian Jackson wrote:
> Sander Eikelenboom writes ("Re: [Xen-devel] [Notes for xen summit 2018 design 
> session] Process changes: is the 6 monthly release Cadence too short, 
> Security Process, ..."):
>> Just wondering, are there any timing statistics kept for the OSStest
>> flights (and separate for building the various components and running
>> the individual tests ?). Or should they be parse-able from the logs kept ?
> 
> Yes.  The database has a started and stopped time_t for each test
> step.  That's where I got the ~15 mins number from.
> 
> Ian.
> 

Hi Ian,

Since the current OSStest emails give a 404 on the link to the logs,
I dug into the archives and found the right URL:
http://logs.test-lab.xenproject.org/osstest/logs/

I took the liberty of browsing through some of the flights, trying to get a
grasp of how to interpret the numbers.

Let's take an example: http://logs.test-lab.xenproject.org/osstest/logs/124946/
Started:2018-07-03 13:08:06 Z
Finished:   2018-07-05 06:08:54 Z

That is quite some time ...

Now if I take an example job/test, say:
http://logs.test-lab.xenproject.org/osstest/logs/124940/test-amd64-amd64-xl/info.html

I see:
- Step 2 hosts-allocate takes 20012 seconds,
  which, if I interpret it right, indicates a lot of time waiting before
  actually having a slot available to run, so that seems to indicate at
  least a capacity problem in the infrastructure.
- Step 3 seems to be the elapsed time while syslog recorded all the steps
  thereafter. It's 2639 seconds, while the remaining steps sum to 2630,
  so that seems about right.

  All the other steps together take 2630 seconds, so the run-to-wait ratio
  is about 1/7.
  For the remainder let's keep the waiting out of the equation, under the
  assumption that if we can reduce the rest, we reduce the load on the
  infrastructure and reduce the waiting time as well.

- Step 4 host-install(4) takes 1005 seconds.
  It seems step 4 is the step you referred to with the 15 minutes (it's
  indeed around 15 minutes)?
  That is around 38 percent of all the steps (excluding the waiting from
  step 2)!

- Step 10 debian-install, which seems to be the guest install, is modest at
  288 seconds.

I also browsed some other tests and flights, and at first sight they seem
to give the same pattern.

So (sometimes) a lot of time is spent waiting for a slot, followed by doing
the host install.

So any improvement in the latter will probably reap a double benefit by
also reducing the wait time!
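To make the arithmetic explicit (a trivial sketch; the numbers are the ones
copied from the info.html page linked above):

  wait=20012   # step 2, hosts-allocate
  run=2630     # sum of all the remaining steps
  echo "scale=2; $run / $wait" | bc   # ~0.13, i.e. roughly 1/7.6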


When I look at job/test:
http://logs.test-lab.xenproject.org/osstest/logs/124940/test-amd64-amd64-xl-qemuu-win10-i386/info.html

I see:
- step 2 hosts-allocate: 47116 seconds.
- step 3 syslog-server: 8191 seconds.
- step 4 host-install(4): 789 seconds, somewhat shorter than the other job/test.
- step 10 windows-install 7061 seconds, but a failing windows 10 guest install 
dwarfs them all...


When I look at job/test:
http://logs.test-lab.xenproject.org/osstest/logs/124940/test-amd64-amd64-xl-qemuu-win7-amd64/info.html

I see:
- step 2 hosts-allocate: 13272 seconds.
- step 3 syslog-server: 2985 seconds.
- step 4 host-install(4): 675 seconds, even somewhat shorter than both the 
other job/tests.
- step 10 windows-install 1029 seconds, that's a lot better than the failing 
windows 10 install from the other job.

So running the windows install is currently a black box with a timeout of
7000 seconds.
If it fails, the total runtime of the job/test is around 8000 seconds,
which is over 2 hours!

Which we do 4 times: 
- test-amd64-amd64-xl-qemut-win10-i386
- test-amd64-i386-xl-qemut-win10-i386
- test-amd64-amd64-xl-qemuu-win10-i386
- test-amd64-i386-xl-qemuu-win10-i386

Which all seem to result in a "10. windows-install" -> "fail never pass".
I sincerely *hope* I'm not interpreting this correctly .. but are we
wasting 4 * 2 hours = 8 hours in a flight on a job/test that has *never
ever* passed (and probably never will, miracles or a specific bugfix
excluded)?

Would it be an idea to test "fail never pass" install steps only every once
in a while (they can't be blockers anyway), if at all (only re-enabling
them manually after a fix)? If my interpretation is right, this seems to be
quite low-hanging fruit.

--
Sander


Re: [Xen-devel] [Notes for xen summit 2018 design session] Process changes: is the 6 monthly release Cadence too short, Security Process, ...

2018-07-06 Thread Sander Eikelenboom
On 06/07/18 00:47, Sander Eikelenboom wrote:
> On 05/07/18 19:11, Ian Jackson wrote:
>> Sander Eikelenboom writes ("Re: [Xen-devel] [Notes for xen summit 2018 
>> design session] Process changes: is the 6 monthly release Cadence too short, 
>> Security Process, ..."):
>>> Just wondering, are there any timing statistics kept for the OSStest
>>> flights (and separate for building the various components and running
>>> the individual tests ?). Or should they be parse-able from the logs kept ?
>>
>> Yes.  The database has a started and stopped time_t for each test
>> step.  That's where I got the ~15 mins number from.
>>
>> Ian.
>>
> 
> Hi Ian,
> 
> Since the current OSStest emails give a 404 on the link to the logs,
> i digged in the archives and found the right url:
> http://logs.test-lab.xenproject.org/osstest/logs/
> 
> I took the liberty to browse through some of the flights trying to get a 
> grasp on how
> to interpret the numbers.
> 
> Let't take an example: 
> http://logs.test-lab.xenproject.org/osstest/logs/124946/
> Started:  2018-07-03 13:08:06 Z
> Finished: 2018-07-05 06:08:54 Z
> 
> That is quite some time ...
> 
> Now if i take an example job/test say: 
> http://logs.test-lab.xenproject.org/osstest/logs/124940/test-amd64-amd64-xl/info.html
> 
> I see:
> - step 2 hosts-allocate takes 20012 seconds
>   which if i interpret it right, indicates a lot of time waiting before 
> actually having a slot available to run,
>   so that seems to be indicating at least a capacity problem on the infra 
> structure.
> - Step 3 seems to be the elapsed time while syslog recorded all the steps 
> thereafter.
>   It's 2639 seconds, while the rest of the steps remaining give a sum of 
> 2630, so that seems about right.
> 
>   All the other steps together take 2630 seconds, so the run to wait ratio is 
> about 1/7 
>   For the remainder let's keep the waiting out of the equation, under the 
> assumption that if we can reduce the rest, 
>   we reduce the load on the infrastructure and reduce the waiting time as 
> well.
>  
> - step 4 host-install(4) takes 1005 seconds
>   It seems step 4 is the step you referred to with the 15 minutes (it's 
> indeed around 15 minutes) ?
>   That is around 38% percent of all the steps (excluding the waiting from 
> step 2) !
> 
> - step 10 debian-install which seems to be the guest install, seems modest 
> with 288 seconds.
> 
> I also browsed some other tests and flights and on first sight it does seem 
> the give the same pattern.
> 
> So (sometimes), a lot of time is spent on waiting for a slot, followed by 
> doing the host install. 
> 
> So any improvement in the later will probably reap a double benefit by also 
> reducing the wait time !
> 
> 
> When i look at job/test: 
> http://logs.test-lab.xenproject.org/osstest/logs/124940/test-amd64-amd64-xl-qemuu-win10-i386/info.html
> 
> I see:
> - step 2 hosts-allocate: 47116 seconds.
> - step 3 syslog-server: 8191 seconds.
> - step 4 host-install(4): 789 seconds, somewhat shorter than the other 
> job/test.
> - step 10 windows-install 7061 seconds, but a failing windows 10 guest 
> install dwarfs them all...
> 
> 
> When i look at job/test: 
> http://logs.test-lab.xenproject.org/osstest/logs/124940/test-amd64-amd64-xl-qemuu-win7-amd64/info.html
> 
> I see:
> - step 2 hosts-allocate: 13272 seconds.
> - step 3 syslog-server: 2985 seconds.
> - step 4 host-install(4): 675 seconds, even somewhat shorter than both the 
> other job/tests.
> - step 10 windows-install 1029 seconds, that's a lot better than the failing 
> windows 10 install from the other job.
> 
> So running the windows install is currently a black box with a timeout of 
> 7000 seconds.
> If it fails the total runtime of the job/test is around 8000 seconds which is 
> almost 2 hours !
> 
> Which we do 4 times: 
> - test-amd64-amd64-xl-qemut-win10-i386
> - test-amd64-i386-xl-qemut-win10-i386
> - test-amd64-amd64-xl-qemuu-win10-i386
> - test-amd64-i386-xl-qemuu-win10-i386
> 
> Which all seem to result in a "10. windows-install" -> "fail never pass".
> I sincerely *hope* i'm not interpreting this correct .. but are we wasting 4 
> * 2 hours = 8 hours in a flight, 
> on a job/test that has *never ever* passed (and probably will never, miracles 
> or a specific bugfix excluded) ?

This morning I had another look, and
http://logs.test-lab.xenproject.org/osstest/logs/124940/test-amd64-amd64-xl-qemuu-win10-i386/fiano0_win.guest.osstest-vnc.jpeg
could indicate Windows 10 has detected no NIC. Perhaps changing the emulated
NIC type fro

Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-17 Thread Sander Eikelenboom
Marek / Ian,

Nice to see PCI-passthrough getting some attention again.

On 17/05/18 17:12, Ian Jackson wrote:
> Marek Marczykowski-Górecki writes ("Re: Test for osstest, features used in 
> Qubes OS"):
>> On Thu, May 17, 2018 at 01:26:30PM +0100, Ian Jackson wrote:
>>> Is it likely that this will depend on non-buggy host firmware ?  If so
>>> then we need to make arrangements to test it and only do it on hosts
>>> which are not buggy.  In practice this probably means wiring it up to
>>> the automatic host examiner.
>>
>> Yes, probably.
> 
> That's not entirely trivial then, especially for you, unless you want
> to set up your own osstest production instance.  However, I can
> probably do the osstest-machinery work if you will help debug it,
> review logs, tell me what to do next, etc. :-).
> 
>>> Is there some kind of cheap USB HID, that is interactable-with, which
>>> we could plug into each machine's USB port ?  I'm slightly concerned
>>> that plugging in a storage device, or connecting the other NIC, might
>>> interfere with booting.
>>
>> I use mass storage for tests... But if you use network boot, it
>> shouldn't really interfere, no?
> 
> We do both network boot and disk boot.  I think the BIOS disk boot has
> to continue to work and boot the HDD.

As a user of pci-passthrough for quite some time, and having reported some
pci-passthrough bugs in the past, I do have some comments:

- First of all, it would be very nice to get some autotesting :).
- But if you want to thoroughly test pci-passthrough, it will be far from
  easy, since there is quite a multi-dimensional support matrix (I'm not
  implying that everything should be done, or that it won't be valuable if
  any of it is missing; it's only meant for reference):
  1) Guest side implementation:
     - PV guest (pcifront)
     - HVM (qemu-traditional)
     - HVM (qemu-xen)
     - HVM (qemu-upstream)
     - perhaps PVH support for pci passthrough coming around the corner.
  2) (Un)Binding method to pciback:
     - binding pci devices to pciback on host boot (command line)
     - de/re/unbinding devices from dom0 while running.
  3) (Un)binding to guest:
     - on guest start (guest.cfg pci=[...])
     - after the guest has been started, with 'xl pci-*' commands
       (see the sketch below).
  4) Device interrupts: legacy versus MSI versus MSI-X
  5) Other pci device features: roms, BAR sizes, etc.
  6) AMD versus Intel IOMMU

From the past reports, I know (1) and (3) did matter (problems being isolated 
to one of these variants only).
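
To illustrate, a minimal sketch of the runtime variants of 2) and 3) (the
BDF and the domain name here are made up):

  # 2) Rebind a device from its dom0 driver to pciback at runtime
  xl pci-assignable-add 0000:04:00.0

  # 3) Either assign it at guest start via the guest config:
  #      pci = [ '04:00.0' ]
  #    or hot-(un)plug it while the guest is running:
  xl pci-attach guest1 0000:04:00.0
  xl pci-detach guest1 0000:04:00.0

  # Undo the pciback binding afterwards
  xl pci-assignable-remove 0000:04:00.0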


As for restarting guests and reassigning pci devices to other guests: the
pciback reset support in current upstream Linux kernels lacks the bus-reset
patches. Without them, passthrough of AMD Radeon graphics adapters works
only once (if you stop and restart a guest it doesn't work anymore and you
need to reboot the host). The bus-reset patches have been posted to the
list and seem to be in both Qubes and XenServer in some form, but not in
upstream Linux. Someone from Oracle had picked them up to get them upstream
some time ago, but that effort seems to have stalled.

The code in libxl seems to be quite messy for pci-passthrough, especially
for handling all the guest side implementations (1) and the xenstore
interactions that go with them (or don't, for qemu).

--
Sander

 
>>> If you want to get pci passthrough tests working I would suggest
>>> testing it with non-stubdom first.  I assume the config etc. is the
>>> same, so having got that working, osstest would be able to test it for
>>> the stubdom tests too.
>>
>> Oh, I though there are already tests for that...
> 
> There are no PCI passthrough tests at all.  For a while we had some
> SRIOV NIC tests which were requested by Intel.  But they always failed
> giving kernel stack dumps.  We kept poking Intel to get them to fix
> them, or tell us how the tests were wrong, but to no avail.  So we
> dropped them.
> 
> So any work in this area would be greatly appreciated!
> 
> Ian.
> 
> 



Re: [Xen-devel] [PATCH] xen/blkfront: When purging persistent grants, keep them in the buffer

2018-09-27 Thread Sander Eikelenboom
On 27/09/18 16:26, Jens Axboe wrote:
> On 9/27/18 1:12 AM, Juergen Gross wrote:
>> On 22/09/18 21:55, Boris Ostrovsky wrote:
>>> Commit a46b53672b2c ("xen/blkfront: cleanup stale persistent grants")
>>> added support for purging persistent grants when they are not in use. As
>>> part of the purge, the grants were removed from the grant buffer, This
>>> eventually causes the buffer to become empty, with BUG_ON triggered in
>>> get_free_grant(). This can be observed even on an idle system, within
>>> 20-30 minutes.
>>>
>>> We should keep the grants in the buffer when purging, and only free the
>>> grant ref.
>>>
>>> Fixes: a46b53672b2c ("xen/blkfront: cleanup stale persistent grants")
>>> Signed-off-by: Boris Ostrovsky 
>>
>> Reviewed-by: Juergen Gross 
> 
> Since Konrad is out, I'm going to queue this up for 4.19.
> 

Hi Boris/Juergen.

Last week I tested a linux-4.19-rc4 kernel with xen-next and this patch
from Boris pulled on top.
Unfortunately it made a VM hang (probably because its rootfs was shuffled
out from under its feet), and it gave these in dom0 dmesg:

[ 9251.696090] xen-blkback: requesting a grant already in use
[ 9251.705861] xen-blkback: trying to add a gref that's already in the tree
[ 9251.715781] xen-blkback: requesting a grant already in use
[ 9251.725756] xen-blkback: trying to add a gref that's already in the tree
[ 9251.735698] xen-blkback: requesting a grant already in use
[ 9251.745573] xen-blkback: trying to add a gref that's already in the tree

The VM was an HVM guest with 4 vcpus and 2 phy disks:
xen-blkback: backend/vbd/14/768: using 4 queues, protocol 1 (x86_64-abi) 
persistent grants
xen-blkback: backend/vbd/14/832: using 4 queues, protocol 1 (x86_64-abi) 
persistent grants


Currently I have been running 4.19-rc5 with xen-next on top and commit
a46b53672b2c reverted, for a couple of days. That seems to run stable for
me (it's a small box, so I'm not hit by what a46b53672b2c tried to fix).

If you can come up with a debug patch I can give that a spin tomorrow
evening or over the weekend, so we are hopefully still in time for the
4.19 release.

--
Sander


Re: [Xen-devel] [PATCH] xen/blkfront: When purging persistent grants, keep them in the buffer

2018-09-27 Thread Sander Eikelenboom
On 27/09/18 21:06, Boris Ostrovsky wrote:
> On 9/27/18 2:56 PM, Jens Axboe wrote:
>> On 9/27/18 12:52 PM, Sander Eikelenboom wrote:
>>> On 27/09/18 16:26, Jens Axboe wrote:
>>>> On 9/27/18 1:12 AM, Juergen Gross wrote:
>>>>> On 22/09/18 21:55, Boris Ostrovsky wrote:
>>>>>> Commit a46b53672b2c ("xen/blkfront: cleanup stale persistent grants")
>>>>>> added support for purging persistent grants when they are not in use. As
>>>>>> part of the purge, the grants were removed from the grant buffer, This
>>>>>> eventually causes the buffer to become empty, with BUG_ON triggered in
>>>>>> get_free_grant(). This can be observed even on an idle system, within
>>>>>> 20-30 minutes.
>>>>>>
>>>>>> We should keep the grants in the buffer when purging, and only free the
>>>>>> grant ref.
>>>>>>
>>>>>> Fixes: a46b53672b2c ("xen/blkfront: cleanup stale persistent grants")
>>>>>> Signed-off-by: Boris Ostrovsky 
>>>>> Reviewed-by: Juergen Gross 
>>>> Since Konrad is out, I'm going to queue this up for 4.19.
>>>>
>>> Hi Boris/Juergen.
>>>
>>> Last week i tested a linux-4.19-rc4 kernel with xen-next and this patch 
>>> from Boris pulled on top. 
>>> Unfortunately it made a VM hang (probably because it's rootFS is shuffled 
>>> from under it's feet 
> 
> What do you mean by "rootFS is shuffled from under it's feet " ?

My assumption is that blkfront got borked, resulting in either a kernel
crash or the rootfs being remounted read-only. I didn't (try to) check,
though.

>>> and it gave these in dom0 dmesg:
>>>
>>> [ 9251.696090] xen-blkback: requesting a grant already in use
>>> [ 9251.705861] xen-blkback: trying to add a gref that's already in the tree
>>> [ 9251.715781] xen-blkback: requesting a grant already in use
>>> [ 9251.725756] xen-blkback: trying to add a gref that's already in the tree
>>> [ 9251.735698] xen-blkback: requesting a grant already in use
>>> [ 9251.745573] xen-blkback: trying to add a gref that's already in the tree
>>>
>>> The VM was a HVM with 4 vcpu's and 2 phy disks:
>>> xen-blkback: backend/vbd/14/768: using 4 queues, protocol 1 (x86_64-abi) 
>>> persistent grants
>>> xen-blkback: backend/vbd/14/832: using 4 queues, protocol 1 (x86_64-abi) 
>>> persistent grants
>>>
>>>
>>> Currently i have been running 4.19-rc5 with xen-next on top and commit
>>> a46b53672b2c reverted, for a couple of days. That seems to run stable
>>> for me (since it's a small box so i'm not hit by what a46b53672b2c
>>> tried to fix.
>>>
>>> If you can come up with a debug patch i can give that a spin tomorrow
>>> evening or in the weekend, so we are hopefully still in time for the
>>> 4.19 release.
>> At this late in the game, might make more sense to simply revert the
>> buggy commit.  Especially since what is currently out there doesn't fix
>> the issue for you.
I don't know if Boris or Juergen have a hunch about the issue; if not,
perhaps a revert is best.

> If decision is to revert then I think the whole series needs to be
> reverted.
> 
> -boris
> 

For Boris and Juergen:
Would it make sense to have an "xen-next" branch in the xen-tip tree that is:
- based on the previous stable kernel;
- and has the for-linus branches for the upcoming kernel release on top;
- and has the patches for net(-next) and block changes on top (since these
  don't go via the tree but only via mailing-list patches, which are
  scattered and difficult to track and use for automated testing);
- and dependency patches for the above, if necessary, to be able to build.

So there would be one branch that can be used to test ALL pending kernel
related Xen patches, and which could be used in OSStest without as many
potential false alarms as linux-next will have? A rough sketch of what I
mean follows below.
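Roughly, as a sketch (the remote and branch names here are made up, and the
list-only net/block patches would still have to be collected by hand):

  git checkout -b xen-next v4.18          # previous stable kernel
  git merge xen-tip/for-linus-4.19        # pending Xen branches on top
  git am ../queue/xen-net-block/*.patch   # patches that only exist on the list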

--
Sander


Re: [Xen-devel] [PATCH] xen/blkfront: When purging persistent grants, keep them in the buffer

2018-09-27 Thread Sander Eikelenboom
On 27/09/18 23:48, Boris Ostrovsky wrote:
> On 9/27/18 5:37 PM, Jens Axboe wrote:
>> On 9/27/18 2:33 PM, Sander Eikelenboom wrote:
>>> On 27/09/18 21:06, Boris Ostrovsky wrote:
>>>> On 9/27/18 2:56 PM, Jens Axboe wrote:
>>>>> On 9/27/18 12:52 PM, Sander Eikelenboom wrote:
>>>>>> On 27/09/18 16:26, Jens Axboe wrote:
>>>>>>> On 9/27/18 1:12 AM, Juergen Gross wrote:
>>>>>>>> On 22/09/18 21:55, Boris Ostrovsky wrote:
>>>>>>>>> Commit a46b53672b2c ("xen/blkfront: cleanup stale persistent grants")
>>>>>>>>> added support for purging persistent grants when they are not in use. 
>>>>>>>>> As
>>>>>>>>> part of the purge, the grants were removed from the grant buffer, This
>>>>>>>>> eventually causes the buffer to become empty, with BUG_ON triggered in
>>>>>>>>> get_free_grant(). This can be observed even on an idle system, within
>>>>>>>>> 20-30 minutes.
>>>>>>>>>
>>>>>>>>> We should keep the grants in the buffer when purging, and only free 
>>>>>>>>> the
>>>>>>>>> grant ref.
>>>>>>>>>
>>>>>>>>> Fixes: a46b53672b2c ("xen/blkfront: cleanup stale persistent grants")
>>>>>>>>> Signed-off-by: Boris Ostrovsky 
>>>>>>>> Reviewed-by: Juergen Gross 
>>>>>>> Since Konrad is out, I'm going to queue this up for 4.19.
>>>>>>>
>>>>>> Hi Boris/Juergen.
>>>>>>
>>>>>> Last week i tested a linux-4.19-rc4 kernel with xen-next and this patch 
>>>>>> from Boris pulled on top. 
>>>>>> Unfortunately it made a VM hang (probably because it's rootFS is 
>>>>>> shuffled from under it's feet 
>>>> What do you mean by "rootFS is shuffled from under it's feet " ?
>>> Assumption that block-front getting borked and either a kernel crash or 
>>> rootfs becoming mounted readonly. Didn't (try) to check though.
>>>
>>>>>> and it gave these in dom0 dmesg:
>>>>>>
>>>>>> [ 9251.696090] xen-blkback: requesting a grant already in use
>>>>>> [ 9251.705861] xen-blkback: trying to add a gref that's already in the 
>>>>>> tree
>>>>>> [ 9251.715781] xen-blkback: requesting a grant already in use
>>>>>> [ 9251.725756] xen-blkback: trying to add a gref that's already in the 
>>>>>> tree
>>>>>> [ 9251.735698] xen-blkback: requesting a grant already in use
>>>>>> [ 9251.745573] xen-blkback: trying to add a gref that's already in the 
>>>>>> tree
>>>>>>
>>>>>> The VM was a HVM with 4 vcpu's and 2 phy disks:
>>>>>> xen-blkback: backend/vbd/14/768: using 4 queues, protocol 1 (x86_64-abi) 
>>>>>> persistent grants
>>>>>> xen-blkback: backend/vbd/14/832: using 4 queues, protocol 1 (x86_64-abi) 
>>>>>> persistent grants
>>>>>>
>>>>>>
>>>>>> Currently i have been running 4.19-rc5 with xen-next on top and commit
>>>>>> a46b53672b2c reverted, for a couple of days. That seems to run stable
>>>>>> for me (since it's a small box so i'm not hit by what a46b53672b2c
>>>>>> tried to fix.
>>>>>>
>>>>>> If you can come up with a debug patch i can give that a spin tomorrow
>>>>>> evening or in the weekend, so we are hopefully still in time for the
>>>>>> 4.19 release.
>>>>> At this late in the game, might make more sense to simply revert the
>>>>> buggy commit.  Especially since what is currently out there doesn't fix
>>>>> the issue for you.
>>> Don't know if Boris or Juergen have a hunch about the issue, if not
>>> perhaps a revert is the best.
>> Anyone? Unless I hear otherwise, I'll revert the series tomorrow.
> 
> Juergen may have something to say by tomorrow, but from my perspective,
> given that we are coming up on rc6 --- yes.
> 
> I looked at the patches again and didn't see anything obvious.
> 
> -boris

It could also be that what I hit is a latent bug that is not caused by
these patches but merely got uncovered by them.

xl dmesg also shows quite a lot of:
(XEN) [2018-09-24 03:15:46.847] grant_table.c:1755:d14v0 Expanding d14
grant table from 19 to 20 frames
(XEN) [2018-09-24 03:15:46.849] grant_table.c:1755:d14v0 Expanding d14
grant table from 20 to 21 frames
(and has done that for ages on my box, without leading to any direct
problems to my knowledge).

I don't know if that could be related, and whether something around the
(persistent) grants for block devices could be leaking under some
conditions?
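
If it helps, I could try to monitor the grant usage while the guests are
under load, along these lines (assuming the 'g' debug key still dumps the
grant table usage):

  xl debug-keys g        # ask Xen to dump grant table usage
  xl dmesg | tail -n 40  # read the dump back from the hypervisor log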

--
Sander



Re: [Xen-devel] [PATCH v2] libxl: put RSDP for PVH guest near 4GB

2018-02-19 Thread Sander Eikelenboom
On 24/01/18 16:26, George Dunlap wrote:
> On Wed, Jan 24, 2018 at 3:20 PM, Juergen Gross  wrote:
>> On 24/01/18 16:07, George Dunlap wrote:
>>> On Wed, Jan 24, 2018 at 2:10 PM, Boris Ostrovsky
>>>  wrote:
 On 01/24/2018 07:06 AM, Juergen Gross wrote:
> On 24/01/18 11:54, Roger Pau Monné wrote:
>> On Wed, Jan 24, 2018 at 10:42:39AM +, George Dunlap wrote:
>>> On Wed, Jan 24, 2018 at 2:41 AM, Boris Ostrovsky
>>>  wrote:
 On 01/18/2018 05:33 AM, Wei Liu wrote:
> On Thu, Jan 18, 2018 at 11:31:32AM +0100, Juergen Gross wrote:
>> Wei,
>>
>> On 01/12/17 15:14, Juergen Gross wrote:
>>> Instead of locating the RSDP table below 1MB put it just below 4GB
>>> like the rest of the ACPI tables in case of PVH guests. This will
>>> avoid punching more holes than necessary into the memory map.
>>>
>>> Signed-off-by: Juergen Gross 
>>> Acked-by: Wei Liu 
>> Mind applying this one?
> Don't worry, it is in my queue.
>
> Will come to this and other patches I accumulated soon.
>
> Wei.
 This requires kernel changes, doesn't it?

 https://lists.xenproject.org/archives/html/xen-devel/2017-12/msg00714.html

 And this series apparently never made it to the tree.

 PVH guests are broken now on staging.
>>> And the Linux side of PVH is officially supported now, right?


 AFAIK PVH is still considered a tech preview --- Linux or Xen.
>>>
>>> From SUPPORT.md:
>>>
>>> ### x86/PVH guest
>>>
>>> Status: Supported
>>>
>>> I was under the impression that PVH guest in Linux was complete and
>>> stable as of Linux 4.11.  If that's not true it should have been
>>> brought up during the 4.10 development cycle, where we declared PVH
>>> domUs as "supported".
>>
>> So what is the problem here?
>>
>> - current Linux can't be booted as PVH guest with xen-unstable due to
>>   a bug in Linux, patches for Linux are being worked on
>> - booting Linux as PVH guest with xen 4.10 is working
> 
> I was responding to Boris's claim that PVH is considered tech preview.
> I can't say anything one way or the other about PVH in Linux, but PVH
> in Xen is definitely now considered supported.
> 
> My subsequent response to Roger ("FWIW I can buy this argument") was
> meant to indicate I didn't have any more objection to the approach you
> guys were planning on taking.
> 
>  -George

L.S.,

It seems I lost track: is there any progress on this issue?
(It doesn't seem a fix has landed in 4.16-rc2 yet.)

--
Sander



Re: [Xen-devel] [PATCH v2] libxl: put RSDP for PVH guest near 4GB

2018-02-19 Thread Sander Eikelenboom
On 19/02/18 11:16, Juergen Gross wrote:
> On 19/02/18 10:47, Sander Eikelenboom wrote:
>> On 24/01/18 16:26, George Dunlap wrote:
>>> On Wed, Jan 24, 2018 at 3:20 PM, Juergen Gross  wrote:
>>>> On 24/01/18 16:07, George Dunlap wrote:
>>>>> On Wed, Jan 24, 2018 at 2:10 PM, Boris Ostrovsky
>>>>>  wrote:
>>>>>> On 01/24/2018 07:06 AM, Juergen Gross wrote:
>>>>>>> On 24/01/18 11:54, Roger Pau Monné wrote:
>>>>>>>> On Wed, Jan 24, 2018 at 10:42:39AM +, George Dunlap wrote:
>>>>>>>>> On Wed, Jan 24, 2018 at 2:41 AM, Boris Ostrovsky
>>>>>>>>>  wrote:
>>>>>>>>>> On 01/18/2018 05:33 AM, Wei Liu wrote:
>>>>>>>>>>> On Thu, Jan 18, 2018 at 11:31:32AM +0100, Juergen Gross wrote:
>>>>>>>>>>>> Wei,
>>>>>>>>>>>>
>>>>>>>>>>>> On 01/12/17 15:14, Juergen Gross wrote:
>>>>>>>>>>>>> Instead of locating the RSDP table below 1MB put it just below 4GB
>>>>>>>>>>>>> like the rest of the ACPI tables in case of PVH guests. This will
>>>>>>>>>>>>> avoid punching more holes than necessary into the memory map.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Juergen Gross 
>>>>>>>>>>>>> Acked-by: Wei Liu 
>>>>>>>>>>>> Mind applying this one?
>>>>>>>>>>> Don't worry, it is in my queue.
>>>>>>>>>>>
>>>>>>>>>>> Will come to this and other patches I accumulated soon.
>>>>>>>>>>>
>>>>>>>>>>> Wei.
>>>>>>>>>> This requires kernel changes, doesn't it?
>>>>>>>>>>
>>>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2017-12/msg00714.html
>>>>>>>>>>
>>>>>>>>>> And this series apparently never made it to the tree.
>>>>>>>>>>
>>>>>>>>>> PVH guests are broken now on staging.
>>>>>>>>> And the Linux side of PVH is officially supported now, right?
>>>>>>
>>>>>>
>>>>>> AFAIK PVH is still considered a tech preview --- Linux or Xen.
>>>>>
>>>>> From SUPPORT.md:
>>>>>
>>>>> ### x86/PVH guest
>>>>>
>>>>> Status: Supported
>>>>>
>>>>> I was under the impression that PVH guest in Linux was complete and
>>>>> stable as of Linux 4.11.  If that's not true it should have been
>>>>> brought up during the 4.10 development cycle, where we declared PVH
>>>>> domUs as "supported".
>>>>
>>>> So what is the problem here?
>>>>
>>>> - current Linux can't be booted as PVH guest with xen-unstable due to
>>>>   a bug in Linux, patches for Linux are being worked on
>>>> - booting Linux as PVH guest with xen 4.10 is working
>>>
>>> I was responding to Boris's claim that PVH is considered tech preview.
>>> I can't say anything one way or the other about PVH in Linux, but PVH
>>> in Xen is definitely now considered supported.
>>>
>>> My subsequent response to Roger ("FWIW I can buy this argument") was
>>> meant to indicate I didn't have any more objection to the approach you
>>> guys were planning on taking.
>>>
>>>  -George
>>
>> L.S.,
>>
>> Seems I lost track, is there any progress on this issue ?
>> (doesn't seem a fix has landed in 4.16-rc2 yet).
> 
> Just sent a new patch series.

Just tested and it works fine here.

--
Sander

> 
> Juergen
> 



[Xen-devel] Guest soft lockups with "xen: make xen_qlock_wait() nestable"

2018-11-07 Thread Sander Eikelenboom
Hi Juergen / Boris,

Last week I tested Linux kernel 4.19.0 stable with the Xen "for-linus-4.20"
branch pulled on top.
Unfortunately I was seeing guests lock up after some time; see below for
the logging from one of the guests, which I was able to capture.
Reverting "xen: make xen_qlock_wait() nestable"
(7250f6d35681dfc44749d90598a2d51a118ce2b8) made the lockups disappear.

These guests are stressed quite hard in both CPU and networking,
so they are probably more susceptible to locking issues.

The system is an AMD Phenom X6, running Xen-unstable.

Any ideas?

--
Sander


serveerstertje:~# xl console security
[ 6045.805396] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! 
[ml1:mon-front-i:20428]
[ 6045.826995] Modules linked in:
[ 6045.836310] CPU: 1 PID: 20428 Comm: ml1:mon-front-i Tainted: G L 
   4.19.0-20181022-doflr-xennext-vlan-ppp+ #1
[ 6045.865526] Hardware name: Xen HVM domU, BIOS 4.12-unstable 10/27/2018
[ 6045.882784] RIP: 0010:__pv_queued_spin_lock_slowpath+0xda/0x280
[ 6045.897019] Code: 44 41 bc 01 00 00 00 41 bd 00 01 00 00 3c 02 0f 94 c0 0f 
b6 c0 48 89 04 24 c6 45 44 00 ba 00 80 00 00 c6 43 01 01 eb 0b f3 90 <83> ea 01 
0f 84 49 01 00 00 0f b6 03 84 c0 75 ee 44 89 e8 f0 66 44
[ 6045.902111] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! 
[ml1:mon-front-i:20429]
[ 6045.945207] RSP: :c90003ba7d38 EFLAGS: 0202
[ 6045.965743] Modules linked in:
[ 6045.965748] CPU: 2 PID: 20429 Comm: ml1:mon-front-i Tainted: G L 
   4.19.0-20181022-doflr-xennext-vlan-ppp+ #1
[ 6045.965748] Hardware name: Xen HVM domU, BIOS 4.12-unstable 10/27/2018
[ 6045.965753] RIP: 0010:smp_call_function_many+0x1db/0x240
[ 6045.965756] Code: c7 e8 99 ab d1 00 3b 05 67 2e a0 01 0f 83 a3 fe ff ff 48 
63 d0 48 8b 0b 48 03 0c d5 00 47 8a 82 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 
83 e2 01 75 f6 eb c8 48 c7 c2 a0 e0 b3 82 48 89 ee 89 df
[ 6045.965757] RSP: :c90003bafc70 EFLAGS: 0202 ORIG_RAX: 
ff0c
[ 6045.980003]  ORIG_RAX: ff0c
[ 6045.988675] RAX:  RBX: 88007bd21d80 RCX: 88007bc25ae0
[ 6045.995387] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! 
[nc4:mon-front-i:3291]
[ 6045.995388] Modules linked in:
[ 6045.995392] CPU: 3 PID: 3291 Comm: nc4:mon-front-i Tainted: G L  
  4.19.0-20181022-doflr-xennext-vlan-ppp+ #1
[ 6045.995392] Hardware name: Xen HVM domU, BIOS 4.12-unstable 10/27/2018
[ 6045.995397] RIP: 0010:smp_call_function_many+0x1db/0x240
[ 6045.995400] Code: c7 e8 99 ab d1 00 3b 05 67 2e a0 01 0f 83 a3 fe ff ff 48 
63 d0 48 8b 0b 48 03 0c d5 00 47 8a 82 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 
83 e2 01 75 f6 eb c8 48 c7 c2 a0 e0 b3 82 48 89 ee 89 df
[ 6045.995401] RSP: :c90002307c70 EFLAGS: 0202 ORIG_RAX: 
ff0c
[ 6045.995402] RAX:  RBX: 88007bda1d80 RCX: 88007bc25c20
[ 6045.995403] RDX: 0001 RSI:  RDI: 88007bda1d88
[ 6045.995403] RBP: 88007bda1d88 R08:  R09: 
[ 6045.995404] R10:  R11: 0040 R12: 81057cb0
[ 6045.995404] R13: c90002307cc0 R14: 0001 R15: 0006
[ 6045.995413] FS:  7f4eafa0f700() GS:88007bd8() 
knlGS:
[ 6045.995414] CS:  0010 DS:  ES:  CR0: 80050033
[ 6045.995415] CR2: 7f4eaca182c0 CR3: 7462a000 CR4: 06e0
[ 6045.995416] Call Trace:
[ 6045.995422]  flush_tlb_mm_range+0xb7/0x120
[ 6045.995425]  ? ptep_clear_flush+0x30/0x40
[ 6045.995427]  ? mem_cgroup_throttle_swaprate+0x12/0x110
[ 6045.995429]  ? mem_cgroup_try_charge_delay+0x2c/0x40
[ 6045.995430]  ptep_clear_flush+0x30/0x40
[ 6045.995432]  wp_page_copy+0x311/0x6c0
[ 6045.995434]  do_wp_page+0x111/0x4c0
[ 6045.995435]  __handle_mm_fault+0x445/0xbd0
[ 6045.995437]  handle_mm_fault+0xf8/0x200
[ 6045.995438]  __do_page_fault+0x231/0x460
[ 6045.995441]  ? page_fault+0x8/0x30
[ 6045.995441]  page_fault+0x1e/0x30
[ 6045.995443] RIP: 0033:0x7f4ee1243476
[ 6045.995444] Code: 87 f3 c3 90 4c 8d 04 52 f3 0f 6f 06 f3 0f 6f 0c 16 f3 0f 
6f 14 56 f3 42 0f 6f 1c 06 66 0f 7f 07 66 0f 7f 0c 17 66 0f 7f 14 57 <66> 42 0f 
7f 1c 07 83 e9 04 48 8d 34 96 48 8d 3c 97 75 cb f3 c3 90
[ 6045.995445] RSP: 002b:7f4eafa0df38 EFLAGS: 00010202
[ 6045.995446] RAX:  RBX: 7f4ea8014500 RCX: 0008
[ 6045.995447] RDX: 0780 RSI: 7f4ea8287500 RDI: 7f4eaca16c40
[ 6045.995447] RBP: 0d80 R08: 1680 R09: 0360
[ 6045.995448] R10:  R11: 0440 R12: 
[ 6045.995448] R13:  R14: 0440 R15: 
[ 6046.017424] RAX: 0001 RBX: ea0001d194e8 RCX: 88007b06e000
[ 6046.017428] RDX: 3c23 RSI: 003b RDI: 0246
[ 6046.036278] RDX: 0001 RSI:  RDI: 88007bd21d88
[ 6046.036279] RBP: 88007bd21d88 R08: 0

Re: [Xen-devel] Guest soft lockups with "xen: make xen_qlock_wait() nestable"

2018-11-07 Thread Sander Eikelenboom
On 07/11/18 23:34, Boris Ostrovsky wrote:
> On 11/7/18 4:30 AM, Sander Eikelenboom wrote:
>> Hi Juergen / Boris,
>>
>> Last week i tested Linux kernel 4.19.0 stable with the Xen "for-linus-4.20" 
>> branch pulled on top.
>> Unfortunately i was seeing guests lockup after some time, see below for the 
>> logging from one of the guest
>> which i was able to capture.
>> Reverting "xen: make xen_qlock_wait() nestable" 
>> 7250f6d35681dfc44749d90598a2d51a118ce2b8,
>> made the lockups disappear.
>>
>> These guests are stressed quite hard in both CPU and networking, 
>> so they are probably more susceptible to locking issues.
>>
>> System is a AMD phenom x6, running Xen-unstable.
>>
>> Any ideas ?
> 
> 
> By any chance, is VMPU on?
> 
> 
> -boris
> 

I had to look up what that is :), but it seems only applicable to PV
guests, if I'm correct?

I'm only running PVH and HVM guests at the moment, except for dom0 of
course, which reports:
[0.941407] VPMU disabled by hypervisor.

These soft lockups were in an HVM guest (if I remember correctly, I have
seen a PVH guest lock up as well after a while, also a quite heavily
cpu/network stressed one).

--
Sander




[Xen-devel] OSStest: test-amd64-amd64-xl-qemuu-win10-i386 windows-install fail never pass.

2018-11-07 Thread Sander Eikelenboom
Hi Ian,

I just tested a manual install of a Win10-1703-x86 ISO, with roughly the
same xen config as OSStest uses. The install succeeds in about an hour on
my machine, with networking (DHCP) up and running, without the need for
any extra drivers. The config I used boils down to the sketch below.
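(The disk paths, sizes, and bridge name here are mine, not the exact
osstest ones; this is just to show the shape of it:)

  # win10.cfg
  name    = "win10"
  builder = "hvm"
  memory  = 4096
  vcpus   = 2
  disk    = [ 'phy:/dev/vg0/win10,hda,w',
              'file:/data/iso/Win10-1703-x86.iso,hdc:cdrom,r' ]
  vif     = [ 'bridge=xenbr0' ]
  vnc     = 1

  xl create win10.cfg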

So I haven't run into the issue where OSStest seems stuck getting the
network up:
http://logs.test-lab.xenproject.org/osstest/logs/129426/test-amd64-amd64-xl-qemuu-win10-i386/baroque1_win.guest.osstest-vnc.jpeg

So that seems to leave the unattended install image itself as the most
prominent culprit candidate. But since the unattended install image isn't
public, I can't help you any further with that.

Since quite some OSStest resources are spent on these tests, I think it's
worthwhile to either have them fixed, or to disable them for the time being
if you can't spare some time to look into it.

--
Sander 




Re: [Xen-devel] Guest soft lockups with "xen: make xen_qlock_wait() nestable"

2018-11-08 Thread Sander Eikelenboom
On 08/11/18 08:08, Juergen Gross wrote:
> On 07/11/2018 10:30, Sander Eikelenboom wrote:
>> Hi Juergen / Boris,
>>
>> Last week i tested Linux kernel 4.19.0 stable with the Xen "for-linus-4.20" 
>> branch pulled on top.
>> Unfortunately i was seeing guests lockup after some time, see below for the 
>> logging from one of the guest
>> which i was able to capture.
>> Reverting "xen: make xen_qlock_wait() nestable" 
>> 7250f6d35681dfc44749d90598a2d51a118ce2b8,
>> made the lockups disappear.
>>
>> These guests are stressed quite hard in both CPU and networking, 
>> so they are probably more susceptible to locking issues.
>>
>> System is a AMD phenom x6, running Xen-unstable.
>>
>> Any ideas ?
> 
> Just checked the hypervisor again: it seems a pending interrupt for a
> HVM/PVH vcpu won't let SCHEDOP_poll return in case interrupts are
> disabled.
> 
> I need to rework the patch for that scenario. Until then I'll revert
> it.

Thanks for looking into it.

--
Sander

> 
> Juergen
> 



Re: [Xen-devel] Guest soft lockups with "xen: make xen_qlock_wait() nestable"

2018-11-08 Thread Sander Eikelenboom
On 08/11/18 09:18, Juergen Gross wrote:
> On 08/11/2018 09:14, Sander Eikelenboom wrote:
>> On 08/11/18 08:08, Juergen Gross wrote:
>>> On 07/11/2018 10:30, Sander Eikelenboom wrote:
>>>> Hi Juergen / Boris,
>>>>
>>>> Last week I tested Linux kernel 4.19.0 stable with the Xen
>>>> "for-linus-4.20" branch pulled on top.
>>>> Unfortunately I was seeing guests lock up after some time; see below for
>>>> the logging from one of the guests
>>>> which I was able to capture.
>>>> Reverting "xen: make xen_qlock_wait() nestable"
>>>> 7250f6d35681dfc44749d90598a2d51a118ce2b8
>>>> made the lockups disappear.
>>>>
>>>> These guests are stressed quite hard in both CPU and networking,
>>>> so they are probably more susceptible to locking issues.
>>>>
>>>> The system is an AMD phenom x6, running Xen-unstable.
>>>>
>>>> Any ideas ?
>>>
>>> Just checked the hypervisor again: it seems a pending interrupt for a
>>> HVM/PVH vcpu won't let SCHEDOP_poll return in case interrupts are
>>> disabled.
>>>
>>> I need to rework the patch for that scenario. Until then I'll revert
>>> it.
>>
>> Thanks for looking into it.
> 
> Could you try the attached patch (on top of 7250f6d35681df)?

That blows up while booting the guest:

[1.792870] installing Xen timer for CPU 1
[1.796171] x86: Booting SMP configuration:
[1.799410]  node  #0, CPUs:  #1
[1.882922] cpu 1 spinlock event irq 59
[1.899446] installing Xen timer for CPU 2
[1.902864]  #2
[1.986248] cpu 2 spinlock event irq 65
[1.996200] installing Xen timer for CPU 3
[1.999522]  #3
[2.082921] cpu 3 spinlock event irq 71
[2.092749] smp: Brought up 1 node, 4 CPUs
[2.096079] smpboot: Max logical packages: 1
[2.099410] smpboot: Total of 4 processors activated (25688.36 BogoMIPS)
[2.102893] BUG: unable to handle kernel paging request at 00014f90
[2.106063] PGD 0 P4D 0 
[2.106063] Oops: 0002 [#1] SMP NOPTI
[2.106063] CPU: 1 PID: 16 Comm: migration/1 Not tainted 
4.19.0-20181108-doflr-xennext-vlan-ppp-blkmq-qlockpatch+ #1
[2.106063] Hardware name: Xen HVM domU, BIOS 4.12-unstable 10/30/2018
[2.106063] RIP: 0010:xen_qlock_wait+0x23/0x70
[2.106063] Code: 1f 84 00 00 00 00 00 55 53 48 83 ec 08 65 8b 2d 63 33 ff 
7e 83 fd ff 74 32 65 8b 05 47 3f ff 7e a9 00 00 10 00 75 24 48 89 fb  ff 05 
36 33 ff 7e 8b 05 30 33 ff 7e 83 f8 01 74 16 0f b6 03 40
[2.106063] RSP: 0018:c96d3dc0 EFLAGS: 00010046
[2.106063] RAX: 8001 RBX: 831a5a68 RCX: 0008
[2.106063] RDX: 88010f7ef700 RSI: 0003 RDI: 831a5a68
[2.106063] RBP: 003b R08: 0008 R09: 006c
[2.106063] R10:  R11:  R12: 0001
[2.106063] R13: 0100 R14:  R15: 0008
[2.106063] FS:  () GS:88010b28() 
knlGS:
[2.106063] CS:  0010 DS:  ES:  CR0: 80050033
[2.106063] CR2: 00014f90 CR3: 02a24000 CR4: 06e0
[2.106063] Call Trace:
[2.106063]  ? __switch_to_asm+0x40/0x70
[2.106063]  __pv_queued_spin_lock_slowpath+0x248/0x280
[2.106063]  _raw_spin_lock+0x18/0x20
[2.106063]  prepare_set+0xc/0x90
[2.106063]  generic_set_all+0x26/0x2e0
[2.106063]  ? __switch_to_asm+0x40/0x70
[2.106063]  mtrr_rendezvous_handler+0x34/0x60
[2.106063]  multi_cpu_stop+0xb6/0xe0
[2.106063]  ? cpu_stop_queue_work+0xd0/0xd0
[2.106063]  cpu_stopper_thread+0x86/0x100
[2.106063]  smpboot_thread_fn+0x109/0x160
[2.106063]  kthread+0xee/0x120
[2.106063]  ? sort_range+0x20/0x20
[2.106063]  ? kthread_park+0x80/0x80
[2.106063]  ret_from_fork+0x22/0x40
[2.106063] Modules linked in:
[2.106063] CR2: 00014f90
[2.106063] BUG: unable to handle kernel paging request at 00014f90
[2.106063] ---[ end trace e5be82cfc3e40a5e ]---
[2.106063] PGD 0 
[2.106063] RIP: 0010:xen_qlock_wait+0x23/0x70
[2.106063] P4D 0 
[2.106063] Code: 1f 84 00 00 00 00 00 55 53 48 83 ec 08 65 8b 2d 63 33 ff 
7e 83 fd ff 74 32 65 8b 05 47 3f ff 7e a9 00 00 10 00 75 24 48 89 fb  ff 05 
36 33 ff 7e 8b 05 30 33 ff 7e 83 f8 01 74 16 0f b6 03 40
[2.106063] Oops: 0002 [#2] SMP NOPTI
[2.106063] RSP: 0018:c96d3dc0 EFLAGS: 00010046


> 
> Juergen
> 



Re: [Xen-devel] Guest soft lockups with "xen: make xen_qlock_wait() nestable"

2018-11-08 Thread Sander Eikelenboom
On 08/11/18 11:18, Juergen Gross wrote:
> On 08/11/2018 10:57, Sander Eikelenboom wrote:
>> On 08/11/18 09:18, Juergen Gross wrote:
>>> On 08/11/2018 09:14, Sander Eikelenboom wrote:
>>>> On 08/11/18 08:08, Juergen Gross wrote:
>>>>> On 07/11/2018 10:30, Sander Eikelenboom wrote:
>>>>>> Hi Juergen / Boris,
>>>>>>
>>>>>> Last week I tested Linux kernel 4.19.0 stable with the Xen
>>>>>> "for-linus-4.20" branch pulled on top.
>>>>>> Unfortunately I was seeing guests lock up after some time; see below for
>>>>>> the logging from one of the guests
>>>>>> which I was able to capture.
>>>>>> Reverting "xen: make xen_qlock_wait() nestable"
>>>>>> 7250f6d35681dfc44749d90598a2d51a118ce2b8
>>>>>> made the lockups disappear.
>>>>>>
>>>>>> These guests are stressed quite hard in both CPU and networking,
>>>>>> so they are probably more susceptible to locking issues.
>>>>>>
>>>>>> The system is an AMD phenom x6, running Xen-unstable.
>>>>>>
>>>>>> Any ideas ?
>>>>>
>>>>> Just checked the hypervisor again: it seems a pending interrupt for a
>>>>> HVM/PVH vcpu won't let SCHEDOP_poll return in case interrupts are
>>>>> disabled.
>>>>>
>>>>> I need to rework the patch for that scenario. Until then I'll revert
>>>>> it.
>>>>
>>>> Thanks for looking into it.
>>>
>>> Could you try the attached patch (on top of 7250f6d35681df)?
>>
>> That blows up while booting the guest:
> 
> Oh, sorry. Of course it does. Dereferencing a percpu variable
> directly can't work. How silly of me.
> 
> The attached variant should repair that. Tested to not break booting.
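
For context: that is the classic per-CPU access pitfall. The bare
DEFINE_PER_CPU symbol is only an offset into the per-CPU section, not a
valid pointer, which matches the bogus low address (00014f90) in the fault
above. A minimal sketch of the wrong and the fixed pattern (illustrative
only, not the actual patch):

#include <linux/percpu.h>
#include <linux/atomic.h>

static DEFINE_PER_CPU(atomic_t, xen_qlock_wait_nest);

static void nest_enter(void)
{
    /*
     * Wrong: taking the address of the bare symbol dereferences a
     * bogus address at runtime:
     *
     * atomic_inc(&xen_qlock_wait_nest);
     */

    /* Right: resolve the current CPU's instance first. */
    atomic_t *nest_cnt = this_cpu_ptr(&xen_qlock_wait_nest);

    atomic_inc(nest_cnt);
}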

This one boots. Will report back either when I find issues or
when I'm comfortable enough to give a "Tested-by" in a few days.

Thanks again.

--
Sander


> 
> Juergen
> 



Re: [Xen-devel] [PATCH] xen: fix xen_qlock_wait()

2018-11-09 Thread Sander Eikelenboom
On 09/11/18 13:04, Juergen Gross wrote:
> Commit a856531951dc80 ("xen: make xen_qlock_wait() nestable")
> introduced a regression for Xen guests running fully virtualized
> (HVM or PVH mode). The Xen hypervisor wouldn't return from the poll
> hypercall with interrupts disabled in case of an interrupt (for PV
> guests it does).
> 
> So instead of disabling interrupts in xen_qlock_wait() use a nesting
> counter to avoid calling xen_clear_irq_pending() in case
> xen_qlock_wait() is nested.
> 
> Fixes: a856531951dc80 ("xen: make xen_qlock_wait() nestable")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Juergen Gross 

Although you don't seem too interested, you can stick on a:
Tested-by: Sander Eikelenboom 
if you like.

--
Sander

> ---
>  arch/x86/xen/spinlock.c | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
> index 441c88262169..1c8a8816a402 100644
> --- a/arch/x86/xen/spinlock.c
> +++ b/arch/x86/xen/spinlock.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -21,6 +22,7 @@
>  
>  static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
>  static DEFINE_PER_CPU(char *, irq_name);
> +static DEFINE_PER_CPU(atomic_t, xen_qlock_wait_nest);
>  static bool xen_pvspin = true;
>  
>  static void xen_qlock_kick(int cpu)
> @@ -39,25 +41,25 @@ static void xen_qlock_kick(int cpu)
>   */
>  static void xen_qlock_wait(u8 *byte, u8 val)
>  {
> - unsigned long flags;
>   int irq = __this_cpu_read(lock_kicker_irq);
> + atomic_t *nest_cnt = this_cpu_ptr(&xen_qlock_wait_nest);
>  
>   /* If kicker interrupts not initialized yet, just spin */
>   if (irq == -1 || in_nmi())
>   return;
>  
> - /* Guard against reentry. */
> - local_irq_save(flags);
> + /* Detect reentry. */
> + atomic_inc(nest_cnt);
>  
> - /* If irq pending already clear it. */
> - if (xen_test_irq_pending(irq)) {
> + /* If irq pending already and no nested call clear it. */
> + if (atomic_read(nest_cnt) == 1 && xen_test_irq_pending(irq)) {
>   xen_clear_irq_pending(irq);
>   } else if (READ_ONCE(*byte) == val) {
>   /* Block until irq becomes pending (or a spurious wakeup) */
>   xen_poll_irq(irq);
>   }
>  
> - local_irq_restore(flags);
> + atomic_dec(nest_cnt);
>  }
>  
>  static irqreturn_t dummy_handler(int irq, void *dev_id)
> 



Re: [Xen-devel] [PATCH] xen: fix xen_qlock_wait()

2018-11-09 Thread Sander Eikelenboom
On 09/11/18 16:20, Juergen Gross wrote:
> On 09/11/2018 16:02, Sander Eikelenboom wrote:
>> On 09/11/18 13:04, Juergen Gross wrote:
>>> Commit a856531951dc80 ("xen: make xen_qlock_wait() nestable")
>>> introduced a regression for Xen guests running fully virtualized
>>> (HVM or PVH mode). The Xen hypervisor wouldn't return from the poll
>>> hypercall with interrupts disabled in case of an interrupt (for PV
>>> guests it does).
>>>
>>> So instead of disabling interrupts in xen_qlock_wait() use a nesting
>>> counter to avoid calling xen_clear_irq_pending() in case
>>> xen_qlock_wait() is nested.
>>>
>>> Fixes: a856531951dc80 ("xen: make xen_qlock_wait() nestable")
>>> Cc: sta...@vger.kernel.org
>>> Signed-off-by: Juergen Gross 
>>
>> Although you don't seem too interested, you can stick on a:
>> Tested-by: Sander Eikelenboom 
>> if you like.
> 
> I am interested.
> 
> OTOH I wanted to post the patch officially to give others the chance to
> send remarks.

OK, it would be nice to at least be CC'ed on patches going upstream when
you have been the one reporting the issue.

--
Sander


> 
> Juergen
> 
>>
>> --
>> Sander
>>
>>> ---
>>>  arch/x86/xen/spinlock.c | 14 --
>>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
>>> index 441c88262169..1c8a8816a402 100644
>>> --- a/arch/x86/xen/spinlock.c
>>> +++ b/arch/x86/xen/spinlock.c
>>> @@ -9,6 +9,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  #include 
>>>  #include 
>>> @@ -21,6 +22,7 @@
>>>  
>>>  static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
>>>  static DEFINE_PER_CPU(char *, irq_name);
>>> +static DEFINE_PER_CPU(atomic_t, xen_qlock_wait_nest);
>>>  static bool xen_pvspin = true;
>>>  
>>>  static void xen_qlock_kick(int cpu)
>>> @@ -39,25 +41,25 @@ static void xen_qlock_kick(int cpu)
>>>   */
>>>  static void xen_qlock_wait(u8 *byte, u8 val)
>>>  {
>>> -   unsigned long flags;
>>> int irq = __this_cpu_read(lock_kicker_irq);
>>> +   atomic_t *nest_cnt = this_cpu_ptr(&xen_qlock_wait_nest);
>>>  
>>> /* If kicker interrupts not initialized yet, just spin */
>>> if (irq == -1 || in_nmi())
>>> return;
>>>  
>>> -   /* Guard against reentry. */
>>> -   local_irq_save(flags);
>>> +   /* Detect reentry. */
>>> +   atomic_inc(nest_cnt);
>>>  
>>> -   /* If irq pending already clear it. */
>>> -   if (xen_test_irq_pending(irq)) {
>>> +   /* If irq pending already and no nested call clear it. */
>>> +   if (atomic_read(nest_cnt) == 1 && xen_test_irq_pending(irq)) {
>>> xen_clear_irq_pending(irq);
>>> } else if (READ_ONCE(*byte) == val) {
>>> /* Block until irq becomes pending (or a spurious wakeup) */
>>> xen_poll_irq(irq);
>>> }
>>>  
>>> -   local_irq_restore(flags);
>>> +   atomic_dec(nest_cnt);
>>>  }
>>>  
>>>  static irqreturn_t dummy_handler(int irq, void *dev_id)
>>>
>>
>>
> 



Re: [Xen-devel] [Minios-devel] FOSDEM Devrooms (CfP deadlines for relevant DevRooms from Dec 1-10) and Xen Project Stand

2018-11-23 Thread Sander Eikelenboom
On 23/11/18 14:34, Lars Kurth wrote:
> FYI: no Xen Project booth at FOSDEM this year

Bummer, no fresh T-shirt :(.

--
Sander


Re: [Xen-devel] Linux 4.15-rc6 + xen-unstable: BUG: unable to handle kernel NULL pointer dereference at (null), [ 0.000000] IP: zero_resv_unavail+0x8e/0xe1

2018-01-04 Thread Sander Eikelenboom
On 04/01/18 12:44, Juergen Gross wrote:
> On 04/01/18 11:17, Sander Eikelenboom wrote:
>> Hi Boris / Juergen,
>>
>> First of all, best wishes for the new year, which is off to quite a
>> turbulent start.
>>
>> Now that the holidays are over I finally got around to testing a linux
>> 4.15-rc6 kernel and experienced a crash in early dom0 boot on my system
>> (AMD phenom x6).
>>
>> I tested some earlier linux 4.15 rc's but experienced crashes then as well,
>> but didn't have time to set up a serial console to send them in
>> (and waited to see if the issue Boris fixed with AMD PCI 64bit BARs could
>> be it).
>>
>> But since that patch went in before 4.15 rc6, that doesn't seem to be the
>> issue.
>> So it could be that the culprit went in pretty early in the 4.15 cycle.
>>
>> The 4.15-rc6 kernel boots fine on bare metal, as does a 4.14.6 kernel on
>> xen-unstable.
>>
>> Hopefully you have a pointer to what is wrong; if not I can try to do a
>> bisect.
> 
> A bisect would be very welcome.

Hi Juergen / Boris / Pavel,

Bisection result is:

a4a3ede2132ae0863e2d43e06f9b5697c51a7a3b is the first bad commit
commit a4a3ede2132ae0863e2d43e06f9b5697c51a7a3b
Author: Pavel Tatashin 
Date:   Wed Nov 15 17:36:31 2017 -0800

mm: zero reserved and unavailable struct pages

Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by
going through __init_single_page().

In some cases these struct pages are accessed even if they do not
contain any data.  One example is page_to_pfn() might access page->flags
if this is where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).

One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the
exiting memory from pfn 1 (i.e.  KVM).

Since struct pages are zeroed in __init_single_page(), and not during
allocation time, we must zero such struct pages explicitly.

The patch involves adding a new memblock iterator:
for_each_resv_unavail_range(i, p_start, p_end)

Which iterates through reserved && !memory lists, and we zero struct pages
explicitly by calling mm_zero_struct_page().
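
A sketch of what that zeroing pass looks like, reconstructed from the
description above (the actual code went into mm/page_alloc.c and may
differ in detail):

#include <linux/memblock.h>
#include <linux/mm.h>

void __init zero_resv_unavail(void)
{
    phys_addr_t start, end;
    unsigned long pfn;
    u64 i, pgcnt = 0;

    /* Walk ranges that are reserved but have no physical backing. */
    for_each_resv_unavail_range(i, &start, &end) {
        for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++)
            mm_zero_struct_page(pfn_to_page(pfn));
        pgcnt += PFN_UP(end) - PFN_DOWN(start);
    }

    /* This count is what reports e.g. the "98 unavailable pages" below. */
    if (pgcnt)
        pr_info("Reserved but unavailable: %llu pages\n",
                (unsigned long long)pgcnt);
}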

===

Here is more detailed example of problem that this patch is addressing:

Run tested on qemu with the following arguments:

-enable-kvm -cpu kvm64 -m 512 -smp 2

This patch reports that there are 98 unavailable pages.

They are: pfn 0 and pfns in range [159, 255].

Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does
not reserve [159, 255] ones.

e820__memblock_setup() reports linux that the following physical ranges are
available:
[1 , 158]
[256, 130783]

Notice, that exactly unavailable pfns are missing!

Now, lets check what we have in zone 0: [1, 131039]

pfn 0, is not part of the zone, but pfns [1, 158], are.

However, the bigger problem we have if we do not initialize these struct
pages is with memory hotplug.  Because, that path operates at 2M
boundaries (section_nr).  And checks if 2M range of pages is hot
removable.  It starts with first pfn from zone, rounds it down to 2M
boundary (struct pages are allocated at 2M boundaries when vmemmap is
created), and checks if that section is hot removable.  In this case
start with pfn 1 and convert it down to pfn 0.  Later pfn is converted
to struct page, and some fields are checked.  Now, if we do not zero
struct pages, we get unpredictable results.

In fact when CONFIG_VM_DEBUG is enabled, and we explicitly set all
vmemmap memory to ones, the following panic is observed with kernel test
without this patch applied:

  BUG: unable to handle kernel NULL pointer dereference at  (null)
  IP: is_pageblock_removable_nolock+0x35/0x90
  PGD 0 P4D 0
  Oops:  [#1] PREEMPT
  ...
  task: 88001f4e2900 task.stack: c9314000
  RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
  Call Trace:
   ? is_mem_section_removable+0x5a/0xd0
   show_mem_removable+0x6b/0xa0
   dev_attr_show+0x1b/0x50
   sysfs_kf_seq_show+0xa1/0x100
   kernfs_seq_show+0x22/0x30
   seq_read+0x1ac/0x3a0
   kernfs_fop_read+0x36/0x190
   ? security_file_permission+0x90/0xb0
   __vfs_read+0x16/0x30
   vfs_read+0x81/0x130
   SyS_read+0x44/0xa0
   entry_SYSCALL_64_fastpath+0x1f/0xbd

Link: 
http://lkml.kernel.org/r/20171013173214.27300-7-pasha.tatas...@oracle.com
Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven 

Re: [Xen-devel] Linux 4.15-rc6 + xen-unstable: BUG: unable to handle kernel NULL pointer dereference at (null), [ 0.000000] IP: zero_resv_unavail+0x8e/0xe1

2018-01-09 Thread Sander Eikelenboom
Since it's already rc7:
"Give me a subtle ping, Vasili. One subtle ping only, please."

On 04/01/18 21:02, Sander Eikelenboom wrote:
> On 04/01/18 12:44, Juergen Gross wrote:
>> On 04/01/18 11:17, Sander Eikelenboom wrote:
>>> Hi Boris / Juergen,
>>>
>>> First of all, best wishes for the new year, which is off to quite a
>>> turbulent start.
>>>
>>> Now that the holidays are over I finally got around to testing a linux
>>> 4.15-rc6 kernel and experienced a crash in early dom0 boot on my system
>>> (AMD phenom x6).
>>>
>>> I tested some earlier linux 4.15 rc's but experienced crashes then as well,
>>> but didn't have time to set up a serial console to send them in
>>> (and waited to see if the issue Boris fixed with AMD PCI 64bit BARs could
>>> be it).
>>>
>>> But since that patch went in before 4.15 rc6, that doesn't seem to be the
>>> issue.
>>> So it could be that the culprit went in pretty early in the 4.15 cycle.
>>>
>>> The 4.15-rc6 kernel boots fine on bare metal, as does a 4.14.6 kernel on
>>> xen-unstable.
>>>
>>> Hopefully you have a pointer to what is wrong; if not I can try to do a
>>> bisect.
>>
>> A bisect would be very welcome.
> 
> Hi Juergen / Boris / Pavel,
> 
> Bisection result is:
> 
> a4a3ede2132ae0863e2d43e06f9b5697c51a7a3b is the first bad commit
> commit a4a3ede2132ae0863e2d43e06f9b5697c51a7a3b
> Author: Pavel Tatashin 
> Date:   Wed Nov 15 17:36:31 2017 -0800
> 
> mm: zero reserved and unavailable struct pages
> 
> Some memory is reserved but unavailable: not present in memblock.memory
> (because not backed by physical pages), but present in memblock.reserved.
> Such memory has backing struct pages, but they are not initialized by
> going through __init_single_page().
> 
> In some cases these struct pages are accessed even if they do not
> contain any data.  One example is page_to_pfn() might access page->flags
> if this is where section information is stored (CONFIG_SPARSEMEM,
> SECTION_IN_PAGE_FLAGS).
> 
> One example of such memory: trim_low_memory_range() unconditionally
> reserves from pfn 0, but e820__memblock_setup() might provide the
> exiting memory from pfn 1 (i.e.  KVM).
> 
> Since struct pages are zeroed in __init_single_page(), and not during
> allocation time, we must zero such struct pages explicitly.
> 
> The patch involves adding a new memblock iterator:
> for_each_resv_unavail_range(i, p_start, p_end)
> 
> Which iterates through reserved && !memory lists, and we zero struct pages
> explicitly by calling mm_zero_struct_page().
> 
> ===
> 
> Here is more detailed example of problem that this patch is addressing:
> 
> Run tested on qemu with the following arguments:
> 
> -enable-kvm -cpu kvm64 -m 512 -smp 2
> 
> This patch reports that there are 98 unavailable pages.
> 
> They are: pfn 0 and pfns in range [159, 255].
> 
> Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does
> not reserve [159, 255] ones.
> 
> e820__memblock_setup() reports linux that the following physical ranges 
> are
> available:
> [1 , 158]
> [256, 130783]
> 
> Notice, that exactly unavailable pfns are missing!
> 
> Now, lets check what we have in zone 0: [1, 131039]
> 
> pfn 0, is not part of the zone, but pfns [1, 158], are.
> 
> However, the bigger problem we have if we do not initialize these struct
> pages is with memory hotplug.  Because, that path operates at 2M
> boundaries (section_nr).  And checks if 2M range of pages is hot
> removable.  It starts with first pfn from zone, rounds it down to 2M
> boundary (struct pages are allocated at 2M boundaries when vmemmap is
> created), and checks if that section is hot removable.  In this case
> start with pfn 1 and convert it down to pfn 0.  Later pfn is converted
> to struct page, and some fields are checked.  Now, if we do not zero
> struct pages, we get unpredictable results.
> 
> In fact when CONFIG_VM_DEBUG is enabled, and we explicitly set all
> vmemmap memory to ones, the following panic is observed with kernel test
> without this patch applied:
> 
>   BUG: unable to handle kernel NULL pointer dereference at  (null)
>   IP: is_pageblock_removable_nolock+0x35/0x90
>   PGD 0 P4D 0
>   Oops:  [#1] PREEMPT
>   ...
>   task: 88001f

Re: [Xen-devel] Linux 4.15-rc6 + xen-unstable: BUG: unable to handle kernel NULL pointer dereference at (null), [ 0.000000] IP: zero_resv_unavail+0x8e/0xe1

2018-01-09 Thread Sander Eikelenboom
On 09/01/18 17:16, Pavel Tatashin wrote:
> Hi Juergen,
> 
> Do you have this patch applied:
> 
> https://github.com/torvalds/linux/commit/e8c24773d6b2cd9bc8b36bd6e60beff599be14be

Seems this hasn't made it to Linus yet ?

I will give it a test and report back, thanks !

> 
> Thank you,
> Pavel
> 
> On 01/09/2018 11:10 AM, Juergen Gross wrote:
>> On 09/01/18 16:29, Sander Eikelenboom wrote:
>>> Since it's already rc7:
>>> "Give me a subtle ping, Vasili. One subtle ping only, please."
>>
>> I like that film :-)
:)

--
Sander

>> Pavel, can you please comment? Do you have an idea how to repair the
>> issue or should we revert your patch in 4.15?
>>
>>
>> Juergen
>>
>>>
>>> On 04/01/18 21:02, Sander Eikelenboom wrote:
>>>> On 04/01/18 12:44, Juergen Gross wrote:
>>>>> On 04/01/18 11:17, Sander Eikelenboom wrote:
>>>>>> Hi Boris / Juergen,
>>>>>>
>>>>>> First of all, best wishes for the new year, which is off to quite a
>>>>>> turbulent start.
>>>>>>
>>>>>> Now that the holidays are over I finally got around to testing a linux
>>>>>> 4.15-rc6 kernel
>>>>>> and experienced a crash in early dom0 boot on my system (AMD phenom x6).
>>>>>>
>>>>>> I tested some earlier linux 4.15 rc's but experienced crashes then as
>>>>>> well,
>>>>>> but didn't have time to set up a serial console to send them in
>>>>>> (and waited to see if the issue Boris fixed with AMD PCI 64bit BARs
>>>>>> could be it).
>>>>>>
>>>>>> But since that patch went in before 4.15 rc6, that doesn't seem to be
>>>>>> the issue.
>>>>>> So it could be that the culprit went in pretty early in the 4.15 cycle.
>>>>>>
>>>>>> The 4.15-rc6 kernel boots fine on bare metal, as does a 4.14.6 kernel on
>>>>>> xen-unstable.
>>>>>>
>>>>>> Hopefully you have a pointer to what is wrong; if not I can try to do a
>>>>>> bisect.
>>>>>
>>>>> A bisect would be very welcome.
>>>>
>>>> Hi Juergen / Boris / Pavel,
>>>>
>>>> Bisection result is:
>>>>
>>>> a4a3ede2132ae0863e2d43e06f9b5697c51a7a3b is the first bad commit
>>>> commit a4a3ede2132ae0863e2d43e06f9b5697c51a7a3b
>>>> Author: Pavel Tatashin 
>>>> Date:   Wed Nov 15 17:36:31 2017 -0800
>>>>
>>>>  mm: zero reserved and unavailable struct pages
>>>>  
>>>>  Some memory is reserved but unavailable: not present in 
>>>> memblock.memory
>>>>  (because not backed by physical pages), but present in 
>>>> memblock.reserved.
>>>>  Such memory has backing struct pages, but they are not initialized by
>>>>  going through __init_single_page().
>>>>  
>>>>  In some cases these struct pages are accessed even if they do not
>>>>  contain any data.  One example is page_to_pfn() might access 
>>>> page->flags
>>>>  if this is where section information is stored (CONFIG_SPARSEMEM,
>>>>  SECTION_IN_PAGE_FLAGS).
>>>>  
>>>>  One example of such memory: trim_low_memory_range() unconditionally
>>>>  reserves from pfn 0, but e820__memblock_setup() might provide the
>>>>  exiting memory from pfn 1 (i.e.  KVM).
>>>>  
>>>>  Since struct pages are zeroed in __init_single_page(), and not during
>>>>  allocation time, we must zero such struct pages explicitly.
>>>>  
>>>>  The patch involves adding a new memblock iterator:
>>>>  for_each_resv_unavail_range(i, p_start, p_end)
>>>>  
>>>>  Which iterates through reserved && !memory lists, and we zero struct 
>>>> pages
>>>>  explicitly by calling mm_zero_struct_page().
>>>>  
>>>>  ===
>>>>  
>>>>  Here is more detailed example of problem that this patch is 
>>>> addressing:
>>>>  
>>>>  Run tested on qemu with the following arguments:
>>>>  
>>>>  -enable-kvm -cpu kvm64 -m 512 -smp 2
>>>>  
>>>>  This patch reports that there are 98 unavailable pages.
>>>>   

Re: [Xen-devel] Linux 4.15-rc6 + xen-unstable: BUG: unable to handle kernel NULL pointer dereference at (null), [ 0.000000] IP: zero_resv_unavail+0x8e/0xe1

2018-01-09 Thread Sander Eikelenboom
On 09/01/18 17:38, Boris Ostrovsky wrote:
> On 01/09/2018 11:31 AM, Sander Eikelenboom wrote:
>> On 09/01/18 17:16, Pavel Tatashin wrote:
>>> Hi Juergen,
>>>
>>> Do you have this patch applied:
>>>
>>> https://github.com/torvalds/linux/commit/e8c24773d6b2cd9bc8b36bd6e60beff599be14be
>> Seems this hasn't made it to Linus yet ?

Hmm, that was a stupid remark, since the link actually is to Linus's
github repo :p (though not his git.kernel.org repo).

>> I will give it a test and report back, thanks !

Testing shows the patch helps and dom0 now boots fine.
Thanks !

> 
> 
> BTW, I assume this problem goes away if you don't specify dom0_mem?

Haven't tested, since I need the dom0_mem for pci-passthrough.

> -boris
> 

--
Sander


>>
>>> Thank you,
>>> Pavel
>>>
>>> On 01/09/2018 11:10 AM, Juergen Gross wrote:
>>>> On 09/01/18 16:29, Sander Eikelenboom wrote:
>>>>> Since it's already rc7:
>>>>> "Give me a subtle ping, Vasili. One subtle ping only, please."
>>>> I like that film :-)
>> :)
>>
>> --
>> Sander
>>
>>>> Pavel, can you please comment? Do you have an idea how to repair the
>>>> issue or should we revert your patch in 4.15?
>>>>
>>>>
>>>> Juergen
>>>>
>>>>> On 04/01/18 21:02, Sander Eikelenboom wrote:
>>>>>> On 04/01/18 12:44, Juergen Gross wrote:
>>>>>>> On 04/01/18 11:17, Sander Eikelenboom wrote:
>>>>>>>> Hi Boris / Juergen,
>>>>>>>>
>>>>>>>> First of all, best wishes for the new year, which is off to quite a
>>>>>>>> turbulent start.
>>>>>>>>
>>>>>>>> Now that the holidays are over I finally got around to testing a linux
>>>>>>>> 4.15-rc6 kernel
>>>>>>>> and experienced a crash in early dom0 boot on my system (AMD phenom
>>>>>>>> x6).
>>>>>>>>
>>>>>>>> I tested some earlier linux 4.15 rc's but experienced crashes then as
>>>>>>>> well,
>>>>>>>> but didn't have time to set up a serial console to send them in
>>>>>>>> (and waited to see if the issue Boris fixed with AMD PCI 64bit BARs
>>>>>>>> could be it).
>>>>>>>>
>>>>>>>> But since that patch went in before 4.15 rc6, that doesn't seem to be
>>>>>>>> the issue.
>>>>>>>> So it could be that the culprit went in pretty early in the 4.15
>>>>>>>> cycle.
>>>>>>>>
>>>>>>>> The 4.15-rc6 kernel boots fine on bare metal, as does a 4.14.6 kernel
>>>>>>>> on xen-unstable.
>>>>>>>>
>>>>>>>> Hopefully you have a pointer to what is wrong; if not I can try to do
>>>>>>>> a bisect.
>>>>>>> A bisect would be very welcome.
>>>>>> Hi Juergen / Boris / Pavel,
>>>>>>
>>>>>> Bisection result is:
>>>>>>
>>>>>> a4a3ede2132ae0863e2d43e06f9b5697c51a7a3b is the first bad commit
>>>>>> commit a4a3ede2132ae0863e2d43e06f9b5697c51a7a3b
>>>>>> Author: Pavel Tatashin 
>>>>>> Date:   Wed Nov 15 17:36:31 2017 -0800
>>>>>>
>>>>>>  mm: zero reserved and unavailable struct pages
>>>>>>  
>>>>>>  Some memory is reserved but unavailable: not present in 
>>>>>> memblock.memory
>>>>>>  (because not backed by physical pages), but present in 
>>>>>> memblock.reserved.
>>>>>>  Such memory has backing struct pages, but they are not initialized 
>>>>>> by
>>>>>>  going through __init_single_page().
>>>>>>  
>>>>>>  In some cases these struct pages are accessed even if they do not
>>>>>>  contain any data.  One example is page_to_pfn() might access 
>>>>>> page->flags
>>>>>>  if this is where section information is stored (CONFIG_SPARSEMEM,
>>>>>>  SECTION_IN_PAGE_FLAGS).
>>>>>>  
>>>>>>  One example of such memory: trim_low_memory_range() unconditionally
>>>>>>  reserves from pfn 0, 

Xen-unstable :can't boot HVM guests, bisected to commit: "hvmloader: indicate ACPI tables with "ACPI data" type in e820"

2020-10-10 Thread Sander Eikelenboom
Hi Igor/Jan,

I tried to update my AMD machine to current xen-unstable, but
unfortunately the HVM guests don't boot after that. The guest keeps
using CPU-cycles but I don't get to a command prompt (or any output at
all). PVH guests run fine.

Bisection leads to commit:

8efa46516c5f4cf185c8df179812c185d3c27eb6
hvmloader: indicate ACPI tables with "ACPI data" type in e820

I tried xen-unstable with this commit reverted and with that everything
works fine.

I attached the xl-dmesg output.

--
Sander
 __  ___  __  __ _
 \ \/ /___ _ __   | || |  / | ___|_   _ _ __  ___| |_ __ _| |__ | | ___
  \  // _ \ '_ \  | || |_ | |___ \ __| | | | '_ \/ __| __/ _` | '_ \| |/ _ \
  /  \  __/ | | | |__   _|| |___) |__| |_| | | | \__ \ || (_| | |_) | |  __/
 /_/\_\___|_| |_||_|(_)_|/\__,_|_| |_|___/\__\__,_|_.__/|_|\___|

(XEN) [001a935d33ae] Xen version 4.15-unstable (r...@dyndns.org) (gcc 
(Debian 8.3.0-6) 8.3.0) debug=y  Sat Oct 10 17:42:56 CEST 2020
(XEN) [001a9a35344a] Latest ChangeSet: Fri Oct 2 12:30:34 2020 +0200 
git:8a62dee9ce
(XEN) [001a9e9ef6a0] build-id: 517916349a46fdab2faadfefa7051928a1594796
(XEN) [001aa2709d53] Bootloader: GRUB 2.02+dfsg1-20+deb10u2
(XEN) [001aa5a9cace] Command line: dom0_mem=2048M,max:2048M loglvl=all 
guest_loglvl=all console_timestamps=datems vga=gfx-1280x1024x32 no-cpuidle 
com1=38400,8n1 console=vga,com1 ivrs_ioapic[6]=00:14.0 iommu=on,verbose,debug 
conring_size=128k ucode=scan sched=credit2 gnttab_max_frames=64 reboot=a
(XEN) [001ab4a3c9f3] Xen image load base address: 0
(XEN) [001ab7775ae3] Video information:
(XEN) [001ab9b2a306]  VGA is graphics mode 1280x1024, 32 bpp
(XEN) [001abcf88912]  VBE/DDC methods: none; EDID transfer time: 0 seconds
(XEN) [001ac0f0387e]  EDID info not retrieved because no DDC retrieval 
method detected
(XEN) [001ac5802e2e] Disc information:
(XEN) [001ac7aeb1c3]  Found 4 MBR signatures
(XEN) [001aca29768e]  Found 4 EDD information structures
(XEN) [001acd3c9a83] CPU Vendor: AMD, Family 16 (0x10), Model 10 (0xa), 
Stepping 0 (raw 00100fa0)
(XEN) [001ad2584e1b] Xen-e820 RAM map:
(XEN) [001ad486cf30]  [, 000963ff] (usable)
(XEN) [001ad8258d5b]  [00096400, 0009] (reserved)
(XEN) [001adbddb57b]  [000e4000, 000f] (reserved)
(XEN) [001adf95eaba]  [0010, c7f8] (usable)
(XEN) [001ae334a8b2]  [c7f9, c7f9dfff] (ACPI data)
(XEN) [001ae6f9844a]  [c7f9e000, c7fd] (ACPI NVS)
(XEN) [001aeab1a6f6]  [c7fe, c7ff] (reserved)
(XEN) [001aee69c5ea]  [ffe0, ] (reserved)
(XEN) [001af221e626]  [0001, 0006b7ff] (usable)
(XEN) [001afd7b593b] New Xen image base address: 0xc780
(XEN) [001b00b46a62] ACPI: RSDP 000FB100, 0014 (r0 ACPIAM)
(XEN) [001b03e1003b] ACPI: RSDT C7F9, 0048 (r1 MSIOEMSLIC  20100913 
MSFT   97)
(XEN) [001b08a3c463] ACPI: FACP C7F90200, 0084 (r1 7640MS A7640100 20100913 
MSFT   97)
(XEN) [001b0dbb] ACPI: DSDT C7F905E0, 9427 (r1  A7640 A7640100  100 
INTL 20051117)
(XEN) [001b12293fb6] ACPI: FACS C7F9E000, 0040
(XEN) [001b14bd51ea] ACPI: APIC C7F90390, 0088 (r1 7640MS A7640100 20100913 
MSFT   97)
(XEN) [001b198022fa] ACPI: MCFG C7F90420, 003C (r1 7640MS OEMMCFG  20100913 
MSFT   97)
(XEN) [001b1e42e08a] ACPI: SLIC C7F90460, 0176 (r1 MSIOEMSLIC  20100913 
MSFT   97)
(XEN) [001b2305a2d0] ACPI: OEMB C7F9E040, 0072 (r1 7640MS A7640100 20100913 
MSFT   97)
(XEN) [001b27c851ea] ACPI: SRAT C7F9A5E0, 0108 (r3 AMDFAM_F_102 
AMD 1)
(XEN) [001b2c8b11c0] ACPI: HPET C7F9A6F0, 0038 (r1 7640MS OEMHPET  20100913 
MSFT   97)
(XEN) [001b314de060] ACPI: IVRS C7F9A730, 0108 (r1  AMD RD890S   202031 
AMD 0)
(XEN) [001b3610939b] ACPI: SSDT C7F9A840, 0DA4 (r1 A M I  POWERNOW1 
AMD 1)
(XEN) [001b3ad35668] System RAM: 26623MB (27262104kB)
(XEN) [001b48bbc278] SRAT: PXM 0 -> APIC 00 -> Node 0
(XEN) [001b4ba8c92a] SRAT: PXM 0 -> APIC 01 -> Node 0
(XEN) [001b4e95dc3e] SRAT: PXM 0 -> APIC 02 -> Node 0
(XEN) [001b5182ca02] SRAT: PXM 0 -> APIC 03 -> Node 0
(XEN) [001b546feeb3] SRAT: PXM 0 -> APIC 04 -> Node 0
(XEN) [001b575cf4aa] SRAT: PXM 0 -> APIC 05 -> Node 0
(XEN) [001b5a49f5f6] SRAT: Node 0 PXM 0 0-a
(XEN) [001b5cead2e2] SRAT: Node 0 PXM 0 10-c800
(XEN) [001b5ff1399e] SRAT: Node 0 PXM 0 1-6b800
(XEN) [001b632a6bf3] NUMA: Allocated memnodemap from 6a4d2a000 - 6a4d31000
(XEN) [001b6721fc52] NUMA: Using 8 for the hash shift.
(XEN) [001bc77a7565] Domain heap initialised
(XEN) [001bc9f546a3] Allocated console ring of 128 KiB.
(XEN) [001be01a5586] vesafb: framebuffer at 0xfb00, mapped to 
0x82

Re: [SUSPECTED SPAM]Xen-unstable :can't boot HVM guests, bisected to commit: "hvmloader: indicate ACPI tables with "ACPI data" type in e820"

2020-10-11 Thread Sander Eikelenboom
On 11/10/2020 02:06, Igor Druzhinin wrote:
> On 10/10/2020 18:51, Sander Eikelenboom wrote:
>> Hi Igor/Jan,
>>
>> I tried to update my AMD machine to current xen-unstable, but
>> unfortunately the HVM guests don't boot after that. The guest keeps
>> using CPU-cycles but I don't get to a command prompt (or any output at
>> all). PVH guests run fine.
>>
>> Bisection leads to commit:
>>
>> 8efa46516c5f4cf185c8df179812c185d3c27eb6
>> hvmloader: indicate ACPI tables with "ACPI data" type in e820
>>
>> I tried xen-unstable with this commit reverted and with that everything
>> works fine.
>>
>> I attached the xl-dmesg output.
> 
> What guests are you using? 
Not sure I understand what you ask for, but:
dom0 PV
guest HVM (qemu-xen)

> Could you get serial output from the guest?
Not getting any, it seems to be stuck in very early boot.

> Is it AMD specific?
Can't tell, this is the only machine I test xen-unstable on.
It's an AMD phenom X6.
Both dom0 and guest kernel are 5.9-rc8.

Tested with guest config:
kernel  = '/boot/vmlinuz-xen-guest'
ramdisk = '/boot/initrd.img-xen-guest'

cmdline = 'root=UUID=7cc4a90d-d6b0-4958-bb7d-50497aa29f18 ro
nomodeset console=tty1 console=ttyS0 console=hvc0 earlyprintk=xen'

type='hvm'

device_model_version = 'qemu-xen'

cpus= "2-5"
vcpus = 2

memory  = '512'

disk= [
  'phy:/dev/xen_vms_ssd/media,xvda,w'
  ]

name= 'guest'

vif = [ 'bridge=xen_bridge,ip=192.168.1.10,mac=00:16:3E:DC:0A:F1' ]

on_poweroff = 'destroy'
on_reboot   = 'restart'
on_crash= 'preserve'

vnc=0


>If it's a Linux guest could you get a stacktrace from
> the guest using xenctx?

It is, here are few subsequent runs:

~# /usr/local/lib/xen/bin/xenctx -s
/boot/System.map-5.9.0-rc8-20201010-doflr-mac80211debug+ -f -a -C 4
vcpu0:
cs:eip: ca80:0256
flags: 0016 nz a p
ss:esp: :6f38
eax: 029e0012   ebx: fb00   ecx: 028484e3   edx: 0511
esi:    edi: f97b7363   ebp: 6f38
 ds: ca80es: 0010fs: gs: 

cr0: 0011
cr2: 
cr3: 0040
cr4: 

dr0: 
dr1: 
dr2: 
dr3: 
dr6: 0ff0
dr7: 0400
Code (instr addr 0256)
ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff <00> f0
53 ff 00 f0 53 ff 00 f0 53



vcpu1 offline

~# /usr/local/lib/xen/bin/xenctx -s
/boot/System.map-5.9.0-rc8-20201010-doflr-mac80211debug+ -f -a -C 4
vcpu0:
cs:eip: ca80:0256
flags: 0016 nz a p
ss:esp: :6f38
eax: 029e0012   ebx: fb00   ecx: 028444b7   edx: 0511
esi:    edi: f97bb38f   ebp: 6f38
 ds: ca80es: 0010fs: gs: 

cr0: 0011
cr2: 
cr3: 0040
cr4: 

dr0: 
dr1: 
dr2: 
dr3: 
dr6: 0ff0
dr7: 0400
Code (instr addr 0256)
ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff <00> f0
53 ff 00 f0 53 ff 00 f0 53



vcpu1 offline

~# /usr/local/lib/xen/bin/xenctx -s
/boot/System.map-5.9.0-rc8-20201010-doflr-mac80211debug+ -f -a -C 4
vcpu0:
cs:eip: ca80:0256
flags: 0016 nz a p
ss:esp: :6f38
eax: 029e0012   ebx: fb00   ecx: 02840901   edx: 0511
esi:    edi: f97bef45   ebp: 6f38
 ds: ca80es: 0010fs: gs: 

cr0: 0011
cr2: 
cr3: 0040
cr4: 

dr0: 
dr1: 
dr2: 
dr3: 
dr6: 0ff0
dr7: 0400
Code (instr addr 0256)
ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff <00> f0
53 ff 00 f0 53 ff 00 f0 53



vcpu1 offline

~# /usr/local/lib/xen/bin/xenctx -s
/boot/System.map-5.9.0-rc8-20201010-doflr-mac80211debug+ -f -a -C 4
vcpu0:
cs:eip: ca80:0256
flags: 0016 nz a p
ss:esp: :6f38
eax: 029e0012   ebx: fb00   ecx: 0283d4bd   edx: 0511
esi:    edi: f97c2389   ebp: 6f38
 ds: ca80es: 0010fs: gs: 

cr0: 0011
cr2: 
cr3: 0040
cr4: 

dr0: 
dr1: 
dr2: 
dr3: 
dr6: 0ff0
dr7: 0400
Code (instr addr 0256)
ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff <00> f0
53 ff 00 f0 53 ff 00 f0 53



vcpu1 offline

~# /usr/local/lib/xen/bin/xenctx -s
/boot/System.map-5.9.0-rc8-20201010-doflr-mac80211debug+ -f -a -C 4
vcpu0:
cs:eip: ca80:0256
flags: 0016 nz a p
ss:esp: :6f38
eax: 029e0012   ebx: fb00   ecx: 02838e90   edx: 0511
esi:    edi: f97c69b6   ebp: 6f38
 ds: ca80es: 0010fs: gs: 

cr0: 0011
cr2: 
cr3: 0040
cr4: 

dr0: 
dr1: 
dr2:

Re: [SUSPECTED SPAM]Xen-unstable :can't boot HVM guests, bisected to commit: "hvmloader: indicate ACPI tables with "ACPI data" type in e820"

2020-10-11 Thread Sander Eikelenboom
On 11/10/2020 13:20, Igor Druzhinin wrote:
> On 11/10/2020 11:40, Igor Druzhinin wrote:
>> On 11/10/2020 10:43, Sander Eikelenboom wrote:
>>> On 11/10/2020 02:06, Igor Druzhinin wrote:
>>>> On 10/10/2020 18:51, Sander Eikelenboom wrote:
>>>>> Hi Igor/Jan,
>>>>>
>>>>> I tried to update my AMD machine to current xen-unstable, but
>>>>> unfortunately the HVM guests don't boot after that. The guest keeps
>>>>> using CPU-cycles but I don't get to a command prompt (or any output at
>>>>> all). PVH guests run fine.
>>>>>
>>>>> Bisection leads to commit:
>>>>>
>>>>> 8efa46516c5f4cf185c8df179812c185d3c27eb6
>>>>> hvmloader: indicate ACPI tables with "ACPI data" type in e820
>>>>>
>>>>> I tried xen-unstable with this commit reverted and with that everything
>>>>> works fine.
>>>>>
>>>>> I attached the xl-dmesg output.
>>>>
>>>> What guests are you using? 
>>> Not sure I understand what you ask for, but:
>>> dom0 PV
>>> guest HVM (qemu-xen)
>>>
>>>> Could you get serial output from the guest?
>>> Not getting any, it seems to be stuck in very early boot.
>>>
>>>> Is it AMD specific?
>>> Can't tell, this is the only machine I test xen-unstable on.
>>> It's a AMD phenom X6.
>>> Both dom0 and guest kernel are 5.9-rc8.
>>>
>>> Tested with guest config:
>>> kernel  = '/boot/vmlinuz-xen-guest'
>>> ramdisk = '/boot/initrd.img-xen-guest'
>>>
>>> cmdline = 'root=UUID=7cc4a90d-d6b0-4958-bb7d-50497aa29f18 ro
>>> nomodeset console=tty1 console=ttyS0 console=hvc0 earlyprintk=xen'
>>>
>>> type='hvm'
>>>
>>> device_model_version = 'qemu-xen'
>>>
>>> cpus= "2-5"
>>> vcpus = 2
>>>
>>> memory  = '512'
>>>
>>> disk= [
>>>   'phy:/dev/xen_vms_ssd/media,xvda,w'
>>>   ]
>>>
>>> name= 'guest'
>>>
>>> vif = [ 'bridge=xen_bridge,ip=192.168.1.10,mac=00:16:3E:DC:0A:F1' ]
>>>
>>> on_poweroff = 'destroy'
>>> on_reboot   = 'restart'
>>> on_crash= 'preserve'
>>>
>>> vnc=0
>>>
>>>
>>>> If it's a Linux guest could you get a stacktrace from
>>>> the guest using xenctx?
>>>
>>> It is, here are few subsequent runs:
>>>
>>> ~# /usr/local/lib/xen/bin/xenctx -s
>>> /boot/System.map-5.9.0-rc8-20201010-doflr-mac80211debug+ -f -a -C 4
>>> vcpu0:
>>> cs:eip: ca80:0256
>>
>> Ok, it's stuck in linuxboot.bin option ROM. That's not something we test in 
>> Citrix -
>> we don't use fw_cfg. It could be something with caching (given it's moving 
>> but slowly) or a
>> bug uncovered by memory map changes. I'll try to get a repro on Monday.
> 
> Right, I think I know what will fix your problem - could you flip "ACPI data"
> type to "ACPI NVS" in my commit.

Just did and the guest now boots fine.
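
For reference, the flip being tested boils down to a one-line type change
where hvmloader enters the ACPI table region into the guest's e820. A sketch
with made-up variable names (the real code is in
tools/firmware/hvmloader/e820.c and its context differs):

    e820[nr].addr = acpi_tables_start;   /* hypothetical names */
    e820[nr].size = acpi_tables_size;
    e820[nr].type = E820_NVS;            /* was E820_ACPI ("ACPI data") */
    nr++;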

--
Sander

> Jan, this is what we've discussed on the list as an ambiguity in ACPI spec but
> couldn't reach a clean resolution after all.
> SeaBIOS thinks that "ACPI data" type is essentially RAM that could be reported
> as RAM resource to the guest in E801.
> https://wiki.osdev.org/Detecting_Memory_(x86)#BIOS_Function:_INT_0x15.2C_AX_.3D_0xE801
> 
> // Calculate the maximum ramsize (less than 4gig) from e820 map.
> static void
> calcRamSize(void)
> {
> u32 rs = 0;
> int i;
> for (i=e820_count-1; i>=0; i--) {
> struct e820entry *en = &e820_list[i];
> u64 end = en->start + en->size;
> u32 type = en->type;
> if (end <= 0xffffffff && (type == E820_ACPI || type == E820_RAM)) {
> rs = end;
> break;
> }
> }
> LegacyRamSize = rs >= 1024*1024 ? rs : 1024*1024;
> }
> 
> what is wrong here I think is that it clearly doesn't handle holes and worked 
> more
> by luck. So SeaBIOS needs to be fixed but I think that using ACPI NVS in 
> hvmloader
> is still safer.
> 
> Igor
> 




Re: [PATCH v2] hvmloader: flip "ACPI data" to "ACPI NVS" type for ACPI table region

2020-10-16 Thread Sander Eikelenboom
On 16/10/2020 08:34, Jan Beulich wrote:
> On 16.10.2020 02:39, Igor Druzhinin wrote:
>> ACPI specification contains statements describing memory marked with regular
>> "ACPI data" type as reclaimable by the guest. Although the guest shouldn't
>> really do it if it wants kexec or similar functionality to work, there
>> could still be ambiguities in treating these regions as potentially regular
>> RAM.
>>
>> One such example is SeaBIOS which currently reports "ACPI data" regions as
>> RAM to the guest in its e801 call. Which it might have the right to do as any
>> user of this is expected to be ACPI unaware. But a QEMU bootloader later 
>> seems
>> to ignore that fact and is instead using e801 to find a place for initrd 
>> which
>> causes the tables to be erased. While arguably QEMU bootloader or SeaBIOS 
>> need
>> to be fixed / improved here, that is just one example of the potential 
>> problems
>> from using a reclaimable memory type.
>>
>> Flip the type to "ACPI NVS" which doesn't have this ambiguity in it and is
>> described by the spec as non-reclaimable (so cannot ever be treated like 
>> RAM).
>>
>> Signed-off-by: Igor Druzhinin 
> 
> Acked-by: Jan Beulich 
> 
> 

I don't see any stable and/or Fixes: tags, but I assume this will go to
the stable trees (which have (a backport of)
8efa46516c5f4cf185c8df179812c185d3c27eb6 in their staging branches)?

(and as reporter it would have been nice to have been CC'ed on the patch)

--
Sander



Re: preparations for 4.13.1 and 4.12.3

2020-04-15 Thread Sander Eikelenboom
On 09/04/2020 09:41, Jan Beulich wrote:
> All,
> 
> the releases are due in a week or two. Please point out backports
> you find missing from the respective staging branches, but which
> you consider relevant. (Ian, I notice there haven't been any
> tools side backports at all so far. Julien, Stefano - same for
> Arm.)
> 
> Jan

I would like to suggest for 4.13.1:

4b5b431edd984b26f43b3efc7de465f3560a949e "tools/xentop: Fix calculation
of used memory"

Thanks,

--
Sander





Re: [Xen-devel] Xen-unstable: pci-passthrough regression bisected to: x86/smp: use APIC ALLBUT destination shorthand when possible

2020-02-10 Thread Sander Eikelenboom
On 03/02/2020 14:21, Roger Pau Monné wrote:
> On Mon, Feb 03, 2020 at 01:44:06PM +0100, Sander Eikelenboom wrote:
>> On 03/02/2020 13:41, Roger Pau Monné wrote:
>>> On Mon, Feb 03, 2020 at 01:30:55PM +0100, Sander Eikelenboom wrote:
>>>> On 03/02/2020 13:23, Roger Pau Monné wrote:
>>>>> On Mon, Feb 03, 2020 at 09:33:51AM +0100, Sander Eikelenboom wrote:
>>>>>> Hi Roger,
>>>>>>
>>>>>> Last week I encountered an issue with the PCI-passthrough of a USB 
>>>>>> controller. 
>>>>>> In the guest I get:
>>>>>> [ 1143.313756] xhci_hcd :00:05.0: xHCI host not responding to 
>>>>>> stop endpoint command.
>>>>>> [ 1143.334825] xhci_hcd :00:05.0: xHCI host controller not 
>>>>>> responding, assume dead
>>>>>> [ 1143.347364] xhci_hcd :00:05.0: HC died; cleaning up
>>>>>> [ 1143.356407] usb 1-2: USB disconnect, device number 2
>>>>>>
>>>>>> Bisection turned up as the culprit: 
>>>>>>commit 5500d265a2a8fa63d60c08beb549de8ec82ff7a5
>>>>>>x86/smp: use APIC ALLBUT destination shorthand when possible
>>>>>
>>>>> Sorry to hear that, let see if we can figure out what's wrong.
>>>>
>>>> No problem, that is why I test stuff :)
>>>>
>>>>>> I verified by reverting that commit and now it works fine again.
>>>>>
>>>>> Does the same controller work fine when used in dom0?
>>>>
>>>> Will test that, but as all other pci devices in dom0 work fine,
>>>> I assume this controller would also work fine in dom0 (as it has also
>>>> worked fine for ages with PCI-passthrough to that guest and still works
>>>> fine when reverting the referenced commit).
>>>
>>> Is this the only device that fails to work when doing pci-passthrough,
>>> or other devices also don't work with the mentioned change applied?
>>>
>>> Have you tested on other boxes?
>>>
>>>> I don't know if your change can somehow have a side effect
>>>> on latency around the processing of pci-passthrough ?
>>>
>>> Hm, the mentioned commit should speed up broadcast IPIs, but I don't
>>> see how it could slow down other interrupts. Also I would think the
>>> domain is not receiving interrupts from the device, rather than
>>> interrupts being slow.
>>>
>>> Can you also paste the output of lspci -v for that xHCI device from
>>> dom0?
>>>
>>> Thanks, Roger.
>>
>> Will do this evening including the testing in dom0 etc.
>> Will also see if there is any pattern when observing /proc/interrupts in
>> the guest.
> 
> Thanks! I also have some trivial patch that I would like you to try,
> just to discard send_IPI_mask clearing the scratch_cpumask under
> another function's feet.
> 
> Roger.

Hi Roger,

Took a while, but I was able to run some tests now.

I also forgot a detail in the first report (probably still a bit tired from
FOSDEM), namely: the passed-through device works OK for a while before I get
the kernel message.

I tested the patch and it looks like it makes the issue go away. I tested for
a day, while without the patch (or with the commit reverted) the device gives
problems within a few hours.

lspci output from dom0 for this device is below.

--
Sander




lspci -vvvknn -s 08:00.0
08:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host 
Controller [1033:0194] (rev 03) (prog-if 30 [XHCI])
Subsystem: ASUSTeK Computer Inc. P8P67 Deluxe Motherboard [1043:8413]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-  ---
> diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
> index 65eb7cbda8..aeeb506155 100644
> --- a/xen/arch/x86/smp.c
> +++ b/xen/arch/x86/smp.c
> @@ -66,7 +66,8 @@ static void send_IPI_shortcut(unsigned int shortcut, int 
> vector,
>  void send_IPI_mask(const cpumask_t *mask, int vector)
>  {
>  bool cpus_locked = false;
> -cpumask_t *scratch = this_cpu(scratch_cpumask);
> +static DEFINE_PER_CPU(cpumask_t, send_ipi_cpumask);
> +cpumask_t *scratch = &this_cpu(send_ipi_cpumask);
>  
>  /*
>   * This can only be safely used when no CPU hotplug or unplug operations
> 



Re: [Xen-devel] Xen-unstable: pci-passthrough regression bisected to: x86/smp: use APIC ALLBUT destination shorthand when possible

2020-02-12 Thread Sander Eikelenboom
On 11/02/2020 15:00, Roger Pau Monné wrote:
> On Mon, Feb 10, 2020 at 09:49:30PM +0100, Sander Eikelenboom wrote:
>> On 03/02/2020 14:21, Roger Pau Monné wrote:
>>> On Mon, Feb 03, 2020 at 01:44:06PM +0100, Sander Eikelenboom wrote:
>>>> On 03/02/2020 13:41, Roger Pau Monné wrote:
>>>>> On Mon, Feb 03, 2020 at 01:30:55PM +0100, Sander Eikelenboom wrote:
>>>>>> On 03/02/2020 13:23, Roger Pau Monné wrote:
>>>>>>> On Mon, Feb 03, 2020 at 09:33:51AM +0100, Sander Eikelenboom wrote:
>>>>>>>> Hi Roger,
>>>>>>>>
>>>>>>>> Last week I encountered an issue with the PCI-passthrough of a USB 
>>>>>>>> controller. 
>>>>>>>> In the guest I get:
>>>>>>>> [ 1143.313756] xhci_hcd :00:05.0: xHCI host not responding to 
>>>>>>>> stop endpoint command.
>>>>>>>> [ 1143.334825] xhci_hcd :00:05.0: xHCI host controller not 
>>>>>>>> responding, assume dead
>>>>>>>> [ 1143.347364] xhci_hcd :00:05.0: HC died; cleaning up
>>>>>>>> [ 1143.356407] usb 1-2: USB disconnect, device number 2
>>>>>>>>
>>>>>>>> Bisection turned up as the culprit: 
>>>>>>>>commit 5500d265a2a8fa63d60c08beb549de8ec82ff7a5
>>>>>>>>x86/smp: use APIC ALLBUT destination shorthand when possible
>>>>>>>
>>>>>>> Sorry to hear that, let see if we can figure out what's wrong.
>>>>>>
>>>>>> No problem, that is why I test stuff :)
>>>>>>
>>>>>>>> I verified by reverting that commit and now it works fine again.
>>>>>>>
>>>>>>> Does the same controller work fine when used in dom0?
>>>>>>
>>>>>> Will test that, but as all other pci devices in dom0 work fine,
>>>>>> I assume this controller would also work fine in dom0 (as it has also
>>>>>> worked fine for ages with PCI-passthrough to that guest and still works
>>>>>> fine when reverting the referenced commit).
>>>>>
>>>>> Is this the only device that fails to work when doing pci-passthrough,
>>>>> or other devices also don't work with the mentioned change applied?
>>>>>
>>>>> Have you tested on other boxes?
>>>>>
>>>>>> I don't know if your change can somehow have a side effect
>>>>>> on latency around the processing of pci-passthrough ?
>>>>>
>>>>> Hm, the mentioned commit should speed up broadcast IPIs, but I don't
>>>>> see how it could slow down other interrupts. Also I would think the
>>>>> domain is not receiving interrupts from the device, rather than
>>>>> interrupts being slow.
>>>>>
>>>>> Can you also paste the output of lspci -v for that xHCI device from
>>>>> dom0?
>>>>>
>>>>> Thanks, Roger.
>>>>
>>>> Will do this evening including the testing in dom0 etc.
>>>> Will also see if there is any pattern when observing /proc/interrupts in
>>>> the guest.
>>>
>>> Thanks! I also have some trivial patch that I would like you to try,
>>> just to discard send_IPI_mask clearing the scratch_cpumask under
>>> another function's feet.
>>>
>>> Roger.
>>
>> Hi Roger,
>>
>> Took a while, but I was able to run some tests now.
>>
>> I also forgot a detail in the first report (probably still a bit tired from
>> FOSDEM), namely: the passed-through device works OK for a while before I get
>> the kernel message.
>>
>> I tested the patch and it looks like it makes the issue go away. I tested
>> for a day, while without the patch (or with the commit reverted) the device
>> gives problems within a few hours.
> 
> Thanks, I have another patch for you to try, which will likely make
> your system crash. Could you give it a try and paste the log output?
> 
> Thanks, Roger.

Applied the patch, rebuilt, rebooted and braced for impact ...
However the device bugged again, but no xen panic occurred, so nothing
special in the logs.
I only had time to try it once, so I could retry this evening.

--
Sander





Re: [Xen-devel] [PATCH 0/3] x86: fixes/improvements for scratch cpumask

2020-02-12 Thread Sander Eikelenboom
On 12/02/2020 17:49, Roger Pau Monne wrote:
> Hello,
> 
> Commit:
> 
> 5500d265a2a8fa63d60c08beb549de8ec82ff7a5
> x86/smp: use APIC ALLBUT destination shorthand when possible
> 
> Introduced a bogus usage of the scratch cpumask: it was used in a
> function that could be called from interrupt context, and hence using
> the scratch cpumask there is not safe. Patch #2 is a fix for that usage.
> 
> Patch #3 adds some debug infrastructure to make sure the scratch cpumask
> is used in the right context, and hence should prevent further misuses.
> 
> Thanks, Roger.
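
The reentrancy hazard described above boils down to this pattern; a minimal
sketch using Xen-style per-CPU helpers (illustrative, not the actual Xen
code):

#include <xen/cpumask.h>
#include <xen/percpu.h>

static DEFINE_PER_CPU(cpumask_t, scratch_cpumask);

static void some_non_irq_path(const cpumask_t *in)
{
    cpumask_t *scratch = &this_cpu(scratch_cpumask);

    cpumask_and(scratch, in, &cpu_online_map);
    /*
     * If an interrupt arrives here and its handler reaches
     * send_IPI_mask(), which used the same per-CPU scratch mask
     * before patch #2, the contents computed above are silently
     * clobbered.
     */
    do_something_with(scratch);   /* hypothetical consumer */
}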

Hi Roger,

Do you still want me to test the "panic" patch ?
Or test this series instead ?

--
Sander


> Roger Pau Monne (3):
>   x86/smp: unify header includes in smp.h
>   x86/smp: use a dedicated scratch cpumask in send_IPI_mask
>   x86: add accessors for scratch cpu mask
> 
>  xen/arch/x86/io_apic.c|  6 --
>  xen/arch/x86/irq.c| 13 ++---
>  xen/arch/x86/mm.c | 30 +-
>  xen/arch/x86/msi.c|  4 +++-
>  xen/arch/x86/smp.c| 14 +-
>  xen/arch/x86/smpboot.c| 33 -
>  xen/include/asm-x86/smp.h | 15 +++
>  7 files changed, 94 insertions(+), 21 deletions(-)
> 



Re: [Xen-devel] [PATCH 0/3] x86: fixes/improvements for scratch cpumask

2020-02-12 Thread Sander Eikelenboom
On 12/02/2020 18:01, Roger Pau Monné wrote:
> On Wed, Feb 12, 2020 at 05:53:39PM +0100, Sander Eikelenboom wrote:
>> On 12/02/2020 17:49, Roger Pau Monne wrote:
>>> Hello,
>>>
>>> Commit:
>>>
>>> 5500d265a2a8fa63d60c08beb549de8ec82ff7a5
>>> x86/smp: use APIC ALLBUT destination shorthand when possible
>>>
>>> Introduced a bogus usage of the scratch cpumask: it was used in a
>>> function that could be called from interrupt context, and hence using
>>> the scratch cpumask there is not safe. Patch #2 is a fix for that usage.
>>>
>>> Patch #3 adds some debug infrastructure to make sure the scratch cpumask
>>> is used in the right context, and hence should prevent further misuses.
>>>
>>> Thanks, Roger.
>>
>> Hi Roger,
>>
>> Do you still want me to test the "panic" patch ?
>> Or test this series instead ?
> 
> I've been able to trigger this myself, so if you can give a try to the
> series in order to assert it fixes your issue that would be great.
> 
> Thanks.
> 

Sure, compiling now, will report back tomorrow morning.
--
Sander


Re: [Xen-devel] [PATCH 0/3] x86: fixes/improvements for scratch cpumask

2020-02-13 Thread Sander Eikelenboom
On 12/02/2020 18:13, Sander Eikelenboom wrote:
> On 12/02/2020 18:01, Roger Pau Monné wrote:
>> On Wed, Feb 12, 2020 at 05:53:39PM +0100, Sander Eikelenboom wrote:
>>> On 12/02/2020 17:49, Roger Pau Monne wrote:
>>>> Hello,
>>>>
>>>> Commit:
>>>>
>>>> 5500d265a2a8fa63d60c08beb549de8ec82ff7a5
>>>> x86/smp: use APIC ALLBUT destination shorthand when possible
>>>>
>>>> Introduced a bogus usage of the scratch cpumask: it was used in a
>>>> function that could be called from interrupt context, and hence using
>>>> the scratch cpumask there is not safe. Patch #2 is a fix for that usage.
>>>>
>>>> Patch #3 adds some debug infrastructure to make sure the scratch cpumask
>>>> is used in the right context, and hence should prevent further misuses.
>>>>
>>>> Thanks, Roger.
>>>
>>> Hi Roger,
>>>
>>> Do you still want me to test the "panic" patch ?
>>> Or test this series instead ?
>>
>> I've been able to trigger this myself, so if you can give a try to the
>> series in order to assert it fixes your issue that would be great.
>>
>> Thanks.
>>
> 
> Sure, compiling now, will report back tomorrow morning.
> --
> Sander
> 

Haven't seen the issue yet, so it seems fixed.
Thanks !
--
Sander


Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler

2020-02-17 Thread Sander Eikelenboom
On 17/02/2020 20:58, Sarah Newman wrote:
> On 1/7/20 6:25 AM, Alastair Browne wrote:
>>
>> CONCLUSION
>>
>> So in conclusion, the tests indicate that credit2 might be unstable.
>>
>> For the time being, we are using credit as the chosen scheduler. We
>> are booting the kernel with a parameter "sched=credit" to ensure that
>> the correct scheduler is used.
>>
>> After the tests, we decided to stick with 4.9.0.9 kernel and 4.12 Xen
>> for production use running credit1 as the default scheduler.
> 
> One person CC'ed appears to be having the same experience, where the credit2 
> scheduler leads to lockups (in this case in the domU, not the dom0) under 
> relatively heavy load. It seems possible they may have the same root cause.
> 
> I don't think there are, but have there been any patches since the 4.13.0 
> release which might have fixed problems with credit 2 scheduler? If not, 
> what would the next step be to isolating the problem - a debug build of Xen 
> or something else?
> 
> If there are no merged or proposed fixes soon, it may be worth considering 
> making the credit scheduler the default again until problems with the 
> credit2 scheduler are resolved.
> 
> Thanks, Sarah
> 
> 

Hi Sarah / Alastair,

I can only provide my n=1 (OK, I'm running a bunch of boxes, some of which
are pretty over-committed CPU-wise),
but I haven't seen any issues (lately) with credit2.

I did take a look at Alastair Browne's report you replied to
(https://lists.xen.org/archives/html/xen-devel/2020-01/msg00361.html)
and I do see some differences:
- Alastair's machine has multiple sockets; my machines don't.
- It seems Alastair's config is using ballooning
(dom0_mem=4096M,max:16384M)? For me that has been a source of trouble in the
past, so my configs don't use it.
- The kernels tested are quite old (4.19.67 (latest upstream is 4.19.104),
4.9.189 (latest upstream is 4.9.214)) and no really new kernel is tested
  (5.4 is available in Debian backports for buster).
- Alastair, are you using PV, HVM or PVH guests? The report seems to miss
the guest configs (I'm primarily using PVH, and a few HVMs; no PV except for
dom0)?

Anyhow, it could be worthwhile to test without ballooning, and to test a
recent kernel to rule out an issue with (missing) kernel backports.
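
For completeness: the scheduler is selected on the Xen (hypervisor) command
line, not the dom0 kernel one. On a Debian-style setup that looks roughly
like this (illustrative; file and variable names depend on the distro):

# /etc/default/grub (illustrative)
GRUB_CMDLINE_XEN_DEFAULT="sched=credit"

After update-grub and a reboot, the "(XEN) Using scheduler: ..." line in
'xl dmesg' confirms which scheduler is actually active.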

--
Sander

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
