Re: [Xen-devel] [PATCH] xen: Fix possible user space selector corruption

2013-10-03 Thread Andrew Cooper
On 03/10/13 09:24, Frediano Ziglio wrote:
> Due to the way the kernel is initialized under Xen, it is possible that the
> ring1 selector used by the kernel for the boot cpu ends up being copied to
> userspace, leading to a segmentation fault in userspace.
>
> The Xen code in the kernel initializes non-boot cpus with the correct
> selectors (ds and es set to __USER_DS), but the boot cpu keeps the ring1
> selector passed in by Xen. On task context switch (switch_to) we assume
> that ds, es and cs already point to __USER_DS and __KERNEL_CS, so these
> selectors are not changed.
>
> If the processor is an Intel that supports the sysenter instruction,
> sysenter/sysexit is used, so ds and es are not restored when switching back
> from the kernel to userspace. If the selectors point to ring1 instead of
> __USER_DS, the userspace code will crash on its first memory access attempt
> (to be precise, Xen, in the emulated iret used to do sysexit, will detect
> this and set ds and es to zero, which leads to a GPF anyway).
>
> Now if a userspace process calls into the kernel using sysenter and gets
> rescheduled (for me it happens with a specific init calling wait4), it can
> happen that the ring1 selector ends up in ds and es.
>
> This is quite hard to detect because after a while these selectors are
> fixed up (__USER_DS seems sticky).
>
> Bisecting the code, commit 7076aada1040de4ed79a5977dbabdb5e5ea5e249 appears
> to be the first one that has this issue.
>
> Signed-off-by: Frediano Ziglio 

In terms of the correctness of the fix,

Reviewed-by: Andrew Cooper 

However, I am not sure the comment is necessary.  The prevailing style
is for no justification of loads of segment selectors on boot, and the
comment itself refers simply to an interaction issue of 32bit on Xen
when making use of sysenter.

> ---
>  arch/x86/xen/smp.c |   12 
>  1 file changed, 12 insertions(+)
>
> diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
> index d1e4777..2a47241 100644
> --- a/arch/x86/xen/smp.c
> +++ b/arch/x86/xen/smp.c
> @@ -278,6 +278,18 @@ static void __init xen_smp_prepare_boot_cpu(void)
>  old memory can be recycled */
>   make_lowmem_page_readwrite(xen_initial_gdt);
>  
> +#ifdef CONFIG_X86_32
> + /*
> +  * Ensure we use segments with user level access.
> +  * These segments are not reloaded on task switch, so a
> +  * userspace task could end up with the Xen ring1 selector,
> +  * causing a fault on its next userspace memory access
> +  * after a sysenter exit.
> +  */
> + loadsegment(ds, __USER_DS);
> + loadsegment(es, __USER_DS);
> +#endif
> +
>   xen_filter_cpu_maps();
>   xen_setup_vcpu_info_placement();
>   }
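
For reference, loadsegment() is the arch/x86 helper used in the hunk above; a simplified sketch of what it boils down to (the real macro in arch/x86/include/asm/segment.h additionally registers an exception-table fixup, so a faulting selector is replaced with 0 instead of oopsing):

    /*
     * Simplified sketch of loadsegment(); not the kernel's exact macro.
     * Loads a 16bit selector into the named segment register from a
     * register or memory operand.
     */
    #define loadsegment_sketch(seg, value)                      \
        asm volatile("mov %0, %%" #seg                          \
                     : : "rm" ((unsigned short)(value)) : "memory")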



Re: [Xen-devel] [PATCH] xen: Fix possible user space selector corruption

2013-10-04 Thread Andrew Cooper
On 04/10/13 14:20, Konrad Rzeszutek Wilk wrote:
> On Thu, Oct 03, 2013 at 01:51:32PM +0100, Frediano Ziglio wrote:
>> On Thu, 2013-10-03 at 11:04 +0100, Andrew Cooper wrote:
>>> On 03/10/13 09:24, Frediano Ziglio wrote:
>>>>
>>>> Bisecting the code, commit 7076aada1040de4ed79a5977dbabdb5e5ea5e249 appears
>>>> to be the first one that has this issue.
>>>>
>>>> Signed-off-by: Frediano Ziglio 
>>> In terms of the correctness of the fix,
>>>
>>> Reviewed-by: Andrew Cooper 
> Should this also go in stable tree?

Very much so.  The change which exposed it for us was from 3.7 IIRC, but
I believe it has been a latent bug for as long as the native early boot
code has used __USER_DS.

>>> However, I am not sure the comment is necessary.  The prevailing style
>>> is for no justification of loads of segment selectors on boot, and the
>>> comment itself refers simply to an interaction issue of 32bit on Xen
>>> when making use of sysenter.
>>>
>> Suggestion for the comment ??
>>
>> Frediano

My suggestion was to omit the comment entirely, or simplify it to just:

/* Xen starts us with XEN_FLAT_RING1_DS, but linux code expects __USER_DS */

Anyone who wants the full explanation can read the patch description.

~Andrew


Re: [PATCH v2] xen: Fix possible user space selector corruption

2013-10-07 Thread Andrew Cooper
On 07/10/13 10:48, Frediano Ziglio wrote:
> Due to the way the kernel is initialized under Xen, it is possible that the
> ring1 selector used by the kernel for the boot cpu ends up being copied to
> userspace, leading to a segmentation fault in userspace.
>
> The Xen code in the kernel initializes non-boot cpus with the correct
> selectors (ds and es set to __USER_DS), but the boot cpu keeps the ring1
> selector passed in by Xen. On task context switch (switch_to) we assume
> that ds, es and cs already point to __USER_DS and __KERNEL_CS, so these
> selectors are not changed.
>
> If the processor is an Intel that supports the sysenter instruction,
> sysenter/sysexit is used, so ds and es are not restored when switching back
> from the kernel to userspace. If the selectors point to ring1 instead of
> __USER_DS, the userspace code will crash on its first memory access attempt
> (to be precise, Xen, in the emulated iret used to do sysexit, will detect
> this and set ds and es to zero, which leads to a GPF anyway).
>
> Now if a userspace process calls into the kernel using sysenter and gets
> rescheduled (for me it happens with a specific init calling wait4), it can
> happen that the ring1 selector ends up in ds and es.
>
> This is quite hard to detect because after a while these selectors are
> fixed up (__USER_DS seems sticky).
>
> Bisecting the code, commit 7076aada1040de4ed79a5977dbabdb5e5ea5e249 appears
> to be the first one that has this issue.
>
> Signed-off-by: Frediano Ziglio 

Reviewed-by: Andrew Cooper 

> ---
>  arch/x86/xen/smp.c |9 +
>  1 file changed, 9 insertions(+)
>
>
> Just changed the comment in the source code, as suggested by Andrew Cooper.
>
>
> diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
> index d99cae8..6d89fcc 100644
> --- a/arch/x86/xen/smp.c
> +++ b/arch/x86/xen/smp.c
> @@ -245,6 +245,15 @@ static void __init xen_smp_prepare_boot_cpu(void)
>  old memory can be recycled */
>   make_lowmem_page_readwrite(xen_initial_gdt);
>  
> +#ifdef CONFIG_X86_32
> + /*
> +  * Xen starts us with XEN_FLAT_RING1_DS, but linux code
> +  * expects __USER_DS
> +  */
> + loadsegment(ds, __USER_DS);
> + loadsegment(es, __USER_DS);
> +#endif
> +
>   xen_filter_cpu_maps();
>   xen_setup_vcpu_info_placement();
>  }



Re: [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2012-12-27 Thread Andrew Cooper
On 27/12/2012 07:53, Eric W. Biederman wrote:
> The syscall ABI still has the wrong semantics.
>
> Aka totally unmaintainable and unmergeable.
>
> The concept of domU support is also strange.  What does domU support even 
> mean, when the dom0 support is loading a kernel to pick up Xen when Xen falls 
> over.

There are two requirements pulling at this patch series, but I agree
that we need to clarify them.

When dom0 loads a crash kernel, it is loading one for Xen to use.  As a
dom0 crash causes a Xen crash, having dom0 set up a kdump kernel for
itself is completely useless.  This ability is present in "classic Xen
dom0" kernels, but the feature is currently missing in PVOPS.

Many cloud customers and service providers want the ability for a VM
administrator to be able to load a kdump/kexec kernel within a
domain[1].  This allows the VM administrator to take more proactive
steps to isolate the cause of a crash, the state of which is most likely
discarded while tearing down the domain.  The result being that as far
as Xen is concerned, the domain is still alive, while the kdump
kernel/environment can work its usual magic.  I am not aware of any
feature like this existing in the past.

~Andrew

[1] http://lists.xen.org/archives/html/xen-devel/2012-11/msg01274.html

>
> I expect a lot of decisions about what code can be shared and what code can't 
> is going to be driven by the simple question what does the syscall mean.
>
> Sharing machine_kexec.c and relocate_kernel.S does not make much sense to me 
> when what you are doing is effectively passing your arguments through to the 
> Xen version of kexec.
>
> Either Xen has its own version of those routines or I expect the Xen version 
> of kexec is buggy.   I can't imagine what sharing that code would mean.  By 
> the same token I can't see any need to duplicate the code either.
>
> Furthermore since this is just passing data from one version of the syscall 
> to another I expect you can share the majority of the code across all 
> architectures that implement Xen.  The only part I can see being arch 
> specific is the Xen syscall stub.
>
> With respect to the proposed semantics of silently giving the kexec system 
> call different meaning when running under Xen,
> /sbin/kexec has to act somewhat differently when loading code into the Xen 
> hypervisor so there is no point not making that explicit in the ABI.
>
> Eric
>



Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-02 Thread Andrew Cooper

On 27/12/12 18:02, Eric W. Biederman wrote:
> Andrew Cooper  writes:
>> On 27/12/2012 07:53, Eric W. Biederman wrote:
>>> The syscall ABI still has the wrong semantics.
>>>
>>> Aka totally unmaintainable and unmergeable.
>>>
>>> The concept of domU support is also strange.  What does domU support even mean,
>>> when the dom0 support is loading a kernel to pick up Xen when Xen falls over.
>> There are two requirements pulling at this patch series, but I agree
>> that we need to clarify them.
> It probably makes sense to split them apart a little, even.

Thinking about this split, there might be a way to simplify it even more.

/sbin/kexec can load the "Xen" crash kernel itself by issuing hypercalls 
using /dev/xen/privcmd.  This would remove the need for the dom0 kernel 
to distinguish between loading a crash kernel for itself and loading a 
kernel for Xen.

Or is this just a silly idea complicating the matter?
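
For concreteness, issuing a hypercall from dom0 userspace via privcmd is roughly the following sketch (assuming the kernel uapi header <xen/privcmd.h>; the kexec op number and argument layout are placeholders, since the point under discussion is that no suitable op exists yet):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <xen/privcmd.h> /* struct privcmd_hypercall, IOCTL_PRIVCMD_HYPERCALL */

    /* Issue an arbitrary hypercall as a privileged dom0 process. */
    static long do_hypercall(uint64_t op, uint64_t arg1, uint64_t arg2)
    {
        struct privcmd_hypercall call = {
            .op  = op,              /* e.g. a hypothetical kexec-load op */
            .arg = { arg1, arg2 },
        };
        long ret;
        int fd = open("/dev/xen/privcmd", O_RDWR);

        if (fd < 0)
            return -1;

        /* The ioctl return value is the hypercall's return value. */
        ret = ioctl(fd, IOCTL_PRIVCMD_HYPERCALL, &call);
        close(fd);
        return ret;
    }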

~Andrew


Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct

2012-11-22 Thread Andrew Cooper
On 22/11/12 17:47, H. Peter Anvin wrote:
> The other thing that should be considered here is how utterly 
> preposterous the notion of doing in-guest crash dumping is in a system 
> that contains a hypervisor.  The reason for kdump is that on bare metal 
> there are no other options, but in a hypervisor system the right thing 
> should be for the hypervisor to do the dump (possibly spawning a clean 
> I/O domain if the I/O domain is necessary to access the media.)
>
> There is absolutely no reason to have a crashkernel sitting around in 
> each guest, consuming memory, and possibly getting corrupted.
>
>   -hpa
>

I agree that regular guests should not be using kexec/kdump. 
However, this patch series is required to allow a pvops kernel to be
the crash kernel for Xen, which is very important from dom0/Xen's point of
view.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com



Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct

2012-11-22 Thread Andrew Cooper
On 22/11/2012 17:47, H. Peter Anvin wrote:
> The other thing that should be considered here is how utterly 
> preposterous the notion of doing in-guest crash dumping is in a system 
> that contains a hypervisor.  The reason for kdump is that on bare metal 
> there are no other options, but in a hypervisor system the right thing 
> should be for the hypervisor to do the dump (possibly spawning a clean 
> I/O domain if the I/O domain is necessary to access the media.)
>
> There is absolutely no reason to have a crashkernel sitting around in 
> each guest, consuming memory, and possibly getting corrupted.
>
>   -hpa
>

(Your reply to my email which I can see on the xen devel archive appears
to have gotten lost somewhere inside the citrix email system, so
apologies for replying out of order)

The kdump kernel loaded by dom0 is for when Xen crashes, not for when
dom0 crashes (although a dom0 crash does admittedly lead to a Xen crash).

There is no possible way it could be a separate domain; Xen completely
ceases to function as soon as it jumps to the entry point of the kdump image.

~Andrew


Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct

2012-11-22 Thread Andrew Cooper
On 23/11/2012 01:38, H. Peter Anvin wrote:
> I still don't really get why it can't be isolated from dom0, which would make 
> more sense to me, even for a Xen crash.
>

The crash region (as specified by crashkernel= on the Xen command line)
is isolated from dom0.

dom0 (using the kexec utility etc) has the task of locating the Xen
crash notes (using the kexec hypercall interface), constructing a binary
blob containing kernel, initram and gubbins, and asking Xen to put this
blob in the crash region (again, using the kexec hypercall interface).

I do not see how this is very much different from the native case
currently (although please correct me if I am misinformed).  Linux has
extra work to do in populating /proc/iomem with the Xen crash regions at
boot (so the kexec utility can reference their physical addresses when
constructing the blob), and should just act as a conduit between the
kexec system call and the kexec hypercall to load the blob.

For within-guest kexec/kdump functionality, I agree that it is barking
mad.  However, we do see cloud operators interested in the idea so VM
administrators can look after their crashes themselves.

~Andrew


[PATCH] xen: Fix stack corruption in xen_failsafe_callback for 32bit PVOPS guests.

2013-01-16 Thread Andrew Cooper
This fixes CVE-2013-0190 / XSA-40

There has been an error on the xen_failsafe_callback path for failed
iret, which causes the stack pointer to be wrong when entering the
iret_exc error path.  This can result in the kernel crashing.

In the classic kernel case, the relevant code looked a little like:

popl %eax  # Error code from hypervisor
jz 5f
addl $16,%esp
jmp iret_exc   # Hypervisor said iret fault
5:  addl $16,%esp
   # Hypervisor said segment selector fault

Here, there are two identical addls on either option of a branch which
appears to have been optimised by hoisting it above the jz, and
converting it to an lea, which leaves the flags register unaffected.

In the PVOPS case, the code looks like:

popl_cfi %eax # Error from the hypervisor
lea 16(%esp),%esp # Add $16 before choosing fault path
CFI_ADJUST_CFA_OFFSET -16
jz 5f
addl $16,%esp # Incorrectly adjust %esp again
jmp iret_exc

It is possible for unprivileged userspace applications to cause this
behaviour, for example by loading an LDT code selector, then changing
the code selector to be not-present.  At this point, there is a race
condition where it is possible for the hypervisor to return back to
userspace from an interrupt, fault on its own iret, and inject a
failsafe_callback into the kernel.
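
A minimal sketch of the descriptor manipulation described above, via the modify_ldt(2) interface (the race itself is only outlined in the comments; this is an illustration, not the full reproducer):

    #include <asm/ldt.h>        /* struct user_desc */
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Install LDT entry 0 as a 32bit code segment, present or not.
     * The matching selector is (0 << 3) | 4 | 3 = 0x7 (TI=1, RPL=3). */
    static int set_ldt_code_seg(int present)
    {
        struct user_desc desc;

        memset(&desc, 0, sizeof(desc));
        desc.entry_number    = 0;
        desc.limit           = 0xfffff;
        desc.seg_32bit       = 1;
        desc.contents        = 2;          /* code segment */
        desc.limit_in_pages  = 1;
        desc.seg_not_present = !present;

        return syscall(SYS_modify_ldt, 1, &desc, sizeof(desc));
    }

    /* Outline of the race: one thread far-jumps so %cs becomes 0x7 while
     * another flips the descriptor to not-present; if the hypervisor then
     * faults on its own iret back to this task, it injects
     * xen_failsafe_callback and the buggy stack adjustment is reached. */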

This bug has been present since the introduction of Xen PVOPS support
in commit 5ead97c84 (xen: Core Xen implementation), in 2.6.23.

Signed-off-by: Frediano Ziglio 
Signed-off-by: Andrew Cooper 
Cc: sta...@vger.kernel.org

---
Cc: oss-secur...@lists.openwall.com
Cc: Konrad Rzeszutek Wilk 
Cc: xen-de...@lists.xen.org
Cc: secur...@xen.org
---
 arch/x86/kernel/entry_32.S |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index ff84d54..6ed91d9 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -1065,7 +1065,6 @@ ENTRY(xen_failsafe_callback)
lea 16(%esp),%esp
CFI_ADJUST_CFA_OFFSET -16
jz 5f
-   addl $16,%esp
jmp iret_exc
 5: pushl_cfi $-1 /* orig_ax = -1 => not a system call */
SAVE_ALL
-- 
1.7.2.5



Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-07 Thread Andrew Cooper

On 07/01/13 10:25, Ian Campbell wrote:
> On Fri, 2013-01-04 at 19:11 +, Konrad Rzeszutek Wilk wrote:
>> On Fri, Jan 04, 2013 at 06:07:51PM +0100, Daniel Kiper wrote:
>>> Because the current KEXEC_CMD_kexec_load does not load the kernel
>>> image and other things into Xen memory, it means that they have to
>>> live somewhere in dom0 Linux kernel memory.
>> We could have a very simple hypercall which would have:
>>
>> struct fancy_new_hypercall {
>> 	xen_pfn_t payload; // IN
> This would have to be XEN_GUEST_HANDLE(something) since userspace cannot
> figure out what pfns back its memory. In any case since the hypervisor
> is going to want to copy the data into the crashkernel space a virtual
> address is convenient to have.
>
>> 	ssize_t len; // IN
>> #define DATA (1<<1)
>> #define DATA_EOF (1<<2)
>> #define DATA_KERNEL (1<<3)
>> #define DATA_RAMDISK (1<<4)
>> 	unsigned int flags; // IN
>> 	unsigned int status; // OUT
>> };
>>
>> which would in a loop just iterate over the payloads and
>> let the hypervisor stick it in the crashkernel space.
>>
>> This is all hand-waving of course. There probably would be a need
>> to figure out how much space you have in the reserved Xen
>> 'crashkernel' memory region too.
> This is probably a mad idea but it's Monday morning and I'm sleep
> deprived so I'll throw it out there...
>
> What about adding DOMID_KEXEC (similar to DOMID_IO etc)? This would allow
> dom0 to map the kexec memory space with the usual privcmd mmap
> hypercalls and build things in it directly.
>
> OK, I suspect this might not be practical for a variety of reasons (lack
> of a p2m for such domains so no way to find out the list of mfns, dom0
> userspace simply doesn't have sufficient context to write sensible
> things here, etc) but maybe someone has a better head on today...
>
> Ian.

Given that /sbin/kexec creates a binary blob in memory, surely the
simplest thing is to get it to suitably mlock() the region and give a list
of VAs to the hypervisor.

This way, Xen can properly take care of what it does with the information
and where.  For example, at the moment, allowing dom0 to choose what
gets overwritten in the Xen crash area is a recipe for disaster if a
crash occurs midway through loading/reloading the crash kernel.


~Andrew



Re: [Xen-devel] [GIT PULL] xen: features and fixes for 4.8-rc0

2016-07-27 Thread Andrew Cooper
On 27/07/16 19:42, Linus Torvalds wrote:
> On Wed, Jul 27, 2016 at 6:45 AM, David Vrabel  wrote:
>> Shannon Zhao (16):
>>   Xen: ACPI: Hide UART used by Xen
> So this caused a trivial conflict. No biggie, it wasn't bad and the
> patch was acked by Rafael. However, looking at it made me somewhat
> unhappy.
>
> Should the device entry in ACPI really be hidden unconditionally? In
> particular, if we are *not* running under virtualization, it sounds
> wrong to hide it.
>
> Comments? Am I missing something?

The purpose of the ACPI STAO table (Status Override table, ratified in
ACPI 6.0) is to list items elsewhere in the ACPI namespace which should
be completely ignored.  It is used in cases where it is impossible or
prohibitive to edit the system AML.

The patch itself only hides the UART if instructed to do so by the STAO
table (last hunk).
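
For reference, consuming the table looks roughly like the sketch below; the struct follows the ACPICA layout for STAO (a standard ACPI header, an ignore_uart flag, then a packed list of namepaths), though the exact hunk in the patch may differ:

    #include <linux/acpi.h>

    /* Returns true if a STAO table is present and tells the OSPM to
     * ignore the UART described by SPCR. */
    static bool stao_ignore_uart(void)
    {
        struct stao_table_sketch {      /* mirrors ACPICA's acpi_table_stao */
            struct acpi_table_header header;
            u8 ignore_uart;
            /* followed by a packed list of namestrings to ignore */
        } *stao;

        if (ACPI_FAILURE(acpi_get_table("STAO", 0,
                                        (struct acpi_table_header **)&stao)))
            return false;   /* no STAO: nothing is overridden */

        return stao->ignore_uart != 0;
    }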

~Andrew


Re: [Xen-devel] [GIT PULL] xen: features and fixes for 4.8-rc0

2016-07-27 Thread Andrew Cooper
On 28/07/2016 00:46, Rafael J. Wysocki wrote:
> On Wednesday, July 27, 2016 04:18:32 PM Linus Torvalds wrote:
>> On Wed, Jul 27, 2016 at 4:09 PM, Rafael J. Wysocki  
>> wrote:
>>> The STAO definition document:
>>>
>>> http://wiki.xenproject.org/mediawiki/images/0/02/Status-override-table.pdf
>>>
>>> requires as to "operate as if that device does not exist", quite literally.
>> Well, first off, documentation is one thing, actually changing
>> behavior is something entirely different.
>>
>> Theory and practice are *not* the same.
> Well, the STAO thing is totally new, so we have the documentation only ATM.
>
>> The other worry I have is that I'd be happier if it's still visible in
>> /sys/bus/acpi/ etc. Again, it's one thing to not react to it
>> programmatically, and another thing entirely to actually hide the
>> information from the rest of the system.
>>
>> If I read that patch right, it will be hidden from sysfs too. But
>> Maybe I'm mistaken.
> You're right.
>
> Avoiding enumerating it entirely is somewhat simpler, because it allows
> us to avoid some special casing in a few places IIRC.
>
> I guess we can ask the author of the commit in question to come up with a
> patch to unhide that device and we'll see how that looks.

Well - the entire purpose of STAO is to list system resources which are
genuinely in use by the hypervisor, and genuinely can't be mapped or
used by the kernel (these latter two frequently resulting in crashes or
hangs at early boot).

Identifying that such devices exist is reasonable (it is certainly
possible by dumping the raw acpi tables), but any sysfs tweakable is
going to end in misery.

~Andrew


Re: [PATCH] prctl,x86 Add PR_[GET|SET]_CPUID for controlling the CPUID instruction.

2016-09-14 Thread Andrew Cooper
On 14/09/2016 20:23, Boris Ostrovsky wrote:
> On 09/14/2016 02:52 PM, Andy Lutomirski wrote:
>> On Tue, Sep 13, 2016 at 11:13 PM, Kyle Huey  wrote:
>>> On Mon, Sep 12, 2016 at 9:56 AM, Andy Lutomirski  
>>> wrote:
 You should explicitly check that, if the
 feature is set under Xen PV, then the MSR actually works as
 advertised.  This may require talking to the Xen folks to make sure
 you're testing the right configuration.
>>> This is interesting.  When running under Xen PV the kernel is allowed
>>> to read the real value of MSR_PLATFORM_INFO and see that CPUID
>>> faulting is supported.  But as you suggested, writing to
>>> MSR_MISC_FEATURES_ENABLES doesn't actually enable CPUID faulting, at
>>> least not in any way that works.
>>>
>>> It's not obvious to me how to test this, because when this feature
>>> works, CPUID only faults in userspace, not in the kernel.  Is there
>>> existing code somewhere that runs tests like this in userspace?
>>>
>> Andrew, Boris: should we expect Xen PV to do anything sensible when we
>> write to MSR_PLATFORM_INFO to turn on CPUID faulting?  Should the Xen
>> PV rdmsr hooks or perhaps the hypervisor mask out the feature if it
>> isn't going to be supported?
> The hypervisor uses CPUID faulting so we shouldn't advertise this
> feature to guests.

In the case that the hardware has faulting, or for any HVM guest, the
extra cost of making the feature available to the guest is a single
conditional test in the cpuid path.  This is about as close to zero as a
feature gets.  We really should be offering the feature to guests, and
have it actually working.  The issue here is that it is leaking when we
weren't intending to offer it.

~Andrew


Re: [PATCH] prctl,x86 Add PR_[GET|SET]_CPUID for controlling the CPUID instruction.

2016-09-14 Thread Andrew Cooper
On 14/09/2016 19:52, Andy Lutomirski wrote:
> On Tue, Sep 13, 2016 at 11:13 PM, Kyle Huey  wrote:
>> On Mon, Sep 12, 2016 at 9:56 AM, Andy Lutomirski  wrote:
>>> You should explicitly check that, if the
>>> feature is set under Xen PV, then the MSR actually works as
>>> advertised.  This may require talking to the Xen folks to make sure
>>> you're testing the right configuration.
>> This is interesting.  When running under Xen PV the kernel is allowed
>> to read the real value of MSR_PLATFORM_INFO and see that CPUID
>> faulting is supported.  But as you suggested, writing to
>> MSR_MISC_FEATURES_ENABLES doesn't actually enable CPUID faulting, at
>> least not in any way that works.
>>
>> It's not obvious to me how to test this, because when this feature
>> works, CPUID only faults in userspace, not in the kernel.  Is there
>> existing code somewhere that runs tests like this in userspace?
>>
> Andrew, Boris: should we expect Xen PV to do anything sensible when we
> write to MSR_PLATFORM_INFO to turn on CPUID faulting?

It will drop the write, so "No" is the answer to your question.

> Should the Xen
> PV rdmsr hooks or perhaps the hypervisor mask out the feature if it
> isn't going to be supported?

Yes.

Sadly, whoever hacked these things together in the early days decided
that the simplest solution to getting guests to boot was to allow all
domains default-read access across the CPUID and MSR space, blacklisting
out specific areas known to cause problems.  I am in the process of
sorting this stupidity^W "feature" out, but it is taking a while.

What is the purpose of the check?  I think it might be easiest to just
do the native thing, and raise a bug in general against Xen citing
"incorrect behaviour with respect to MSR_PLATFORM_INFO", get it fixed in
stable trees and pretend that this breakage never happened.

Short of having a small userspace stub check, I can't see any way to
actually test this, and I certainly would prefer to avoid workarounds
which end up like the OSXSAVE detection, which is a complete disaster for
both Linux and Xen.
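
For illustration, such a userspace stub check could look like the sketch below, using the PR_SET_CPUID interface proposed in this series (the constant and semantics come from the patch under review, not from a released kernel): turn faulting on, execute a CPUID, and expect a SIGSEGV if the MSR write really took effect.

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/prctl.h>

    static sigjmp_buf env;

    static void on_sigsegv(int sig)
    {
        siglongjmp(env, 1);             /* CPUID faulted, as desired */
    }

    int main(void)
    {
        unsigned int eax = 0, ebx, ecx, edx;

        signal(SIGSEGV, on_sigsegv);

        /* PR_SET_CPUID: hypothetical, as proposed in this patch series. */
        if (prctl(PR_SET_CPUID, 0, 0, 0, 0))
            return 2;                   /* interface unavailable */

        if (sigsetjmp(env, 1) == 0) {
            asm volatile("cpuid"
                         : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx));
            puts("BUG: faulting advertised but CPUID still executed");
            return 1;
        }

        puts("OK: CPUID faulted in userspace");
        return 0;
    }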

~Andrew


Re: [PATCH] prctl,x86 Add PR_[GET|SET]_CPUID for controlling the CPUID instruction.

2016-09-14 Thread Andrew Cooper
On 14/09/2016 20:36, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 12:28 PM, Andrew Cooper
>  wrote:
>> On 14/09/2016 20:23, Boris Ostrovsky wrote:
>>> On 09/14/2016 02:52 PM, Andy Lutomirski wrote:
>>>> On Tue, Sep 13, 2016 at 11:13 PM, Kyle Huey  wrote:
>>>>> On Mon, Sep 12, 2016 at 9:56 AM, Andy Lutomirski  
>>>>> wrote:
>>>>>> You should explicitly check that, if the
>>>>>> feature is set under Xen PV, then the MSR actually works as
>>>>>> advertised.  This may require talking to the Xen folks to make sure
>>>>>> you're testing the right configuration.
>>>>> This is interesting.  When running under Xen PV the kernel is allowed
>>>>> to read the real value of MSR_PLATFORM_INFO and see that CPUID
>>>>> faulting is supported.  But as you suggested, writing to
>>>>> MSR_MISC_FEATURES_ENABLES doesn't actually enable CPUID faulting, at
>>>>> least not in any way that works.
>>>>>
>>>>> It's not obvious to me how to test this, because when this feature
>>>>> works, CPUID only faults in userspace, not in the kernel.  Is there
>>>>> existing code somewhere that runs tests like this in userspace?
>>>>>
>>>> Andrew, Boris: should we expect Xen PV to do anything sensible when we
>>>> write to MSR_PLATFORM_INFO to turn on CPUID faulting?  Should the Xen
>>>> PV rdmsr hooks or perhaps the hypervisor mask out the feature if it
>>>> isn't going to be supported?
>>> The hypervisor uses CPUID faulting so we shouldn't advertise this
>>> feature to guests.
>> In the case that the hardware has faulting, or for any HVM guest, the
>> extra cost of making the feature available to the guest is a single
>> conditional test in the cpuid path.  This is about as close to zero as a
>> feature gets.  We really should be offering the feature to guests, and
>> have it actually working.  The issue here is that it is leaking when we
>> weren't intending to offer it.
> As long as Xen can fix this one way or the other in reasonably short
> order, I think I'm okay with having Linux incorrectly think it works
> on old Xen hypervisors.

For now, unilaterally hiding CPUID faulting is easy, and simple to backport.

Making the feature available for guests to use is slightly more tricky,
as the toolstack still depends on not being faulted to construct HVM
domains properly.  This is the subject of my current CPUID project,
which will result in dom0 being no more special than any other domain
(in terms of hypervisor-side cpuid handling).

~Andrew


Re: [Xen-devel] [PATCH 2/2] x86/xen/efi: Init only efi struct members used by Xen

2017-06-21 Thread Andrew Cooper
On 20/06/2017 21:14, Daniel Kiper wrote:
> The current approach, wholesale initialization of the efi struct from
> efi_xen, is not good. Usually when a new member is defined it is properly
> initialized in drivers/firmware/efi/efi.c but not in arch/x86/xen/efi.c.
> I have seen this happen a few times already. So, let's initialize only the
> efi struct members used by Xen to avoid such issues in the future.
>
> Signed-off-by: Daniel Kiper 
> ---
>  arch/x86/xen/efi.c |   45 -
>  1 file changed, 12 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/xen/efi.c b/arch/x86/xen/efi.c
> index 30bb2e8..01b9faf 100644
> --- a/arch/x86/xen/efi.c
> +++ b/arch/x86/xen/efi.c
> @@ -54,38 +54,6 @@
>   .tables = EFI_INVALID_TABLE_ADDR  /* Initialized later. */
>  };
>  
> -static const struct efi efi_xen __initconst = {
> - .systab   = NULL, /* Initialized later. */
> - .runtime_version  = 0,/* Initialized later. */
> - .mps  = EFI_INVALID_TABLE_ADDR,
> - .acpi = EFI_INVALID_TABLE_ADDR,
> - .acpi20   = EFI_INVALID_TABLE_ADDR,
> - .smbios   = EFI_INVALID_TABLE_ADDR,
> - .smbios3  = EFI_INVALID_TABLE_ADDR,
> - .sal_systab   = EFI_INVALID_TABLE_ADDR,
> - .boot_info= EFI_INVALID_TABLE_ADDR,
> - .hcdp = EFI_INVALID_TABLE_ADDR,
> - .uga  = EFI_INVALID_TABLE_ADDR,
> - .uv_systab= EFI_INVALID_TABLE_ADDR,
> - .fw_vendor= EFI_INVALID_TABLE_ADDR,
> - .runtime  = EFI_INVALID_TABLE_ADDR,
> - .config_table = EFI_INVALID_TABLE_ADDR,
> - .get_time = xen_efi_get_time,
> - .set_time = xen_efi_set_time,
> - .get_wakeup_time  = xen_efi_get_wakeup_time,
> - .set_wakeup_time  = xen_efi_set_wakeup_time,
> - .get_variable = xen_efi_get_variable,
> - .get_next_variable= xen_efi_get_next_variable,
> - .set_variable = xen_efi_set_variable,
> - .query_variable_info  = xen_efi_query_variable_info,
> - .update_capsule   = xen_efi_update_capsule,
> - .query_capsule_caps   = xen_efi_query_capsule_caps,
> - .get_next_high_mono_count = xen_efi_get_next_high_mono_count,
> - .reset_system = xen_efi_reset_system,
> - .set_virtual_address_map  = NULL, /* Not used under Xen. */
> - .flags= 0 /* Initialized later. */
> -};
> -
>  static efi_system_table_t __init *xen_efi_probe(void)
>  {
>   struct xen_platform_op op = {
> @@ -102,7 +70,18 @@ static efi_system_table_t __init *xen_efi_probe(void)
>  
>   /* Here we know that Xen runs on EFI platform. */
>  
> - efi = efi_xen;
> + efi.get_time = xen_efi_get_time;
> + efi.set_time = xen_efi_set_time;
> + efi.get_wakeup_time = xen_efi_get_wakeup_time;
> + efi.set_wakeup_time = xen_efi_set_wakeup_time;
> + efi.get_variable = xen_efi_get_variable;
> + efi.get_next_variable = xen_efi_get_next_variable;
> + efi.set_variable = xen_efi_set_variable;
> + efi.query_variable_info = xen_efi_query_variable_info;
> + efi.update_capsule = xen_efi_update_capsule;
> + efi.query_capsule_caps = xen_efi_query_capsule_caps;
> + efi.get_next_high_mono_count = xen_efi_get_next_high_mono_count;
> + efi.reset_system = xen_efi_reset_system;

This presumably means that the system default values are already present
in efi at the point that we overwrite some Xen specifics?

If so, surely you need to retain the clobbering of set_virtual_address_map ?
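
For illustration, retaining the clobber would presumably just mean keeping one more assignment alongside the others, mirroring the deleted efi_xen initializer:

    efi.set_virtual_address_map = NULL;  /* Not used under Xen. */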

~Andrew


Re: [Xen-devel] [PATCH v2 1/2] doc, xen: document hypervisor sysfs nodes for xen

2017-05-26 Thread Andrew Cooper
On 26/05/17 13:56, Juergen Gross wrote:
> Today only a few sysfs nodes under /sys/hypervisor/ are documented
> for Xen in Documentation/ABI/testing/sysfs-hypervisor-pmu.
>
> Add the remaining Xen sysfs nodes under /sys/hypervisor/ in a new
> file Documentation/ABI/stable/sysfs-hypervisor-xen and add the Xen
> specific sysfs docs to the MAINTAINERS file.
>
> Signed-off-by: Juergen Gross 
> ---
> V2:
>   - rename file to Documentation/ABI/stable/sysfs-hypervisor-xen in
>     order to reflect Xen dependency
>   - leave pmu entries in old file under testing (Boris Ostrovsky)
> ---
>  Documentation/ABI/stable/sysfs-hypervisor-xen | 107 ++
>  MAINTAINERS                                   |   2 +
>  2 files changed, 109 insertions(+)
>  create mode 100644 Documentation/ABI/stable/sysfs-hypervisor-xen
>
> diff --git a/Documentation/ABI/stable/sysfs-hypervisor-xen b/Documentation/ABI/stable/sysfs-hypervisor-xen
> new file mode 100644
> index ..97e4171508c4
> --- /dev/null
> +++ b/Documentation/ABI/stable/sysfs-hypervisor-xen
> @@ -0,0 +1,107 @@
> +What:           /sys/hypervisor/compilation/compile_date
> +Date:           March 2009
> +KernelVersion:  2.6.30
> +Contact:        xen-de...@lists.xenproject.org
> +Description:    If running under Xen:
> +                Contains the build time stamp of the Xen hypervisor
> +
> +What:           /sys/hypervisor/compilation/compiled_by
> +Date:           March 2009
> +KernelVersion:  2.6.30
> +Contact:        xen-de...@lists.xenproject.org
> +Description:    If running under Xen:
> +                Contains information on who built the Xen hypervisor
> +
> +What:           /sys/hypervisor/compilation/compiler
> +Date:           March 2009
> +KernelVersion:  2.6.30
> +Contact:        xen-de...@lists.xenproject.org
> +Description:    If running under Xen:
> +                Compiler which was used to build the Xen hypervisor
> +
> +What:           /sys/hypervisor/properties/capabilities
> +Date:           March 2009
> +KernelVersion:  2.6.30
> +Contact:        xen-de...@lists.xenproject.org
> +Description:    If running under Xen:
> +                Space separated list of supported guest system types. Each
> +                type is in the format: <class>-<major>.<minor>-<arch>
> +                With:
> +                <class>: "xen" -- x86: paravirtualized, arm: standard
> +                         "hvm" -- x86 only: fully virtualized
> +                <major>: major guest interface version
> +                <minor>: minor guest interface version
> +                <arch>:  architecture, e.g.:
> +                         "x86_32": 32 bit x86 guest without PAE
> +                         "x86_32p": 32 bit x86 guest with PAE
> +                         "x86_64": 64 bit x86 guest
> +                         "armv7l": 32 bit arm guest
> +                         "aarch64": 64 bit arm guest
> +
> +What:           /sys/hypervisor/properties/changeset
> +Date:           March 2009
> +KernelVersion:  2.6.30
> +Contact:        xen-de...@lists.xenproject.org
> +Description:    If running under Xen:
> +                Changeset of the hypervisor (git commit)
> +
> +What:           /sys/hypervisor/properties/features
> +Date:           March 2009
> +KernelVersion:  2.6.30
> +Contact:        xen-de...@lists.xenproject.org
> +Description:    If running under Xen:
> +                Features the Xen hypervisor supports for the guest as
> +                defined in include/xen/interface/features.h printed as a
> +                hex value.
> +
> +What:           /sys/hypervisor/properties/pagesize
> +Date:           March 2009
> +KernelVersion:  2.6.30
> +Contact:        xen-de...@lists.xenproject.org
> +Description:    If running under Xen:
> +                Default page size of the hypervisor printed as a hex value.
> +
> +What:           /sys/hypervisor/properties/virtual_start
> +Date:           March 2009
> +KernelVersion:  2.6.30
> +Contact:        xen-de...@lists.xenproject.org
> +Description:    If running under Xen:
> +                Virtual address of the hypervisor as a hex value.

All properties above here, as well as "extra" below, may be deliberately
elided by Xen.  Is it perhaps worth having a general note to this effect
at the top of the file?

~Andrew


Re: [Xen-devel] ce56a86e2a ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses"): kernel BUG at arch/x86/mm/physaddr.c:79!

2017-10-26 Thread Andrew Cooper
On 26/10/17 20:29, Sander Eikelenboom wrote:
> On 26/10/17 19:49, Craig Bergstrom wrote:
>> Sander, thanks for the details, they've been very useful.
>>
>> I suspect that your host system's mem=2048M parameter is causing the
>> problem.  Any chance you can confirm by removing the parameter and
>> running the guest code path?
> I removed it, but kept the hypervisor limiting dom0 memory to 2046M intact 
> (in grub using the xen bootcmd: 
> "multiboot   /xen-4.10.gz  dom0_mem=2048M,max:2048M ."
>
> Unfortunately that doesn't change anything, the guest still fails to start 
> with the same errors.
>
>> More specifically, since you're telling the kernel that its high
>> memory address is at 2048M and your device is at 0xfe1fe000 (~4G), the
>> new mmap() limits are preventing you from mapping addresses that are
>> explicitly disallowed by the parameter.
>>
> Which would probably mean the current patch prohibits hard limiting the dom0
> memory to a certain value (below 4G), at least in combination with
> PCI-passthrough. So the only thing left would be to have no hard memory
> restriction on dom0 and rely on auto-ballooning, but I'm not a great fan
> of that.
>
> I don't know how KVM handles setting memory limits for the host system, but 
> perhaps it suffers from the same issue.
>
> I also tried the patch from one of your last mails to make the check "less 
> strict", 
> but still get the same errors (when using the hard memory limits).

dom0_mem=2048M,max:2048M is used to describe how much RAM the guest has,
not its maximum address.  (Whether this is how PVops actually interprets
the information and passes it into Linux is a different matter.  I will
have to defer to Boris/Juergen on that side of things.)

For RAM, PV guests will get a scattering of frames wherever Xen chooses
to allocate them, and are likely to not be contiguous or adjacent.

For devices, PV guests do get mappings to the real system BARs, which
will be the real low and high MMIO holes.

~Andrew


Re: [Xen-devel] [RFC PATCH] Use vAPIC when doing IPI for PVHVM guests.

2015-10-08 Thread Andrew Cooper
On 08/10/15 06:05, Juergen Gross wrote:
> On 10/07/2015 10:21 PM, Konrad Rzeszutek Wilk wrote:
>> Hey,
>>
>> I was running some tools in which we would heavily do rescheduling
>> of events - and realized to my surprise that the event channels (and
>> the hypercall) would slow things down. If I used the vAPIC with its
>> IPI support (so no VMEXIT) I got much much better performance.
>>
>> Now this is an RFC because:
>>   1). I hadn't verified from the xentrace how many fewer VMEXITs we get.
>>  But I remember Boris's patches and they gave at least 10%.
>>  I think this will get the same performance or even better.
>>
>>   2). I don't know what to do with migration. That is if the guest
>>  migrates to older hardware it needs to recheck this I presume?
>
> Same problem applies to many other features. In case you want to
> migrate to a machine with fewer features you'd have to mask those
> features in the cpuid data of the domain.

Those leaves in particular are from the HV set rather than the plain
featureset.  One way or another there will be an APIC to use, but those
features are expected to appear/disappear across migrate to indicate
whether hardware assistance is in use or not.

Therefore, they should be resampled and re-acted-upon in the resume path.

~Andrew


Re: [PATCH v2 5/5] x86/pti: Do not enable PTI on fixed Intel processors

2018-01-23 Thread Andrew Cooper
On 23/01/18 18:45, Alan Cox wrote:
> On Tue, 23 Jan 2018 16:52:55 +
> David Woodhouse  wrote:
>
>> When they advertise the IA32_ARCH_CAPABILITIES MSR and it has the RDCL_NO
>> bit set, they don't need KPTI either.
> This is starting to get messy because we will eventually need to integrate
>
> AMD processors              - no meltdown but spectre
> VIA processors              - probably no vulnerabilities,
>                               at least on the old ones
> Intel with ND set           - no meltdown
> Anybody with no speculation - no meltdown, no spectre, no id bit
>
>
>
> and it expands a lot with all sorts of 32bit processors. Would it make
> more sense to make it table driven or do we want a separate function so
> we can do:
>
> if (!in_order_cpu()) {
> }
>
> around the whole lot?  I'm guessing the latter makes sense, then
> something like this patch I'm running on my old atom widgets in 64bit
> mode
>
> static __initdata struct x86_cpu_id cpu_in_order[] = {
> { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_CEDARVIEW, X86_FEATURE_ANY },
> { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_CLOVERVIEW, X86_FEATURE_ANY },
> { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_LINCROFT, X86_FEATURE_ANY },
> { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_PENWELL, X86_FEATURE_ANY },
> { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_PINEVIEW, X86_FEATURE_ANY },
> {}
> };
>
> static int in_order_cpu(void)
> {
>   /* Processors with CPU id etc */
>   if (x86_match_cpu(cpu_in_order))
>   return 1;
>   /* Other rules here */
>   return 0;
> }

Why does in-order vs out-of-order matter?

There are leaky SP3 gadgets which satisfy in-order requirements, so long
as the processor is capable of speculating 3 instructions past an
unresolved branch.

What would (at a guess) save an in-order speculative processor from
being vulnerable is if memory reads are issued and resolve in program
order, but in that case, it is not the in-order property of the
processor which makes it safe.

~Andrew


Re: [Xen-devel] HPET enabled in BIOS, not presented as available_clocksource -- config, kernel code, &/or BIOS?

2017-05-13 Thread Andrew Cooper
On 13/05/2017 20:49, PGNet Dev wrote:
> On 5/13/17 12:38 PM, Andrew Cooper wrote:
>> What is the issue here?
>>
>> Xen owns (and may use) any HPETs in the system.  They are purposefully
>> unavailable to even dom0.
> The issue is that, when booting to Xen, hpet is not advertised as an 
> available clocksource, AND reports the hpet boot error pointed out by Randy.
>
> Following 
>
>   
> https://wiki.xen.org/wiki/Xen_power_management#HPET_as_broadcast_timer_source_.28clocksource.29_.3D
>
> there's discussion there re: 'if HPET is available / not missing'.
>
> It appears to be available only booting to non-Xen.
>
> What specific indication does one look for that Xen's using available hpet?
>

Ok.  Lack of a clocksource is to be expected.

The reason why the HPETs are unavailable is that dom0 is not in a position
to program them; dom0 doesn't know what Xen has set up in the IDT.

Use `xl dmesg` to get to the hypervisor dmesg log.  You should see
mention of the HPET in there if Xen has found it.

~Andrew


Re: [Xen-devel] HPET enabled in BIOS, not presented as available_clocksource -- config, kernel code, &/or BIOS?

2017-05-13 Thread Andrew Cooper
On 13/05/2017 20:28, Randy Dunlap wrote:
> On 05/13/17 11:26, PGNet Dev wrote:
>> On 5/13/17 10:41 AM, Randy Dunlap wrote:
>>> [adding HPET driver maintainer]
>> Thanks
>>
>>> A couple of comments below...
 In BIOS, HPET's enabled.
>>> How about if you just boot Linux without Xen?  Does HPET show up then?
>> yes, it appears so:
>>
>> cat devices/system/clocksource/clocksource0/available
>>   tsc hpet acpi_pm
> Adding xen mailing list:
>
> Is HPET support a known issue in Xen?

What is the issue here?

Xen owns (and may use) any HPETs in the system.  They are purposefully
unavailable to even dom0.

~Andrew


Re: [Xen-devel] HPET enabled in BIOS, not presented as available_clocksource -- config, kernel code, &/or BIOS?

2017-05-13 Thread Andrew Cooper
On 13/05/2017 21:05, PGNet Dev wrote:
> On 5/13/17 12:59 PM, Andrew Cooper wrote:
>> Ok.  Lack of a clocksource is to be expected.
>>
>> The reason why the HPETs are unavailable is that dom0 is not in a position
>> to program them; dom0 doesn't know what Xen has set up in the IDT.
>>
>> Use `xl dmesg` to get to the hypervisor dmesg log.  You should see
>> mention of the HPET in there if Xen has found it.
>
> back to the error at hand ...
>
>  xl dmesg | grep -i hpet
>   [1.365876] hpet_acpi_add: no address or irqs in _CRS
>   [1.365876] hpet_acpi_add: no address or irqs in _CRS
>
> again, only present when booting with Xen.
>
> same kernel, no Xen, no such error.

We don't have code like that in upstream Xen.  No function with that
name, or a string which looks like that error message.

http://marc.info/?l=linux-kernel&m=149464267427111&w=2 indicates that
you are using a SuSE hypervisor.

Jan/Juergen: Any ideas? This looks as if it is something local to your
patch queue.

~Andrew


Re: [Xen-devel] HPET enabled in BIOS, not presented as available_clocksource -- config, kernel code, &/or BIOS?

2017-05-14 Thread Andrew Cooper
On 14/05/17 00:17, PGNet Dev wrote:
> On 5/13/17 3:15 PM, Valentin Vidic wrote:
>> Try booting without 'hpet=force,verbose clocksource=hpet' and it should
>> select xen by default:
> Nope. Well, not quite ...
>
> With both 
>
>   'hpet=force,verbose clocksource=hpet'
>
> removed, I end up with
>
>   cat /sys/devices/system/clocksource/clocksource0/available_clocksource
>   tsc xen
>   cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>   tsc
>
> But with 
>
>   clocksource=xen
>
> *explicitly* added
>
>   cat /sys/devices/system/clocksource/clocksource0/available_clocksource
>   tsc xen
>   cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>   xen
>
> and in *console*, NOT dmesg, output,
>
>   grep -i hpet tmp.txt
>   (XEN) ACPI: HPET 9E8298F8, 0038 (r1 SUPERM SMCI--MB  1072009 AMI.5)
>   (XEN) ACPI: HPET id: 0x8086a701 base: 0xfed0
>   (XEN) [VT-D] MSI HPET: :f0:0f.0
>   (XEN) Platform timer is 14.318MHz HPET
>   [0.00] ACPI: HPET 0x9E8298F8 38 (v01 SUPERM SMCI--MB 01072009 AMI. 00
>   [0.00] ACPI: HPET id: 0x8086a701 base: 0xfed0
>   [0.00] ACPI: HPET 0x9E8298F8 38 (v01 SUPERM SMCI--MB 01072009 AMI. 00
>   [0.00] ACPI: HPET id: 0x8086a701 base: 0xfed0
>   [8.515245] hpet_acpi_add: no address or irqs in _CRS
>   [8.515245] hpet_acpi_add: no address or irqs in _CRS
>   (XEN) [2017-05-13 23:04:27] HVM1 save: HPET
>
>
>
> and
>
>   dmesg | grep -i clocksource | grep -v line:
>   [0.00] clocksource: refined-jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 7645519600211568 ns
>   [0.004000] clocksource: xen: mask: 0x max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
>   [0.375709] clocksource: jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 764504178510 ns
>   [4.656634] clocksource: Switched to clocksource xen
>   [8.912897] clocksource: tsc: mask: 0x max_cycles: 0x2c94dffea94, max_idle_ns: 440795361700 ns
>
> jiffies, now? hm. no idea where that came from. and why the 'tsc' ?
>
> So I'm still unclear -- is this^ now, correctly "all" using MSI/HPET?

What are you trying to achieve?  It is still not clear despite all on
this thread.

The Linux HPET error messages are non-ideal, but there is no way dom0
will ever be able to use clocksource=hpet when running under Xen.

~Andrew


Re: [PATCH] x86/retpoline: Fill RSB on context switch for affected CPUs

2018-01-12 Thread Andrew Cooper
On 12/01/18 17:49, David Woodhouse wrote:
> When we context switch from a shallow call stack to a deeper one, as we
> 'ret' up the deeper side we may encounter RSB entries (predictions for
> where the 'ret' goes to) which were populated in userspace. This is
> problematic if we have neither SMEP nor KPTI (the latter of which marks
> userspace pages as NX for the kernel), as malicious code in userspace
> may then be executed speculatively. So overwrite the CPU's return
> prediction stack with calls which are predicted to return to an infinite
> loop, to "capture" speculation if this happens. This is required both
> for retpoline, and also in conjunction with IBRS for !SMEP && !KPTI.
>
> On Skylake+ the problem is slightly different, and an *underflow* of the
> RSB may cause errant branch predictions to occur. So there it's not so
> much overwrite, as *filling* the RSB to attempt to prevent it getting
> empty. This is only a partial solution for Skylake+ since there are many
> other conditions which may result in the RSB becoming empty. The full
> solution on Skylake+ is to use IBRS, which will prevent the problem even
> when the RSB becomes empty. With IBRS, the RSB-stuffing will not be
> required on context switch.

If you unconditionally fill the RSB on every entry to supervisor mode,
then there are never guest-controlled RSB values to be found.

With that property (and IBRS to protect Skylake+), you shouldn't need
RSB filling anywhere in the middle.
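
For reference, RSB overwrite/fill sequences of this kind are a short chain of CALLs whose return addresses all land on a speculation trap; a minimal sketch follows (not the kernel's actual __FILL_RETURN_BUFFER macro, and the 32-entry depth is an assumption, as the real value is microarchitecture specific):

    /* Minimal x86-64 sketch of RSB stuffing: each CALL pushes one RSB
     * entry pointing at the pause/lfence trap; afterwards the 32 return
     * addresses are dropped from the stack in one go. */
    static inline void rsb_stuff_sketch(void)
    {
        asm volatile(
            "       mov     $32, %%ecx      \n"
            "1:     call    2f              \n" /* pushes an RSB entry  */
            "3:     pause                   \n" /* trap for any         */
            "       lfence                  \n" /* speculated 'ret'     */
            "       jmp     3b              \n"
            "2:     dec     %%ecx           \n"
            "       jnz     1b              \n"
            "       add     $(32*8), %%rsp  \n" /* unwind the pushes    */
            ::: "rcx", "memory", "cc");
    }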

~Andrew


Re: [PATCH] x86/xen: init %gs very early to avoid page faults with stack protector

2018-02-01 Thread Andrew Cooper
On 01/02/18 12:16, Juergen Gross wrote:
> When running as a Xen pv guest, %gs is initialized some time after
> C code is started. Depending on stack protector usage this might be
> too late, resulting in page faults.
>
> So set up %gs and MSR_GS_BASE in assembly code already.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Juergen Gross 
> ---
>  arch/x86/xen/xen-head.S | 14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
> index 497cc55a0c16..b47d87076efb 100644
> --- a/arch/x86/xen/xen-head.S
> +++ b/arch/x86/xen/xen-head.S
> @@ -9,7 +9,9 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
> +#include 
>  #include 
>  
>  #include 
> @@ -35,6 +37,18 @@ ENTRY(startup_xen)
>   mov %_ASM_SI, xen_start_info
>   mov $init_thread_union+THREAD_SIZE, %_ASM_SP
>  
> + /* Set up %gs.
> +  *
> +  * The base of %gs always points to the bottom of the irqstack
> +  * union.  If the stack protector canary is enabled, it is
> +  * located at %gs:40.  Note that, on SMP, the boot cpu uses
> +  * init data section till per cpu areas are set up.
> +  */
> + movl$MSR_GS_BASE,%ecx
> + movq$INIT_PER_CPU_VAR(irq_stack_union),%rax
> + cdq
> + wrmsr

You surely want a #ifdef __x86_64__ ?  This path is common to the 32bit
entry as well?

~Andrew

> +
>   jmp xen_start_kernel
>  END(startup_xen)
>   __FINIT



Re: [PATCH 23/35] x86/speculation: Add basic speculation control code

2018-01-18 Thread Andrew Cooper
On 18/01/2018 23:25, Andy Lutomirski wrote:
> On Thu, Jan 18, 2018 at 11:08 AM, Andrea Arcangeli  
> wrote:
>> On Thu, Jan 18, 2018 at 12:24:31PM -0600, Josh Poimboeuf wrote:
>>> On Thu, Jan 18, 2018 at 06:12:36PM +0100, Paolo Bonzini wrote:
 On 18/01/2018 18:08, Dave Hansen wrote:
> On 01/18/2018 08:37 AM, Josh Poimboeuf wrote:
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -3932,6 +3932,7 @@
>>> retpoline - replace indirect branches
>>> retpoline,generic - google's original retpoline
>>> retpoline,amd - AMD-specific minimal thunk
>>> +   ibrs  - Intel: Indirect Branch 
>>> Restricted Speculation
>> Are there plans to add spectre_v2=ibrs_always to prevent SMT-based
>> attacks?
> What does "ibrs_always" mean to you?
>>> Maybe ibrs_always isn't the best name.  Basically we need an option to
>>> protect against user-user attacks via SMT.
>>>
>>> It could be implemented with IBRS=1, or STIBP, or as part of the
>>> mythical IBRS_ATT.
>> User stibp or user ibrs would be different things, both would be valid
>> for different use cases, and the user stibp should perform better.
>>
>> Leaving ibrs on when returning from kernel to userland (or setting
>> ibrs if kernel used retpolines instead of ibrs) achieves stronger
>> semantics than just setting SPEC_CTRL with stibp when returning to
>> userland.
> I read the whitepaper that documented the new MSRs a couple days ago
> and I'm now completely unable to find it.  If anyone could send the
> link, that would be great.

https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf

~Andrew


Re: [PATCH v2 2/8] x86/cpufeatures: Add AMD feature bits for Prediction Command

2018-01-21 Thread Andrew Cooper
On 21/01/18 17:50, Tom Lendacky wrote:
> On 1/21/2018 3:49 AM, David Woodhouse wrote:
>> AMD doesn't implement the Speculation Control MSR that Intel does, but
>> the Prediction Control MSR does exist and is advertised by a separate
>> CPUID bit. Add support for that.
>>
>> Signed-off-by: David Woodhouse 
>> ---
>>  arch/x86/include/asm/cpufeatures.h | 1 +
>>  arch/x86/kernel/cpu/scattered.c| 1 +
>>  2 files changed, 2 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
>> index 2efb8d4..8c9e5c0 100644
>> --- a/arch/x86/include/asm/cpufeatures.h
>> +++ b/arch/x86/include/asm/cpufeatures.h
>> @@ -207,6 +207,7 @@
>>  #define X86_FEATURE_RETPOLINE_AMD   ( 7*32+13) /* AMD Retpoline mitigation for Spectre variant 2 */
>>  #define X86_FEATURE_INTEL_PPIN      ( 7*32+14) /* Intel Processor Inventory Number */
>>  
>> +#define X86_FEATURE_AMD_PRED_CMD    ( 7*32+17) /* Prediction Command MSR (AMD) */
>>  #define X86_FEATURE_MBA             ( 7*32+18) /* Memory Bandwidth Allocation */
>>  #define X86_FEATURE_RSB_CTXSW       ( 7*32+19) /* Fill RSB on context switches */
>>  
>> diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
>> index df11f5d..4eb90b2 100644
>> --- a/arch/x86/kernel/cpu/scattered.c
>> +++ b/arch/x86/kernel/cpu/scattered.c
>> @@ -28,6 +28,7 @@ static const struct cpuid_bit cpuid_bits[] = {
>>  	{ X86_FEATURE_HW_PSTATE,     CPUID_EDX,  7, 0x80000007, 0 },
>>  	{ X86_FEATURE_CPB,           CPUID_EDX,  9, 0x80000007, 0 },
>>  	{ X86_FEATURE_PROC_FEEDBACK, CPUID_EDX, 11, 0x80000007, 0 },
>> +	{ X86_FEATURE_AMD_PRED_CMD,  CPUID_EBX, 12, 0x80000008, 0 },
> I replied to the previous version, but I'll add it here, too.
>
> This should be moved to the existing 0x80000008/EBX entry rather than have
> it in scattered.
>
> Also, there will be a total of three bits:
>   IBPB:  0x80000008 EBX[12]
>   IBRS:  0x80000008 EBX[14]
>   STIBP: 0x80000008 EBX[15]
>
> Since IBRS and STIBP share the same MSR, if a processor only supports
> STIBP (MSR bit 1), for ease of software implementation the processor
> does not GP fault attempts to write bit 0. In a similar manner, if a
> processor only supports IBRS (MSR bit 0), the processor does not GP
> fault attempts to write bit 1.

Are you able to comment on the read behaviour after a write which is
ignored?

If the behaviour is "read as written" then virt cases are fine.  If the
"ignore" causes a zero to be read back, then we're still going to need
to intercept and emulate all VM accesses.

Thanks,

~Andrew


Re: [PATCH v2 5/8] x86/speculation: Add basic support for IBPB

2018-01-21 Thread Andrew Cooper
On 21/01/18 19:31, David Woodhouse wrote:
> On Sun, 2018-01-21 at 20:01 +0100, Borislav Petkov wrote:
>> so execution runs directly into the MSR write and the JMP is gone.
>>
>> So I don't see indirect branches anywhere...
> Wait until the wind changes.
>
> Congratulations, you've just turned a potential GCC missed optimisation
> into a kernel bug. We don't *care* that it's unlikely that GCC will
> miss that optimisation. The point is that it doesn't *have* to do it,
> and we don't *check*.
>
> cf. https://lkml.org/lkml/2018/1/12/176
>
>
> ... after which Peter went off and implemented that check, which is all
> fine and dandy but let's not rely on backporting that too.

It doesn't matter if an attacker can use SP1 to try and skip the IBPB.

Exits to userspace/guest are serialising (with some retroactive updates
to the architecture spec coming), so an attacker can't cause victim code
to be executed before speculation has caught up and noticed that the
IBPB did need to happen.

~Andrew


Re: [PATCH v2 5/8] x86/speculation: Add basic support for IBPB

2018-01-21 Thread Andrew Cooper
On 21/01/2018 20:04, David Woodhouse wrote:
> On Sun, 2018-01-21 at 19:37 +0000, Andrew Cooper wrote:
>> It doesn't matter if an attacker can use SP1 to try and skip the IBPB.
>>
>> Exits to userspace/guest are serialising (with some retroactive updates
>> to the architecture spec coming), so an attacker can't cause victim code
>> to be executed before speculation has caught up and noticed that the
>> IBPB did need to happen.
> For the specific case of IBPB, knowing what we do about non-
> architectural behaviour, that's probably true.
>
> In the early patch sets in both Xen and Linux, we did have a
> conditional branch on {sys,hyper}call entry that blithely let the CPU
> speculate all the way to the {sys,hyper}call table jump. No exit to
> userspace/guest there.

Right, but that is a different situation.  That is an attacker trying to
attack the kernel/hypervisor directly using SP2, which is mitigated with
retpoline/lfence+jmp/IBRS (as appropriate).

This IBPB case is an attacker trying to attack a new piece of userspace
using SP2, and furthermore, trying to use SP1 to skip the IBPB.

It is an inherent property of all these issues that an attacker can't
cause the misdirected basic blocks to be retired, which means they can't
change the actual behaviour of execution in supervisor context.

As the exit to user/guest context is serialising, the only thing the
attacker can usefully do is tickle a speculatively-leaky block.

> Which is why I've been saying I want call sites to have an *explicit*
> comment saying why they're safe to use conditional branches without
> taking extra steps to be safe, like the 'else lfence'. And why I'd
> really like the underlying primitives to *support* being fixed at
> runtime.

I'm afraid that, by this logic, every conditional branch needs a
comment, and that is impractical.  I don't see what is special about
this conditional branch vs every other conditional branch in the
codebase, and calling it out in isolation feels wrong.

~Andrew


Re: [PATCH v2 2/8] x86/cpufeatures: Add AMD feature bits for Prediction Command

2018-01-22 Thread Andrew Cooper
On 22/01/18 14:31, Tom Lendacky wrote:
> On 1/21/2018 12:01 PM, Andrew Cooper wrote:
>> On 21/01/18 17:50, Tom Lendacky wrote:
>>> On 1/21/2018 3:49 AM, David Woodhouse wrote:
>>>> AMD doesn't implement the Speculation Control MSR that Intel does, but
>>>> the Prediction Control MSR does exist and is advertised by a separate
>>>> CPUID bit. Add support for that.
>>>>
>>>> Signed-off-by: David Woodhouse 
>>>> ---
>>>>  arch/x86/include/asm/cpufeatures.h | 1 +
>>>>  arch/x86/kernel/cpu/scattered.c| 1 +
>>>>  2 files changed, 2 insertions(+)
>>>>
>>>> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
>>>> index 2efb8d4..8c9e5c0 100644
>>>> --- a/arch/x86/include/asm/cpufeatures.h
>>>> +++ b/arch/x86/include/asm/cpufeatures.h
>>>> @@ -207,6 +207,7 @@
>>>>  #define X86_FEATURE_RETPOLINE_AMD ( 7*32+13) /* AMD Retpoline mitigation for Spectre variant 2 */
>>>>  #define X86_FEATURE_INTEL_PPIN( 7*32+14) /* Intel Processor Inventory Number */
>>>>  
>>>> +#define X86_FEATURE_AMD_PRED_CMD  ( 7*32+17) /* Prediction Command MSR (AMD) */
>>>>  #define X86_FEATURE_MBA   ( 7*32+18) /* Memory Bandwidth Allocation */
>>>>  #define X86_FEATURE_RSB_CTXSW ( 7*32+19) /* Fill RSB on context switches */
>>>>  
>>>> diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
>>>> index df11f5d..4eb90b2 100644
>>>> --- a/arch/x86/kernel/cpu/scattered.c
>>>> +++ b/arch/x86/kernel/cpu/scattered.c
>>>> @@ -28,6 +28,7 @@ static const struct cpuid_bit cpuid_bits[] = {
>>>>{ X86_FEATURE_HW_PSTATE,CPUID_EDX,  7, 0x80000007, 0 },
>>>>{ X86_FEATURE_CPB,  CPUID_EDX,  9, 0x80000007, 0 },
>>>>{ X86_FEATURE_PROC_FEEDBACK,CPUID_EDX, 11, 0x80000007, 0 },
>>>> +  { X86_FEATURE_AMD_PRED_CMD, CPUID_EBX, 12, 0x80000008, 0 },
>>> I replied to the previous version, but I'll add it here, too.
>>>
>>> This should be moved to the existing 0x8008/EBX entry rather than have
>>> it in scattered.
>>>
>>> Also, there will be a total of three bits:
>>>   IBPB:  0x80000008 EBX[12]
>>>   IBRS:  0x80000008 EBX[14]
>>>   STIBP: 0x80000008 EBX[15]
>>>
>>> Since IBRS and STIBP share the same MSR, if a processor only supports
>>> STIBP (MSR bit 1), for ease of software implementation the processor
>>> does not GP fault attempts to write bit 0. In a similar manner, if a
>>> processor only supports IBRS (MSR bit 0), the processor does not GP
>>> fault attempts to write bit 1.
>> Are you able to comment on the read behaviour after a write which is
>> ignored?
>>
>> If the behaviour is "read as written" then virt cases are fine.  If the
>> "ignore" causes a zero to be read back, then we're still going to need
>> to intercept and emulate all VM accesses.
> The behavior is "read as written", so the bit will be updated even though
> the support for the bit is not present.

Fantastic!  Thanks for confirming.

~Andrew


Re: [PATCH 0/7] IBRS patch series

2018-01-04 Thread Andrew Cooper
On 04/01/18 19:33, Linus Torvalds wrote:
> On Thu, Jan 4, 2018 at 11:19 AM, David Woodhouse  wrote:
>> On Skylake the target for a 'ret' instruction may also come from the
>> BTB. So if you ever let the RSB (which remembers where the 'call's came
>> from) get empty, you end up vulnerable.
> That sounds like it could cause mispredicts, but it doesn't sound 
> _exploitable_.
>
> Sure, interrupts in between the call instruction and the 'ret' could
> overflow the return stack. And we could migrate to another CPU. And so
> apparently SMM clears the return stack too.
>
> ... but again, none of them sound even remotely _exploitable_.
>
> Remember: it's not mispredicts that leak information. It's *exploits*
> that use forced, very specific mispredicts to leak information.
>
> There's a big difference there. And I think patch authors should keep
> that difference in mind.
>
> For example, flushing the BTB at kernel entry doesn't mean that later
> in-kernel indirect branches don't get predicted, and doesn't even mean
> that they don't get mis-predicted. It only means that an exploit can't
> pre-populate those things and use them for exploits.

Retpoline as a mitigation strategy swaps indirect branches for returns,
to avoid using predictions which come from the BTB, as they can be
poisoned by an attacker.

The problem with Skylake+ is that an RSB underflow falls back to using a
BTB prediction, which allows the attacker to take control of speculation.

Also remember that sibling threads share a BTB, so you can't rely on an
isolated straight-line codepath on the current cpu for safety (e.g. by
issuing an IBPB on every entry to supervisor mode).
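
As a concrete illustration, this is roughly what a retpoline thunk looks
like (a compilable x86-64 sketch using GNU inline assembly; illustrative
only, not the literal kernel __x86.indirect_thunk implementation - build
with -mno-red-zone to be safe):

#include <stdio.h>

/*
 * Sketch: indirect jump to the address in %rax, performed via a 'ret'
 * so the prediction comes from the RSB rather than the (poisonable)
 * indirect branch predictor.
 */
asm (
    ".text\n"
    ".globl retpoline_jmp_rax\n"
    "retpoline_jmp_rax:\n"
    "   call 1f\n"              /* push the trap below; the RSB entry points at it */
    "2: pause\n"                /* captured speculation spins here harmlessly */
    "   lfence\n"
    "   jmp 2b\n"
    "1: mov %rax, (%rsp)\n"     /* overwrite the return address with the real target */
    "   ret\n"                  /* architecturally jumps to *%rax */
);

static void hello(void)
{
    puts("reached via retpoline");
}

int main(void)
{
    /* The moral equivalent of "jmp *%rax", routed through the thunk. */
    asm volatile("call retpoline_jmp_rax"
                 : : "a" (hello)
                 : "rcx", "rdx", "rsi", "rdi",
                   "r8", "r9", "r10", "r11", "memory", "cc");
    return 0;
}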

~Andrew


Re: [PATCH 5/7] x86: Use IBRS for firmware update path

2018-01-04 Thread Andrew Cooper
On 04/01/18 20:05, Greg KH wrote:
> On Thu, Jan 04, 2018 at 09:56:46AM -0800, Tim Chen wrote:
>> From: David Woodhouse 
>>
>> We are impervious to the indirect branch prediction attack with retpoline
>> but firmware won't be, so we still need to set IBRS to protect
>> firmware code execution when calling into firmware at runtime.
> Wait, what?
>
> Maybe it's just the wine from dinner talking, but if the firmware has
> issues, we have bigger things to worry about here, right?  It already
> handed over the "chain of trust" to us, so we have already implicitly
> trusted that the firmware was correct here.  So why do we need to do
> anything about firmware calls in this manner?
>
> Or am I totally missing something else here?

The firmware doesn't have to be malicious to cause problems for the OS.

There is still an open question of what happens in the RSB-to-SMM case,
where the SMM handler empties the RSB just before supervisor code
executes a ret instruction.  Hardware (other than the Skylake+ case
which uses a BTB prediction) speculates to the stale top-of-RSB entry,
for want of anything better to do.  (AMD have confirmed this, Intel
haven't replied to my query yet.)

Therefore, a crafty piece of userspace can stick a speculative leaky
gadget at a linear address which aliases the SMM code, and wait for an
SMI to hit.

To mitigate, a kernel has to hope that the SMM handler doesn't run on
non-identity mappings, and either rely on SMEP being active, or disallow
userspace mmap()'s covering the SMM region.

True, exploiting this is probably on the upper end of the difficulty
scale here, but I'm willing to bet it's not the only unexpected
interaction going on.

~Andrew


Re: Avoid speculative indirect calls in kernel

2018-01-04 Thread Andrew Cooper
On 04/01/2018 23:47, Tom Lendacky wrote:
> On 1/4/2018 2:05 PM, David Woodhouse wrote:
>> On Thu, 2018-01-04 at 14:00 -0600, Tom Lendacky wrote:
>>> Yes, lfence is sufficient.  As long as the target is in the register
>>> before the lfence and we jump through the register all is good, i.e.:
>> Thanks. Can I have a Reviewed-by: for this then please:
> Reviewed-by: Tom Lendacky 
>
> While this works, a more efficient way to do the lfence support would be
> to not use the retpoline in this case.  Changing the indirect jumps to
> do the "mov [rax], rax; lfence; jmp *rax" sequence would be quicker. I'm
> not sure if this is feasible given the need to do a retpoline if you can't
> use lfence, though.

That would be most efficient for AMD, but it isn't compatible with
having a single binary which can mitigate itself most efficiently
wherever it was booted.  On most hardware, we'll want to dynamically
choose between retpoline and lfence depending on vendor.

One option would be to teach GCC/Clang/Other to output alternative
patch-point data for indirect branches in the format Linux/Xen could
consume, and feed this into the alternatives framework.

The practical option to actually deploy in the timeframe is to use
__x86.indirect_thunk.%reg and alternate between retpoline and lfence in
the 15 locations, which does add an unconditional call/jmp over the most
efficient alternative, but allows us to switch the thunk-in-use at boot
time.
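
For reference, the AMD-friendly body such a thunk can be switched to is
tiny; a sketch, compilable as a standalone translation unit, and not the
literal kernel macro:

/*
 * Sketch: the lfence flavour of the indirect thunk.  Call sites always
 * emit "call/jmp <thunk>"; boot-time patching picks which body is live.
 * Depends on LFENCE being dispatch-serialising, hence AMD-only.
 */
asm (
    ".text\n"
    ".globl lfence_thunk_rax\n"
    "lfence_thunk_rax:\n"
    "   lfence\n"       /* dispatch stalls until the target in %rax is architectural */
    "   jmp *%rax\n"    /* so the indirect jump can no longer be misdirected */
);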

~Andrew


Re: [PATCH v5 02/12] x86/retpoline: Add initial retpoline support

2018-01-06 Thread Andrew Cooper
On 06/01/18 11:49, David Woodhouse wrote:
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 372ba3f..40e6e54 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -904,6 +904,11 @@ static void __init early_identify_cpu(struct cpuinfo_x86 
> *c)
>  
>   setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
>   setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
> +#ifdef CONFIG_RETPOLINE
> + setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
> + if (c->x86_vendor == X86_VENDOR_AMD)
> + setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);

This isn't safe.  It needs to be dependent on finding that LFENCEs are
actually dispatch serialising.

In particular, when virtualised, you'll most likely be saddled with the
hypervisors choice of setting, in which case you need to use retpoline
as a fallback.

~Andrew

> +#endif
>  
>   fpu__init_system(c);
>  
>



Re: [PATCH v5 02/12] x86/retpoline: Add initial retpoline support

2018-01-06 Thread Andrew Cooper
On 06/01/18 21:23, Thomas Gleixner wrote:
> On Sat, 6 Jan 2018, Andrew Cooper wrote:
>> On 06/01/18 11:49, David Woodhouse wrote:
>>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>>> index 372ba3f..40e6e54 100644
>>> --- a/arch/x86/kernel/cpu/common.c
>>> +++ b/arch/x86/kernel/cpu/common.c
>>> @@ -904,6 +904,11 @@ static void __init early_identify_cpu(struct 
>>> cpuinfo_x86 *c)
>>>  
>>> setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
>>> setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
>>> +#ifdef CONFIG_RETPOLINE
>>> +   setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
>>> +   if (c->x86_vendor == X86_VENDOR_AMD)
>>> +   setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
>> This isn't safe.  It needs to be dependent on finding that LFENCEs are
>> actually dispatch serialising.
>>
>> In particular, when virtualised, you'll most likely be saddled with the
>> hypervisors choice of setting, in which case you need to use retpoline
>> as a fallback.
> On bare metal we are sure, the virtualization part is a different question.

Leaving virtualisation to one side, how does this cope with pre-SSE2
hardware?

~Andrew


Re: [tip:x86/pti] x86/cpu/AMD: Use LFENCE_RDTSC instead of MFENCE_RDTSC

2018-01-08 Thread Andrew Cooper
On 08/01/18 10:08, Thomas Gleixner wrote:
> On Sat, 6 Jan 2018, tip-bot for Tom Lendacky wrote:
>
>> Commit-ID:  0bf17c102177d5da9363bf8b1e4704b9996d5079
>> Gitweb: 
>> https://git.kernel.org/tip/0bf17c102177d5da9363bf8b1e4704b9996d5079
>> Author: Tom Lendacky 
>> AuthorDate: Fri, 5 Jan 2018 10:07:56 -0600
>> Committer:  Thomas Gleixner 
>> CommitDate: Sat, 6 Jan 2018 21:57:40 +0100
>>
>> x86/cpu/AMD: Use LFENCE_RDTSC instead of MFENCE_RDTSC
>>
>> With LFENCE now a serializing instruction, set the LFENCE_RDTSC
>> feature since the LFENCE instruction has less overhead than the
>> MFENCE instruction.
> Second thoughts on that. As pointed out by someone in one of the insane
> long threads:
>
> What happens if the kernel runs as a guest and
>
>   - the hypervisor did not set the LFENCE to serializing on the host
>
>   - the hypervisor does not allow writing MSR_AMD64_DE_CFG
>
> That would bring the guest into a pretty bad state or am I missing
> something essential here?

What I did in Xen was to attempt to set it, then read it back and see.
If LFENCE still isn't serialising, using retpoline is the only available
mitigation.

My understanding from the folk at AMD is that retpoline is safe to use,
but has higher overhead than the LFENCE approach.
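
A sketch of that set-then-read-back probe in kernel C (assuming the
Linux MSR_F10H_DECFG names and the rdmsrl_safe()/wrmsrl_safe() helpers;
the Xen code is structured similarly but is not reproduced here):

#include <linux/types.h>
#include <asm/msr.h>

/* Returns true only if the "LFENCE serialises" bit sticks. */
static bool lfence_is_serialising(void)
{
    u64 val;

    if (rdmsrl_safe(MSR_F10H_DECFG, &val))
        return false;   /* MSR not readable, e.g. masked by a hypervisor */

    /* Attempt to set the bit; a hypervisor may silently discard this. */
    wrmsrl_safe(MSR_F10H_DECFG, val | MSR_F10H_DECFG_LFENCE_SERIALIZE);

    /* Trust only what can be read back. */
    if (rdmsrl_safe(MSR_F10H_DECFG, &val))
        return false;

    return val & MSR_F10H_DECFG_LFENCE_SERIALIZE;
}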

~Andrew


Re: [PATCH v6 00/10] Retpoline: Avoid speculative indirect calls in kernel

2018-01-08 Thread Andrew Cooper
On 08/01/18 10:42, Paul Turner wrote:
> A sequence for efficiently refilling the RSB is:
> mov $8, %rax;
> .align 16;
>3: call 4f;
>   3p: pause; call 3p;
>  .align 16;
>   4: call 5f;
>   4p: pause; call 4p;
>  .align 16;
>5: dec %rax;
>   jnz 3b;
>   add $(16*8), %rsp;
> This implementation uses 8 loops, with 2 calls per iteration.  This is
> marginally faster than a single call per iteration.  We did not
> observe useful benefit (particularly relative to text size) from
> further unrolling.  This may also be usefully split into smaller (e.g.
> 4 or 8 call)  segments where we can usefully pipeline/intermix with
> other operations.  It includes retpoline type traps so that if an
> entry is consumed, it cannot lead to controlled speculation.  On my
> test system it took ~43 cycles on average.  Note that non-zero
> displacement calls should be used as these may be optimized to not
> interact with the RSB due to their use in fetching RIP for 32-bit
> relocations.

Guidance from both Intel and AMD still states that 32 calls are required
in general.  Is your above code optimised for a specific processor which
you know the RSB to be smaller on?
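
(For comparison, a 32-call variant of the same construct, as a sketch in
GNU inline assembly - x86-64 only, and assuming no red zone, as in
kernel builds:)

/* Stuff the RSB with 32 entries, each pointing at a speculation trap. */
static inline void fill_rsb_32_sketch(void)
{
    unsigned long i;

    asm volatile(
        "   mov $16, %[i]\n"            /* 16 iterations x 2 calls = 32 */
        "1: call 2f\n"
        "3: pause; lfence; jmp 3b\n"    /* trap for any consumed entry */
        "2: call 4f\n"
        "5: pause; lfence; jmp 5b\n"
        "4: dec %[i]\n"
        "   jnz 1b\n"
        "   add %[gap], %%rsp\n"        /* discard the 32 return addresses */
        : [i] "=&r" (i)
        : [gap] "i" (32 * 8)
        : "memory", "cc");
}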

~Andrew


Re: [tip:x86/pti] x86/cpu/AMD: Use LFENCE_RDTSC instead of MFENCE_RDTSC

2018-01-08 Thread Andrew Cooper
On 08/01/18 14:47, Tom Lendacky wrote:
> On 1/8/2018 5:10 AM, Thomas Gleixner wrote:
>> On Mon, 8 Jan 2018, Andrew Cooper wrote:
>>
>>> On 08/01/18 10:08, Thomas Gleixner wrote:
>>>> On Sat, 6 Jan 2018, tip-bot for Tom Lendacky wrote:
>>>>
>>>>> Commit-ID:  0bf17c102177d5da9363bf8b1e4704b9996d5079
>>>>> Gitweb: 
>>>>> https://git.kernel.org/tip/0bf17c102177d5da9363bf8b1e4704b9996d5079
>>>>> Author: Tom Lendacky 
>>>>> AuthorDate: Fri, 5 Jan 2018 10:07:56 -0600
>>>>> Committer:  Thomas Gleixner 
>>>>> CommitDate: Sat, 6 Jan 2018 21:57:40 +0100
>>>>>
>>>>> x86/cpu/AMD: Use LFENCE_RDTSC instead of MFENCE_RDTSC
>>>>>
>>>>> With LFENCE now a serializing instruction, set the LFENCE_RDTSC
>>>>> feature since the LFENCE instruction has less overhead than the
>>>>> MFENCE instruction.
>>>> Second thoughts on that. As pointed out by someone in one of the insane
>>>> long threads:
>>>>
>>>> What happens if the kernel runs as a guest and
>>>>
>>>>   - the hypervisor did not set the LFENCE to serializing on the host
>>>>
>>>>   - the hypervisor does not allow writing MSR_AMD64_DE_CFG
>>>>
>>>> That would bring the guest into a pretty bad state or am I missing
>>>> something essential here?
>>> What I did in Xen was to attempt to set it, then read it back and see. 
>>> If LFENCE still isn't serialising, using retpoline is the only available
>>> mitigation.
>>>
>>> My understanding from the folk at AMD is that retpoline is safe to use,
>>> but has higher overhead than the LFENCE approach.
> Correct, the retpoline will work, it just takes more cycles.
>
>> That still does not help vs. rdtsc_ordered() and LFENCE_RDTSC ...
> Ok, I can add the read-back check before setting the feature flag(s).
>
> But... what about the case where the guest is a different family than
> hypervisor? If we're on, say, a Fam15h hypervisor but the guest is started
> as a Fam0fh guest where the MSR doesn't exist and LFENCE is supposed to be
> serialized?  I'll have to do a rdmsr_safe() and only set the flag(s) if I
> can successfully read the MSR back and validate the bit.

If your hypervisor is lying to you about the primary family, then all
bets are off.  I don't expect there will be any production systems doing
this.

The user can get to keep both pieces if they've decided that this was a
good thing to try.

~Andrew


Re: [RFC PATCH 00/16] PTI support for x86-32

2018-01-16 Thread Andrew Cooper
On 16/01/18 18:59, Linus Torvalds wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel  wrote:
>> One of the things that are surely broken is XEN_PV support.
>> I'd appreciate any help with testing and bugfixing on that
>> front.
> Xen PV and PTI don't work together even on x86-64 afaik, the Xen
> people apparently felt it wasn't worth it.  See the
>
> if (hypervisor_is_type(X86_HYPER_XEN_PV)) {
> pti_print_if_insecure("disabled on XEN PV.");
> return;
> }

64bit PV guests under Xen already have split pagetables.  It is a base
and necessary part of the ABI, because segment limits stopped working in
64bit.

32bit PV guests aren't split, but by far the most efficient way of doing
this is to introduce a new enlightenment and have Xen switch all this
stuff (and IBRS, for that matter) on behalf of the guest kernel on
context switch.

~Andrew


Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

2018-01-17 Thread Andrew Cooper
On 17/01/18 09:02, Joerg Roedel wrote:
> Hi Boris,
>
> thanks for testing this :)
>
> On Tue, Jan 16, 2018 at 09:47:06PM -0500, Boris Ostrovsky wrote:
>> On 01/16/2018 11:36 AM, Joerg Roedel wrote:
>>> +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
>>
>> This (and next patch's SWITCH_TO_ENTRY_STACK) need X86_FEATURE_PTI check.
>>
>> With those macros fixed I was able to boot 32-bit Xen PV guest.
> Hmm, on bare metal the stack switch happens regardless of the
> X86_FEATURE_PTI feature being set, because we always program tss.sp0
> with the sysenter stack. How is the kernel entry stack set up on xen-pv?
> I think something is missing there instead.

There is one single stack registered with Xen, on which you get a normal
exception frame in all cases, even via the registered (virtual)
syscall/sysenter/failsafe handlers.

~Andrew


Re: [PATCH v6 11/10] x86/retpoline: Avoid return buffer underflows on context switch

2018-01-08 Thread Andrew Cooper
On 09/01/2018 00:58, Linus Torvalds wrote:
> On Mon, Jan 8, 2018 at 4:44 PM, Andi Kleen  wrote:
>> Essentially the RSB are hidden registers, and the only way to clear them
>> is the FILL_RETURN_BUFFER sequence.  I don't see how clearing anything else
>> would help?
> Forget theory. Look at practice.
>
> Let's just assume that the attacker can write arbitrarily to the RSB
> state.  Just accept it.
>
> If you accept that, then you turn the question instead into: are there
> things we can do to make that useless to an attacker.
>
> And there the point is that even if you control the RSB contents, you
> need to find something relevant to *put* in the RSB. You need to find
> the gadget that makes that control of the RSB useful.

This is where the problem lies.  An attacker with arbitrary control of
the RSB can redirect speculation arbitrarily.  (And I agree, this is a
good default assumption to take).

If SMEP is not active, speculation can go anywhere, including to a user
controlled gadget which can reload any registers it needs, including
with immediate constants.

If SMEP is active, the attackers control of speculation is restricted to
supervisor executable mappings.

The real question is whether it is worth special casing the SMEP-active
case given that for the SMEP-inactive case, your only viable option is
to refill the RSB and discard any potentially poisoned mappings.

~Andrew


Re: Improve retpoline for Skylake

2018-01-15 Thread Andrew Cooper
On 15/01/18 16:57, Andy Lutomirski wrote:
>
>> On Jan 15, 2018, at 12:26 AM, Jon Masters  wrote:
>>
>>> On 01/12/2018 05:03 PM, Henrique de Moraes Holschuh wrote:
>>> On Fri, 12 Jan 2018, Andi Kleen wrote:
> Skylake still loses if it takes an SMI, right? 
 SMMs are usually rare, especially on servers, and are usually
 not very predictable, and even if you have
>>> FWIW, a data point: SMIs can be generated on demand by userspace on
>>> thinkpad laptops, but they will be triggered from within a kernel
>>> context.  I very much doubt this is a rare pattern...
>> Sure. Just touch some "legacy" hardware that the vendor emulates in a
>> nasty SMI handler. It's definitely not acceptable to assume that SMIs
>> can't be generated under the control of some malicious user code.
>>
>> Our numbers on Skylake weren't bad, and there seem to be all kinds of
>> corner cases, so again, it seems as if IBRS is the safest choice.
>>
> And keep in mind that SMIs generally hit all CPUs at once, making them extra 
> nasty.
>
> Can we get firmware vendors to refill the return buffer just before RSM?

Refill or not, you are aware that a correctly timed SMI in a leaf
function will cause the next ret to speculate into userspace, because
there is guaranteed perturbation in the RSB?  (On the expectation that
the SMM handler isn't entirely devoid of function calls.)

Having firmware refill the RSB only makes a difference if you are on
Skylake+ were RSB underflows are bad, and you're not using IBRS to
protect your indirect predictions.

~Andrew


Re: [PATCH] fix one dead link in ia64/xen.txt

2018-03-20 Thread Andrew Cooper
On 20/03/18 19:56, Dongliang Mu wrote:
> Signed-off-by: Dongliang Mu 
> ---
>  Documentation/ia64/xen.txt | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/ia64/xen.txt b/Documentation/ia64/xen.txt
> index a12c74ce2773..464d4c29b8b5 100644
> --- a/Documentation/ia64/xen.txt
> +++ b/Documentation/ia64/xen.txt
> @@ -26,8 +26,8 @@ Getting and Building Xen and Dom0
>  DomainU OS  : RHEL5
>  
>   1. Download source
> -# hg clone http://xenbits.xensource.com/ext/ia64/xen-unstable.hg
> -# cd xen-unstable.hg
> +# hg clone http://xenbits.xensource.com/ext/ia64/xen-unstable
> +# cd xen-unstable
>  # hg clone http://xenbits.xensource.com/ext/ia64/linux-2.6.18-xen.hg
>  
>   2. # make world

The last commit in that repository is almost 9 years old, and IA64
support was dropped from Xen mainline 6 years ago.

http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=570c311ca2c7a1131570cdfc77e977bc7a9bf4c0

There are a number of other dead links in this doc, and those which
aren't dead refer to Linux 2.6.x.  I'd just remove the entire file,
rather than pretend that any of this still might work.  (If by some
miracle it does still function, it's 10 years behind on security fixes...)

~Andrew


Re: [PATCH] KVM: X86: Fix the decoding of segment overrides in 64bit mode

2018-03-22 Thread Andrew Cooper
On 22/03/2018 10:07, Paolo Bonzini wrote:
> On 22/03/2018 09:34, Wanpeng Li wrote:
>> From: Wanpeng Li 
>>
>> Explicit segment overrides other than %fs and %gs are documented as ignored by
>> both Intel and AMD.
>>
>> In practice, this means that:
>>
>>  * Explicit uses of %ss don't actually yield #SS[0] for non-canonical
>>memory references.
>>  * Explicit uses of %{e,c,d}s don't override %rbp/%rsp-based memory references
>>to yield #GP[0] for non-canonical memory references.
>>
>> Cc: Paolo Bonzini 
>> Cc: Radim Krčmář 
>> Signed-off-by: Wanpeng Li 

When porting fixes from other projects, it is customary to identify so
in the commit message.  In this case, the fix you've ported is
http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=b7dce29d9faf3597d009c853ed1fcbed9f7a7f68

Here is an example of how Xen ports fixes from Linux for the drivers
that we share. 
http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=4e131596f1defec9407b6e60d584a696beaf5d7e


>> ---
>>  arch/x86/kvm/emulate.c | 6 --
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>> index dd88158..5091255 100644
>> --- a/arch/x86/kvm/emulate.c
>> +++ b/arch/x86/kvm/emulate.c
>> @@ -5148,8 +5148,10 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, 
>> void *insn, int insn_len)
>>  case 0x2e:  /* CS override */
>>  case 0x36:  /* SS override */
>>  case 0x3e:  /* DS override */
>> -has_seg_override = true;
>> -ctxt->seg_override = (ctxt->b >> 3) & 3;
>> +if (mode != X86EMUL_MODE_PROT64) {
>> +has_seg_override = true;
>> +ctxt->seg_override = (ctxt->b >> 3) & 3;
>> +}
>>  break;
>>  case 0x64:  /* FS override */
>>  case 0x65:  /* GS override */
>>
> Testcase, please...

If you want to crib from one, this is the testcase I made for Xen.

http://xenbits.xen.org/docs/xtf/test-memop-seg.html

With the ongoing KVM/PVH work, it will soon be easy to run Xen's HVM
test suite unmodified under KVM, but we're not quite there yet.

~Andrew


Re: [PATCH] KVM: X86: Fix the decoding of segment overrides in 64bit mode

2018-03-22 Thread Andrew Cooper
On 22/03/2018 10:42, Paolo Bonzini wrote:
> On 22/03/2018 11:19, Andrew Cooper wrote:
>> On 22/03/2018 10:07, Paolo Bonzini wrote:
>>> On 22/03/2018 09:34, Wanpeng Li wrote:
>>>> From: Wanpeng Li 
>>>>
>>>> Explicit segment overrides other than %fs and %gs are documented as
>>>> ignored by both Intel and AMD.
>>>>
>>>> In practice, this means that:
>>>>
>>>>  * Explicit uses of %ss don't actually yield #SS[0] for non-canonical
>>>>memory references.
>>>>  * Explicit uses of %{e,c,d}s don't override %rbp/%rsp-based memory references
>>>>to yield #GP[0] for non-canonical memory references.
>>>>
>>>> Cc: Paolo Bonzini 
>>>> Cc: Radim Krčmář 
>>>> Signed-off-by: Wanpeng Li 
>> When porting fixes from other projects, it is customary to identify so
>> in the commit message.  In this case, the fix you've ported is
>> http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=b7dce29d9faf3597d009c853ed1fcbed9f7a7f68
>>
>> Here is an example of how Xen ports fixes from Linux for the drivers
>> that we share. 
>> http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=4e131596f1defec9407b6e60d584a696beaf5d7e
> Thanks Andrew.  The code is of course completely different, but the
> commit message is 1:1.  Wanpeng, please acknowledge Jan in v2!

Thanks, but it is actually my patch, which is why I was confused at
seeing my own commit message on LKML.

Also, the chances are that there are similar issues decoding the
instruction info field in the VMCS, which is how I stumbled onto this in
the first place.  I haven't yet fixed that side of things for Xen.

>
>>> Testcase, please...
>> If you want to crib from one, this is the testcase I made for Xen.
>>
>> http://xenbits.xen.org/docs/xtf/test-memop-seg.html
> How does it ensure that the code is executed through the emulator and
> not by the processor?

This test, and most of the tests in general, deliberately set things up
to execute the test cases first on the main processor, and then via the
emulator, and complain when any result is unexpected.

We've got a Force Emulation Prefix (ud2a; .ascii "xen") for doing
magic.  Originally, this was used for PV guests to explicitly request an
emulated CPUID, but I extended it to HVM guests for "emulate the next
instruction", after we had some guest user => guest kernel privilege
escalations because of incorrect emulation.

Fundamentally, any multi-vcpu guest can force an arbitrary instruction
through the emulator, because rewriting a couple of bytes of instruction
stream is far far far faster than a vmexit.  I chose to introduce an
explicit way to force this to occur, for testing purposes.
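
For illustration, this is how a guest-side test can force a single
instruction through the emulator (a sketch; it requires a hypervisor
build with FEP support, enabled on the command line, as noted above):

/* CPUID executed by the hypervisor's emulator rather than the CPU,
 * courtesy of the Force Emulation Prefix (ud2a; .ascii "xen"). */
static inline void emulated_cpuid(unsigned int *eax, unsigned int *ebx,
                                  unsigned int *ecx, unsigned int *edx)
{
    asm volatile("ud2a; .ascii \"xen\"; cpuid"
                 : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
                 : "0" (*eax), "2" (*ecx));
}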

>
>> With the impending KVM/PVH work which is ongoing, it will soon be easy
>> to run Xen's HVM test suite unmodified under KVM, but we're not quite
>> there yet.
> What does the test suite use for console I/O?

Depends on what is available as it boots, but one of the default
consoles is port 0x12.  If things need to be tweaked to work more
cleanly, then that is entirely fine.

~Andrew


Re: [PATCH] KVM: X86: Fix the decoding of segment overrides in 64bit mode

2018-03-22 Thread Andrew Cooper
On 22/03/18 13:39, Wanpeng Li wrote:
> 2018-03-22 20:38 GMT+08:00 Paolo Bonzini :
>> On 22/03/2018 12:04, Andrew Cooper wrote:
>>> We've got a Force Emulation Prefix (ud2a; .ascii "xen") for doing
>>> magic.  Originally, this was used for PV guests to explicitly request an
>>> emulated CPUID, but I extended it to HVM guests for "emulate the next
>>> instruction", after we had some guest user => guest kernel privilege
>>> escalations because of incorrect emulation.
>> Wanpeng, why don't you add it behind a new kvm module parameter? :)
> Great point! I will have a try. Thanks Paolo and Andrew. :)

Using the force emulation prefix requires intercepting #UD, which is in
general a BadThing(tm) for security.  Therefore, we have a build time
configuration option to compile in support, and require that test
systems explicitly opt into using it via a command line parameter.

http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/hvm/hvm.c;h=db52312882205e65b32e587106ca795f4bfab2eb;hb=refs/heads/staging#l3741
is the general #UD intercept handler if you want a reference.  (You can
ignore the cross-vendor part, which is leftovers from
http://developer.amd.com/wordpress/media/2012/10/CrossVendorMigration.pdf )

~Andrew


Re: [PATCH] KVM: X86: Fix the decoding of segment overrides in 64bit mode

2018-03-23 Thread Andrew Cooper
On 23/03/18 14:27, Wanpeng Li wrote:
> 2018-03-22 21:53 GMT+08:00 Andrew Cooper :
>> On 22/03/18 13:39, Wanpeng Li wrote:
>>> 2018-03-22 20:38 GMT+08:00 Paolo Bonzini :
>>>> On 22/03/2018 12:04, Andrew Cooper wrote:
>>>>> We've got a Force Emulation Prefix (ud2a; .ascii "xen") for doing
>>>>> magic.  Originally, this was used for PV guests to explicitly request an
>>>>> emulated CPUID, but I extended it to HVM guests for "emulate the next
>>>>> instruction", after we had some guest user => guest kernel privilege
>>>>> escalations because of incorrect emulation.
>>>> Wanpeng, why don't you add it behind a new kvm module parameter? :)
>>> Great point! I will have a try. Thanks Paolo and Andrew. :)
>> Using the force emulation prefix requires intercepting #UD, which is in
>> general a BadThing(tm) for security.  Therefore, we have a build time
> Yeah, however kvm intercepts and emulates #UD by default, should we
> add a new kvm module parameter to enable it and disable by default?
> Paolo.
>
>> configuration option to compile in support, and require that test
>> systems explicitly opt into using it via a command line parameter.
>>
>> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/hvm/hvm.c;h=db52312882205e65b32e587106ca795f4bfab2eb;hb=refs/heads/staging#l3741
>> is the general #UD intercept handler if you want a reference.  (You can
> Thanks Andrew, it is useful. :) In addition, I didn't see the
> test-memop-seg testcase has "Forced Emulation Prefix", when the prefix
> is added to each instruction in the testcase?

It has ended up substantially more ugly than I first intended, due to
several assembler bugs in older GCC and Clang toolchains.

http://xenbits.xen.org/gitweb/?p=xtf.git;a=blob;f=tests/memop-seg/asm.S;h=698661425bcdc9c181b235e323c2460e06c6e986;hb=HEAD#l35

I previously had FEP passed as a second parameter, but that becomes
prohibitively complicated to extract when testing %ss or %esp.  FEP is
now encoded in the bottom bit of the address passed in.
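
(Concretely, the consuming side decodes it along the lines of this
sketch, assuming the test's targets are at least 2-byte aligned, which
the bottom-bit trick requires:)

#include <stdbool.h>

/* Split the combined parameter back into flag and address. */
static inline bool wants_fep(unsigned long arg)
{
    return arg & 1;
}

static inline unsigned long real_addr(unsigned long arg)
{
    return arg & ~1UL;
}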

This was the cleanest way I could find of testing every combination, but
I'm open to improvements if anyone can spot any.

~Andrew


Re: [Xen-devel] [PATCH v8 1/7] xen/pvh: Split CONFIG_XEN_PVH into CONFIG_PVH and CONFIG_XEN_PVH

2018-12-06 Thread Andrew Cooper
On 06/12/2018 23:30, Paolo Bonzini wrote:
> On 07/12/18 00:11, Boris Ostrovsky wrote:
>> On 12/6/18 5:49 PM, Paolo Bonzini wrote:
>>> On 06/12/18 23:34, Boris Ostrovsky wrote:
 On 12/6/18 5:11 PM, Paolo Bonzini wrote:

> and also
>
>   depends on !EFI
>
> because even though in principle it would be possible to write a PVH
> loader for UEFI, PVH's start info does not support the EFI handover
> protocol.
 But we should be able to build the binary with both EFI and PVH?
>>> Can you?  It's a completely different binary format, the EFI handover
>>> protocol is invoked via a special entry point and needs the Linux header
>>> format, not ELF.
>> Right, but I think it is desirable to be able to build both from the
>> same config file.
> Ah, "make bzImage" and use the vmlinux for PVH, because PVH fetches the
> entry point from the special note.  That's clever. :)
>
> I don't see why it should not work, and if so the "depends on !EFI" is
> indeed unnecessary.

We do strive for single binaries in the Xen world, because that is how
people actually want to consume Xen.

It is for this reason why a single xen.gz binary can be loaded as a
straight ELF (including this PVH boot protocol), or via Multiboot 1 or
2, and even do full EFI if your bootloader is up to date on its
Multiboot2 spec :)

~Andrew


Re: [Xen-devel] [PATCH v2] xen: add new hypercall buffer mapping device

2018-06-18 Thread Andrew Cooper
On 18/06/18 15:36, Juergen Gross wrote:
> +static int privcmd_buf_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct privcmd_buf_private *file_priv = file->private_data;
> + struct privcmd_buf_vma_private *vma_priv;
> + unsigned long count = vma_pages(vma);
> + unsigned int i;
> + int ret = 0;
> +
> + if (!(vma->vm_flags & VM_SHARED) || count > (unsigned long)limit ||

You don't need the unsigned long cast, as limit will be promoted
automatically.
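
(A tiny userspace demonstration of the point, for the record - the
implicit promotion and the explicit cast yield identical comparisons:)

#include <stdio.h>

int main(void)
{
    unsigned long count = 100;
    int limit = 64;

    /* 'limit' is promoted to unsigned long by the usual arithmetic
     * conversions, so both lines below print the same thing. */
    printf("%d\n", count > limit);
    printf("%d\n", count > (unsigned long)limit);
    return 0;
}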

Otherwise, LGTM.

~Andrew


Re: [Xen-devel] [PATCH -next] x86/xen: Fix read buffer overflow

2018-12-18 Thread Andrew Cooper
On 18/12/2018 10:42, YueHaibing wrote:
> On 2018/12/18 16:31, Juergen Gross wrote:
>> On 18/12/2018 09:19, YueHaibing wrote:
>>> Fix smatch warning:
>>>
>>> arch/x86/xen/enlighten_pv.c:649 get_trap_addr() error:
>>>  buffer overflow 'early_idt_handler_array' 32 <= 32
>>>
>>> Fixes: 42b3a4cb5609 ("x86/xen: Support early interrupts in xen pv guests")
>>> Signed-off-by: YueHaibing 
>>> ---
>>>  arch/x86/xen/enlighten_pv.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
>>> index 2f6787f..81f200d 100644
>>> --- a/arch/x86/xen/enlighten_pv.c
>>> +++ b/arch/x86/xen/enlighten_pv.c
>>> @@ -646,7 +646,7 @@ static bool __ref get_trap_addr(void **addr, unsigned int ist)
>>>  
>>> if (nr == ARRAY_SIZE(trap_array) &&
>>> *addr >= (void *)early_idt_handler_array[0] &&
>>> -   *addr < (void *)early_idt_handler_array[NUM_EXCEPTION_VECTORS]) {
>>> +   *addr < (void *)early_idt_handler_array[NUM_EXCEPTION_VECTORS - 1]) {
>>> nr = (*addr - (void *)early_idt_handler_array[0]) /
>>>  EARLY_IDT_HANDLER_SIZE;
>>> *addr = (void *)xen_early_idt_handler_array[nr];
>>>
>> No, this patch is wrong.
>>
>> early_idt_handler_array is a 2-dimensional array:
>>
>> const char
>> early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
>>
>> So above code doesn't do an out of bounds array access, but checks for
>> *addr being in the array or outside of it (note the "<" used for the
>> test).
> Thank you for your explanation.

This looks like a smatch bug.  I'd feed it back upstream.

It is explicitly permitted in the C spec to construct a pointer to
one-past-the-end of an array, for the purposes of a < comparison.
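
A small standalone example of the pattern in question (mirroring the
shape of get_trap_addr()):

#include <stdbool.h>
#include <stdio.h>

#define N 32

static char arr[N][8];

/* Forming &arr[N] (one past the end) is well-defined C, provided it is
 * only compared against and never dereferenced. */
static bool in_array(const void *p)
{
    return p >= (const void *)&arr[0] && p < (const void *)&arr[N];
}

int main(void)
{
    printf("%d\n", in_array(&arr[5][0]));   /* 1: inside the array */
    printf("%d\n", in_array(&arr[N]));      /* 0: one past the end */
    return 0;
}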

I'm not entirely sure where the "32 <= 32" statement is coming from.

~Andrew


Re: [RFC v3 1/2] x86/xen: add xen_is_preemptible_hypercall()

2015-01-22 Thread Andrew Cooper
On 22/01/2015 21:09, Luis R. Rodriguez wrote:
> On Thu, Jan 22, 2015 at 12:01:50PM -0800, Andy Lutomirski wrote:
>> On Thu, Jan 22, 2015 at 11:30 AM, Luis R. Rodriguez  wrote:
>>> On Wed, Jan 21, 2015 at 07:07:36PM -0800, Andy Lutomirski wrote:
>>>> On Wed, Jan 21, 2015 at 6:17 PM, Luis R. Rodriguez
>>>>  wrote:
>>>>> From: "Luis R. Rodriguez" 
>>>>>
>>>>> On kernels with voluntary or no preemption we can run
>>>>> into situations where a hypercall issued through userspace
>>>>> will linger around as it addresses sub-operations in kernel
>>>>> context (multicalls). Such operations can trigger soft lockup
>>>>> detection.
>>>>>
>>>>> We want to address a way to let the kernel voluntarily preempt
>>>>> such calls even on non preempt kernels, to address this we first
>>>>> need to distinguish which hypercalls fall under this category.
>>>>> This implements xen_is_preemptible_hypercall() which lets us do
>>>>> just that by adding a secondary hypercall page, calls made via
>>>>> the new page may be preempted.
>>>>>
>>>>> Andrew had originally submitted a version of this work [0].
>>>>>
>>>>> [0] http://lists.xen.org/archives/html/xen-devel/2014-02/msg01056.html
>>>>>
>>>>> Based on original work by: Andrew Cooper 
>>>>>
>>>>> Cc: Andy Lutomirski 
>>>>> Cc: Borislav Petkov 
>>>>> Cc: David Vrabel 
>>>>> Cc: Thomas Gleixner 
>>>>> Cc: Ingo Molnar 
>>>>> Cc: "H. Peter Anvin" 
>>>>> Cc: x...@kernel.org
>>>>> Cc: Steven Rostedt 
>>>>> Cc: Masami Hiramatsu 
>>>>> Cc: Jan Beulich 
>>>>> Cc: linux-kernel@vger.kernel.org
>>>>> Signed-off-by: Luis R. Rodriguez 
>>>>> ---
>>>>>  arch/x86/include/asm/xen/hypercall.h | 20 
>>>>>  arch/x86/xen/enlighten.c |  7 +++
>>>>>  arch/x86/xen/xen-head.S  | 18 +-
>>>>>  3 files changed, 44 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
>>>>> index ca08a27..221008e 100644
>>>>> --- a/arch/x86/include/asm/xen/hypercall.h
>>>>> +++ b/arch/x86/include/asm/xen/hypercall.h
>>>>> @@ -84,6 +84,22 @@
>>>>>
>>>>>  extern struct { char _entry[32]; } hypercall_page[];
>>>>>
>>>>> +#ifndef CONFIG_PREEMPT
>>>>> +extern struct { char _entry[32]; } preemptible_hypercall_page[];
>>>> A comment somewhere explaining why only non-preemptible kernels have
>>>> preemptible hypercalls might be friendly to some future reader. :)
>>> Good idea, since this section is arch specific, I'll instead add a blurb
>>> explaining this on the upcall.
>>>
>>>>> +
>>>>> +static inline bool xen_is_preemptible_hypercall(struct pt_regs *regs)
>>>>> +{
>>>>> +   return !user_mode_vm(regs) &&
>>>>> +   regs->ip >= (unsigned long)preemptible_hypercall_page &&
>>>>> +   regs->ip < (unsigned long)preemptible_hypercall_page + PAGE_SIZE;
>>>>> +}
>>>> This makes it seem like the page is indeed one page long, but I don't
>>>> see what actually allocates a whole page for it.  What am I missing?
>>> arch/x86/xen/xen-head.S
>>>
>>> .pushsection .text
>>> .balign PAGE_SIZE
>>> ENTRY(hypercall_page)
>>>
>>> #ifndef CONFIG_PREEMPT
>>> ENTRY(preemptible_hypercall_page)
>>> .skip PAGE_SIZE
>>> #endif /* CONFIG_PREEMPT */
>>>
>>> Does that suffice to be sure?
>> This looks like hypercall_page and preemptible_hypercall_page will
>> both be page-aligned but will be the same page.  Should there be
>> another .skip PAGE_SIZE in there?
> I think the trick here was since hypercall_page is already aligned,
> and we are just allocating PAGE_SIZE, we are essentially pegging
> preemptible_hypercall_page right after hypercall_page.
>
> Andrew, David, can you confirm?

Your version is different to my original one (observe the lack of
NEXT_HYPERCALL()s), and I would agree that it would appear as if in your
version, hypercall_page and preemptible_hypercall_page are symbols with
the same address.

nm should give you a quick confirmation one way or another.

~Andrew


Re: [Xen-devel] [RFC v3 2/2] x86/xen: allow privcmd hypercalls to be preempted

2015-01-22 Thread Andrew Cooper
On 22/01/2015 20:58, Andy Lutomirski wrote:
> On Thu, Jan 22, 2015 at 12:37 PM, Steven Rostedt  wrote:
>> On Thu, 22 Jan 2015 12:24:47 -0800
>> Andy Lutomirski  wrote:
>>
 Also, please remove the "notrace", because function tracing goes an
 extra step to not require RCU being visible. The only thing you get
 with notrace is not being able to trace an otherwise traceable function.

>>> Is this also true for kprobes?  And can kprobes nest inside function
>>> tracing hooks?
>> No, kprobes are a bit more fragile than function tracing or tracepoints.
>>
>> And nothing should nest inside a function hook (except for interrupts,
>> they are fine).
>>
> But kprobes do nest inside interrupts, right?
>
>>> The other issue, above and beyond RCU, is that we can't let kprobes
>>> run on the int3 stack.  If Xen upcalls can happen when interrupts are
>>> off, then we may need this protection to prevent that type of
>>> recursion.  (This will be much less scary in 3.20, because userspace
>>> int3 instructions will no longer execute on the int3 stack.)
>> Does this execute between the start of the int3 interrupt handler and
>> the call of do_int3()?
> I doubt it.
>
> The thing I worry about is that, if do_int3 nests inside itself by any
> means (e.g. int3 sends a signal, scheduling for whatever reason
> (really shouldn't happen, but I haven't looked that hard)), then we're
> completely hosed -- the inner int3 will overwrite the outer int3's
> stack frame.  Since I have no idea what Xen upcalls do, I don't know
> whether they can fire inside do_int3.

The upcall is the "you have a virtual interrupt pending" signal and
should behave exactly like an external interrupt.  The exception frame
will appear to have interrupted the correct vcpu context, despite the
actual trip via Xen.

Exceptions are handled as per native, with the xen_write_idt_entry()
PVOP taking care of registering the entry point with Xen, rather than
filling in a real IDT entry.

~Andrew


Re: [Xen-devel] [PATCH v5 2/2] x86/xen: allow privcmd hypercalls to be preempted on 64-bit

2015-01-27 Thread Andrew Cooper
On 27/01/15 08:35, Jan Beulich wrote:
 On 27.01.15 at 02:51,  wrote:
> Even if David told you this would be acceptable, I have to question
> an abstract model of fixing issues on only 64-bit kernels - this may
> be acceptable for distro purposes, but seems hardly the right
> approach for upstream. If 32-bit ones are to become deliberately
> broken, the XEN config option should become dependent on !X86_32.

There are still legitimate reasons to prefer 32bit PV guests over 64bit
ones.  Previous versions of this patch had 32bit support as well.  Why
did you drop it?

~Andrew


Re: [Xen-devel] [PATCH] x86, paravirt, xen: Remove the 64-bit irq_enable_sysexit pvop

2015-04-06 Thread Andrew Cooper
On 06/04/2015 16:29, Andy Lutomirski wrote:
> On Mon, Apr 6, 2015 at 7:10 AM, Konrad Rzeszutek Wilk
>  wrote:
>> On Fri, Apr 03, 2015 at 03:52:30PM -0700, Andy Lutomirski wrote:
>>> [cc: Boris and Konrad.  Whoops]
>>>
>>> On Fri, Apr 3, 2015 at 3:51 PM, Andy Lutomirski  wrote:
 We don't use irq_enable_sysexit on 64-bit kernels any more.  Remove
>> Is there a commit (or name of patch) that explains why 
>> 32-bit-user-space-on-64-bit
>> kernels is unsavory?
> sysexit never tasted very good :-p
>
> We're (hopefully) not breaking 32-bit-user-space-on-64-bit, but we're
> trying an unconventional approach to making the code faster and less
> scary.  As a result, 64-bit kernels won't use sysexit any more.
> Hopefully Xen is okay with the slightly sneaky thing we're doing.
> AFAICT Xen thinks of sysretl and sysexit as slightly funny irets, so I
> don't expect there to be any problem.

64bit PV kernels must bounce through Xen to switch from the kernel to
the user pagetables (since kernel and userspace are both actually
running in ring3 with user pages).

As a result, exit to userspace ends up as a hypercall into Xen which has
an effect very similar to an `iret`, but with some extra fixup in the
background.

I can't foresee any Xen issues as a result of this patch.

~Andrew


[PATCH] [RFC] x86/cpu: Fix SMAP check in PVOPS environments

2015-04-20 Thread Andrew Cooper
There appears to be no formal statement of what pv_irq_ops.save_fl() is
supposed to return precisely.  Native returns the full flags, while lguest and
Xen only return the Interrupt Flag, and both have comments by the
implementations stating that only the Interrupt Flag is looked at.  This may
have been true when initially implemented, but no longer is.

To make matters worse, the Xen PVOP leaves the upper bits undefined, making
the BUG_ON() undefined behaviour.  Experimentally, this now trips for 32bit PV
guests on Broadwell hardware.  The BUG_ON() is consistent for an individual
build, but not consistent for all builds.  It has also been a sitting timebomb
since SMAP support was introduced.

Use native_save_fl() instead, which will obtain an accurate view of the AC
flag.

Signed-off-by: Andrew Cooper 
CC: Thomas Gleixner 
CC: Ingo Molnar 
CC: H. Peter Anvin 
CC: x...@kernel.org
CC: linux-kernel@vger.kernel.org
CC: Konrad Rzeszutek Wilk 
CC: Boris Ostrovsky 
CC: David Vrabel 
CC: xen-devel 
CC: Rusty Russell 
CC: lgu...@lists.ozlabs.org

---
This patch is RFC because I am not certain that native_save_fl() is
necessarily the correct solution on lguest, but it does seem that setup_smap()
wants to check the actual AC bit, rather than an idealised value.

A different approach, given the dual nature of the AC flag now, is to
gate setup_smap() on a kernel rpl of 0.  SMAP necessarily can't be used
in a paravirtual situation where the kernel runs in cpl > 0.

Another different approach would be to formally state that
pv_irq_ops.save_fl() needs to return all the flags, which would make
local_irq_save() safe to use in this circumstance, but that makes a hotpath
longer for the sake of a single boot time check.
---
 arch/x86/kernel/cpu/common.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a62cf04..4f2fded 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -291,10 +291,9 @@ __setup("nosmap", setup_disable_smap);
 
 static __always_inline void setup_smap(struct cpuinfo_x86 *c)
 {
-   unsigned long eflags;
+   unsigned long eflags = native_save_fl();
 
/* This should have been cleared long ago */
-   raw_local_save_flags(eflags);
BUG_ON(eflags & X86_EFLAGS_AC);
 
if (cpu_has(c, X86_FEATURE_SMAP)) {
-- 
1.7.10.4



Re: [PATCH] [RFC] x86/cpu: Fix SMAP check in PVOPS environments

2015-04-21 Thread Andrew Cooper
On 21/04/2015 01:35, Andy Lutomirski wrote:
> On 04/20/2015 10:09 AM, Andrew Cooper wrote:
>> There appears to be no formal statement of what pv_irq_ops.save_fl() is
>> supposed to return precisely.  Native returns the full flags, while
>> lguest and
>> Xen only return the Interrupt Flag, and both have comments by the
>> implementations stating that only the Interrupt Flag is looked at. 
>> This may
>> have been true when initially implemented, but no longer is.
>>
>> To make matters worse, the Xen PVOP leaves the upper bits undefined,
>> making
>> the BUG_ON() undefined behaviour.  Experimentally, this now trips for
>> 32bit PV
>> guests on Broadwell hardware.  The BUG_ON() is consistent for an
>> individual
>> build, but not consistent for all builds.  It has also been a sitting
>> timebomb
>> since SMAP support was introduced.
>>
>> Use native_save_fl() instead, which will obtain an accurate view of
>> the AC
>> flag.
>>
>> Signed-off-by: Andrew Cooper 
>> CC: Thomas Gleixner 
>> CC: Ingo Molnar 
>> CC: H. Peter Anvin 
>> CC: x...@kernel.org
>> CC: linux-kernel@vger.kernel.org
>> CC: Konrad Rzeszutek Wilk 
>> CC: Boris Ostrovsky 
>> CC: David Vrabel 
>> CC: xen-devel 
>> CC: Rusty Russell 
>> CC: lgu...@lists.ozlabs.org
>>
>> ---
>> This patch is RFC because I am not certain that native_save_fl() is
>> necessarily the correct solution on lguest, but it does seem that
>> setup_smap()
>> wants to check the actual AC bit, rather than an idealised value.
>>
>> A different approach, given the dual nature of the AC flag now is to
>> gate
>> setup_smap() on a kernel rpl of 0.  SMAP necessarily can't be used in a
>> paravirtual situation where the kernel runs in cpl > 0.
>>
>> Another different approach would be to formally state that
>> pv_irq_ops.save_fl() needs to return all the flags, which would make
>> local_irq_save() safe to use in this circumstance, but that makes a
>> hotpath
>> longer for the sake of a single boot time check.
>
> ...which reminds me:
>
> Why does native_restore_fl restore anything other than IF?  A branch
> and sti should be considerably faster than popf.

I was wondering about the performance aspect, given a comment in your
patch which removed sysret64, but hadn't had time to investigate yet.

Unfortunately, irq_save()/irq_enable()/irq_restore() appears to be a
common pattern in the kernel, making the irq_restore() disable interrupts.

The performance improvement might be worth explicitly moving the onus
into the caller with irq_maybe_disable()/irq_maybe_enable(), but that
does involve altering a lot of common code for an architecture specific
gain.
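
For concreteness, the shape of Andy's suggestion would be something like
this sketch (assuming the usual X86_EFLAGS_IF definition; this is not
what native_restore_fl() actually does today):

#include <asm/processor-flags.h>

/* Restore only the Interrupt Flag from the saved flags word. */
static inline void restore_if_only(unsigned long flags)
{
    if (flags & X86_EFLAGS_IF)
        asm volatile("sti" : : : "memory");
    /* else: leave interrupts disabled, which they already are in the
     * local_irq_save()/local_irq_restore() pattern. */
}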

>
> Also, if we did this, could Xen use PVI and then use native_restore_fl
> and avoid lots of pvops?

Xen HVM guests already use the native pvops in this area, so would
benefit from any improvement.  PV guests on the other hand run with cpl
> 0 and instead have a writeable mask in a piece of shared memory with
Xen, and need the pvop.

~Andrew


[PATCH] x86/cpu: Fix SMAP check in PVOPS environments

2015-06-03 Thread Andrew Cooper
There appears to be no formal statement of what pv_irq_ops.save_fl() is
supposed to return precisely.  Native returns the full flags, while lguest and
Xen only return the Interrupt Flag, and both have comments by the
implementations stating that only the Interrupt Flag is looked at.  This may
have been true when initially implemented, but no longer is.

To make matters worse, the Xen PVOP leaves the upper bits undefined, making
the BUG_ON() undefined behaviour.  Experimentally, this now trips for 32bit PV
guests on Broadwell hardware.  The BUG_ON() is consistent for an individual
build, but not consistent for all builds.  It has also been a sitting timebomb
since SMAP support was introduced.

Use native_save_fl() instead, which will obtain an accurate view of the AC
flag.

Signed-off-by: Andrew Cooper 
Reviewed-by: David Vrabel 
Tested-by: Rusty Russell 
CC: Thomas Gleixner 
CC: Ingo Molnar 
CC: H. Peter Anvin 
CC: x...@kernel.org
CC: linux-kernel@vger.kernel.org
CC: Konrad Rzeszutek Wilk 
CC: Boris Ostrovsky 
CC: xen-devel 
CC: lgu...@lists.ozlabs.org
CC: sta...@vger.kernel.org
---
 arch/x86/kernel/cpu/common.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a62cf04..4f2fded 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -291,10 +291,9 @@ __setup("nosmap", setup_disable_smap);
 
 static __always_inline void setup_smap(struct cpuinfo_x86 *c)
 {
-   unsigned long eflags;
+   unsigned long eflags = native_save_fl();
 
/* This should have been cleared long ago */
-   raw_local_save_flags(eflags);
BUG_ON(eflags & X86_EFLAGS_AC);
 
if (cpu_has(c, X86_FEATURE_SMAP)) {
-- 
1.7.10.4



Re: [PATCH] x86/cpu: Fix SMAP check in PVOPS environments

2015-06-04 Thread Andrew Cooper
On 04/06/15 07:38, H. Peter Anvin wrote:
> On 06/03/2015 02:31 AM, Andrew Cooper wrote:
>> There appears to be no formal statement of what pv_irq_ops.save_fl() is
>> supposed to return precisely.  Native returns the full flags, while lguest 
>> and
>> Xen only return the Interrupt Flag, and both have comments by the
>> implementations stating that only the Interrupt Flag is looked at.  This may
>> have been true when initially implemented, but no longer is.
>>
>> To make matters worse, the Xen PVOP leaves the upper bits undefined, making
>> the BUG_ON() undefined behaviour.  Experimentally, this now trips for 32bit 
>> PV
>> guests on Broadwell hardware.  The BUG_ON() is consistent for an individual
>> build, but not consistent for all builds.  It has also been a sitting 
>> timebomb
>> since SMAP support was introduced.
>>
>> Use native_save_fl() instead, which will obtain an accurate view of the AC
>> flag.
> Could we fix the Xen pvops wrapper instead to not do things like this?
>
>   -hpa
>
>

We could, and I have a patch for that, but the check would still then be
broken in lguest, and it makes a hotpath rather longer.

Either pv_irq_ops.save_fl() gets defined to handle all flags, and Xen &
lguest need correcting in this regard, or save_fl() gets defined to
handle the interrupt flag only, and this becomes the single problematic
caller in the codebase.

The problem with expanding save_fl() to handle all flags is that
restore_fl() should follow suit, and there are a number of system flags
are inapplicable in such a situation.

~Andrew


Re: [Xen-devel] [PATCH 95/98] HACK: fix include/uapi/xen/privcmd.h compilation in userspace

2015-05-30 Thread Andrew Cooper
On 30/05/15 16:39, Mikko Rapeli wrote:
> privcmd.h depends on xen/interface/xen.h which is now exported to userspace.
> xen/interface/xen.h then depends on asm/xen/interface.h which is now
> exported to userspace together with its dependencies asm/xen/interface_32.h,
> asm/xen/interface_64.h and asm/pvclock-abi.h on x86 architecture.
>
> Then all of these headers were fixed to use __u8 etc from linux/types.h
> instead of custom types.
>
> Then define uint64_t and uint32_t if needed.
>
> After all these changes these header files now compile in userspace too
> on x86.
>
> HACK since I have no idea if this is correct way to fix this.
>
> Signed-off-by: Mikko Rapeli 

For historical reasons, xen/interface/xen.h is expected to be a verbatim
copy of the master version kept in the Xen repository.

The only things privcmd.h needs from interface/xen.h are domid_t
(uint16_t), and xen_pfn_t (unsigned long, or uint64_t in arm32 iirc),
while the majority of the rest absolutely shouldn't be in Linux’s uapi.

It would probably be best to provide just these two types and nuke the
"#include ".  Any user of privcmd needs to include
the Xen interface themselves anyway, and will have a different copy
which includes the toolstack half of the interface (missing from the
kernel copy) which is the useful set of definitions for driving the
privcmd ioctl()s.
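
i.e. something along these lines in the uapi header (a sketch of the
suggestion, not a tested patch - note the arm32 width caveat above):

#include <linux/types.h>

typedef __u16 domid_t;
typedef unsigned long xen_pfn_t;    /* would need to be __u64 on arm32 */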

~Andrew


Re: [Xen-devel] [PATCH 2/4] xen/PVH: Use proper CS selector in long mode

2018-05-02 Thread Andrew Cooper
On 02/05/18 16:09, Jan Beulich wrote:
 On 02.05.18 at 17:08,  wrote:
>> On 05/02/2018 11:00 AM, Jan Beulich wrote:
>> On 02.05.18 at 16:57,  wrote:
 On 05/02/2018 04:05 AM, Jan Beulich wrote:
 On 30.04.18 at 18:23,  wrote:
>> Signed-off-by: Boris Ostrovsky 
> Reviewed-by: Jan Beulich 
>
> But to understand why things have been working nevertheless it would
> have been nice if the commit message wasn't empty, but instead said
> something like "The two happen to be identical on 64-bit."
 Why do you think they are identical? __KERNEL_CS points to entry#12
 (which we don't specify in PVH GDT) while __BOOT_CS is the second entry
 (which we do create).
>>> That's 32-bit's __KERNEL_CS. If the two weren't identical, the ljmp
>>> you adjust would never have worked afaict.
>>
>> Oh, right. My theory was that we were picking up something from the
>> stack (which is where the 12th entry would be pointing) and the L bit, which
>> I think is the only one we'd care about, happened to always be set there.
> I don't think the L bit is the only one we care about, as I don't think you
> can load a non-code selector into CS (even if none of the attributes are
> later used for anything).

The type/s/dpl/p/d/l attributes still very much matter even in 64bit.

~Andrew


Re: [Xen-devel] [PATCH] xen/privcmd: fix static checker warning

2018-06-07 Thread Andrew Cooper
On 07/06/18 11:21, Paul Durrant wrote:
> Commit 3ad0876554ca ("xen/privcmd: add IOCTL_PRIVCMD_MMAP_RESOURCE")
> introduced a static checker warning:
>
>   drivers/xen/privcmd.c:827 privcmd_ioctl_mmap_resource()
>   warn: passing casted pointer 'pfns' to 'xen_remap_domain_mfn_array()'
> 64 vs 32.
>
> caused by this cast:
>
>827  num = xen_remap_domain_mfn_array(vma,
>828   kdata.addr & PAGE_MASK,
>829   pfns, kdata.num, (int *)pfns,
>   ^^^
>
> The reason for the cast is that xen_remap_domain_mfn_array() requires an
> array of ints to store error codes. It is actually safe to re-use the
> pfns array for this purpose but it does look odd (as well as leading to
> the warning). It would also be easy for a future implementation change
> to make this re-use unsafe so this patch modifies privcmd to use a
> separately allocated array for error codes.
>
> Reported-by: Dan Carpenter 
> Signed-off-by: Paul Durrant 

It may be safe to reuse pfns[] as the storage space for the errs array,
but the code is incorrect when sizeof(pfn) != sizeof(int).  In such a
case, you skip over every other err, and the second half of pfns[] is
junk from the point of view of the errs loop.
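
A standalone toy illustration of the hazard, assuming 64-bit xen_pfn_t and
32-bit int:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t pfns[4] = { 0 };
	int *errs = (int *)pfns;

	/* A callee writing 4 ints lands them all in pfns[0..1]. */
	for (int i = 0; i < 4; i++)
		errs[i] = -(i + 1);

	/*
	 * A loop reading 64-bit elements sees errs[0] and errs[1] fused
	 * together in pfns[0], while pfns[2..3] were never written as
	 * errors at all.
	 */
	for (int i = 0; i < 4; i++)
		printf("pfns[%d] = %#llx\n", i, (unsigned long long)pfns[i]);

	return 0;
}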

~Andrew

> ---
> Cc: Boris Ostrovsky 
> Cc: Juergen Gross 
> Cc: Dan Carpenter 
> ---
>  drivers/xen/privcmd.c | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 8ae0349..8507c13 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
> @@ -822,11 +822,18 @@ static long privcmd_ioctl_mmap_resource(struct file *file, void __user *udata)
>   unsigned int domid =
>   (xdata.flags & XENMEM_rsrc_acq_caller_owned) ?
>   DOMID_SELF : kdata.dom;
> + int *errs;
>   int num;
>  
> + errs = kcalloc(kdata.num, sizeof(*errs), GFP_KERNEL);
> + if (!errs) {
> + rc = -ENOMEM;
> + goto out;
> + }
> +
>   num = xen_remap_domain_mfn_array(vma,
>kdata.addr & PAGE_MASK,
> -  pfns, kdata.num, (int *)pfns,
> +  pfns, kdata.num, errs,
>vma->vm_page_prot,
>domid,
>vma->vm_private_data);
> @@ -836,12 +843,14 @@ static long privcmd_ioctl_mmap_resource(struct file *file, void __user *udata)
>   unsigned int i;
>  
>   for (i = 0; i < num; i++) {
> - rc = pfns[i];
> + rc = errs[i];
>   if (rc < 0)
>   break;
>   }
>   } else
>   rc = 0;
> +
> + kfree(errs);
>   }
>  
>  out:



Re: [Xen-devel] [PATCH] xen-netback: fix occasional leak of grant ref mappings under memory pressure

2019-02-28 Thread Andrew Cooper
On 28/02/2019 02:03, Igor Druzhinin wrote:
> Zero-copy callback flag is not yet set on frag list skb at the moment
> xenvif_handle_frag_list() returns -ENOMEM. This eventually results in
> leaking grant ref mappings since xenvif_zerocopy_callback() is never
> called for these fragments. Those eventually build up and cause Xen
> to kill Dom0 as the slots get reused for new mappings.

It's worth pointing out that what (debug) Xen notices is dom0 performing
an implicit grant unmap.

~Andrew


Re: [Xen-devel] xen: Can't insert balloon page into VM userspace (WAS Re: [linux-linus bisection] complete test-arm64-arm64-xl-xsm)

2019-03-12 Thread Andrew Cooper
On 12/03/2019 17:18, David Hildenbrand wrote:
> On 12.03.19 18:14, Matthew Wilcox wrote:
>> On Tue, Mar 12, 2019 at 05:05:39PM +, Julien Grall wrote:
>>> On 3/12/19 3:59 PM, Julien Grall wrote:
 It looks like all the arm tests for linus [1] and next [2] trees
 are now failing. x86 seems to be mostly ok.

 The bisector fingered the following commit:

 commit 0ee930e6cafa048c1925893d0ca89918b2814f2c
 Author: Matthew Wilcox 
 Date:   Tue Mar 5 15:46:06 2019 -0800

  mm/memory.c: prevent mapping typed pages to userspace
  Pages which use page_type must never be mapped to userspace as it would
  destroy their page type.  Add an explicit check for this instead of
  assuming that kernel drivers always get this right.
>> Oh good, it found a real problem.
>>
>>> It turns out the problem is because the balloon driver will call
>>> __SetPageOffline() on allocated page. Therefore the page has a type and
>>> vm_insert_pages will deny the insertion.
>>>
>>> My knowledge is quite limited in this area. So I am not sure how we can
>>> solve the problem.
>>>
>> I would appreciate if someone could provide input on how to fix the mapping.
>> I don't know the balloon driver, so I don't know why it was doing this,
>> but what it was doing was Wrong and has been since 2014 with:
>>
>> commit d6d86c0a7f8ddc5b38cf089222cb1d9540762dc2
>> Author: Konstantin Khlebnikov 
>> Date:   Thu Oct 9 15:29:27 2014 -0700
>>
>> mm/balloon_compaction: redesign ballooned pages management
>>
>> If ballooned pages are supposed to be mapped into userspace, you can't mark
>> them as ballooned pages using the mapcount field.
>>
> Asking myself why anybody would want to map balloon inflated pages into
> user space (this just sounds plain wrong but my understanding of what
> the XEN balloon driver does might be limited), but I assume the easy fix
> would be to revert

I suspect the bug here is that the balloon driver is (ab)used for a
second purpose - to create a hole in pfn space to map some other bits of
shared memory into.

I think at the end of the day, what is needed is a struct page_info
which looks like normal RAM, but the backing for which can be altered by
hypercall to map other things.

~Andrew


Re: [Xen-devel] xen/evtchn and forced threaded irq

2019-02-26 Thread Andrew Cooper
On 26/02/2019 09:14, Roger Pau Monné wrote:
> On Mon, Feb 25, 2019 at 01:55:42PM +, Julien Grall wrote:
>> Hi Oleksandr,
>>
>> On 25/02/2019 13:24, Oleksandr Andrushchenko wrote:
>>> On 2/22/19 3:33 PM, Julien Grall wrote:
 Hi,

 On 22/02/2019 12:38, Oleksandr Andrushchenko wrote:
> On 2/20/19 10:46 PM, Julien Grall wrote:
>> Discussing with my team, a solution that came up would be to
>> introduce one atomic field per event to record the number of
>> event received. I will explore that solution tomorrow.
> How will this help if events have some payload?
 What payload? The event channel does not carry any payload. It only
 notifies you that something happened. Then it is up to the user to
 decide what to do with it.
>>> Sorry, I was probably not precise enough. I mean that an event might have
>>> associated payload in the ring buffer, for example [1]. So, counting events
>>> may help somehow, but the ring's data may still be lost
>> From my understanding of event channels are edge interrupts. By definition,
> IMO event channels are active high level interrupts.
>
> Let's take into account the following situation: you have an event
> channel masked and the event channel pending bit (akin to the line on
> bare metal) goes from low to high (0 -> 1), then you unmask the
> interrupt and you get an event injected. If it was an edge interrupt
> you won't get an event injected after unmasking, because you would
> have lost the edge. I think the problem here is that Linux treats
> event channels as edge interrupts, when they are actually level.

Event channels are edge interrupts.  There are several very subtle bugs
to be had by software which treats them as line interrupts.

Most critically, if you fail to ack them, rebind them to a new vcpu, and
reenable interrupts, you don't get a new interrupt notification.  This
was the source of a 4 month bug when XenServer was moving from
classic-xen to PVOps where using irqbalance would cause dom0 to
occasionally lose interrupts.

~Andrew


Re: [RFC PATCH] x86/retpolines: Prevent speculation after RET

2021-02-19 Thread Andrew Cooper
On 19/02/2021 08:15, Peter Zijlstra wrote:
> On Thu, Feb 18, 2021 at 08:11:38PM +0100, Borislav Petkov wrote:
>> On Thu, Feb 18, 2021 at 08:02:31PM +0100, Peter Zijlstra wrote:
>>> On Thu, Feb 18, 2021 at 07:46:39PM +0100, Borislav Petkov wrote:
 Both vendors speculate after a near RET in some way:

 Intel:

 "Unlike near indirect CALL and near indirect JMP, the processor will not
 speculatively execute the next sequential instruction after a near RET
 unless that instruction is also the target of a jump or is a target in a
 branch predictor."
>>> Right, the way I read that means it's not a problem for us here.
>> Look at that other thread: the instruction *after* the RET can be
>> speculatively executed if that instruction is the target of a jump or it
>> is in a branch predictor.
> Right, but that has nothing to do with the RET instruction itself. You
> can speculatively execute any random instruction by training the BTB,
> which is I suppose the entire point of things :-)
>
> So the way I read it is that: RET does not 'leak' speculation, but if
> you target the instruction after RET with any other speculation crud,
> ofcourse you can get it to 'run'.
>
> And until further clarified, I'll stick with that :-)

https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf
Final page, Mitigation G-5

Some parts (anything before Milan, I believe, is what that CPUID rule
translates into) may speculatively execute the instructions sequentially
following a call/jmp indirect or ret instruction.

For Intel, it's just call/jmp instructions.  From SDM Vol2 for CALL (and
similar for JMP)

"Certain situations may lead to the next sequential instruction after a
near indirect CALL being speculatively executed. If software needs to
prevent this (e.g., in order to prevent a speculative execution side
channel), then an LFENCE instruction opcode can be placed after the near
indirect CALL in order to block speculative execution."


In both cases, the reason LFENCE is given is for the CALL case, where
there is sequential architectural execution.  JMP and RET do not have
architectural execution following them, so can use a shorter speculation
blocker.

When compiling with retpoline, all CALL/JMP indirects are removed, other
than within the __x86_indirect_thunk_%reg blocks, and those can be fixed
by hand.  That just leaves RET speculation, which has no following
architectural execution, at which point `ret; int3` is the shortest way
of halting speculation, at half the size of `ret; lfence`.

With a gcc toolchain, it does actually work if you macro 'ret' (and
retl/q) to be .byte 0xc3, 0xcc, but this doesn't work for Clang IAS
which refuses to macro real instructions.
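
For reference, the bodge in question looks something like this in GNU as
syntax (illustrative only; per the above, gas accepts it but Clang IAS
does not):

.macro ret
	.byte 0xc3, 0xcc	/* ret; int3 - halts straight-line speculation */
.endm
.macro retq
	.byte 0xc3, 0xcc
.endm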

What would be massively helpful is if the toolchains could have their
existing ARM straight-line-speculation support hooked up appropriately
so we get some new code gen options on x86, and don't have to resort to
the macro bodges above.

~Andrew


Re: [PATCH] x86/debug: 'Fix' ptrace dr6 output

2021-01-28 Thread Andrew Cooper
On 28/01/2021 18:20, Peter Zijlstra wrote:
> --- a/arch/x86/kernel/hw_breakpoint.c
> +++ b/arch/x86/kernel/hw_breakpoint.c
> @@ -496,9 +496,32 @@ static int hw_breakpoint_handler(struct
>   dr6_p = (unsigned long *)ERR_PTR(args->err);
>   dr6 = *dr6_p;
>  
> - /* If it's a single step, TRAP bits are random */
> - if (dr6 & DR_STEP)
> + /*
> +  * If DR_STEP is set, TRAP bits might also be set, but we must not
> +  * process them since another exception (without DR_STEP) will follow.

:) How lucky are you feeling?

Data breakpoints will in principle merge with DR_STEP (as will all other
#DB's with trap semantics), other than the many errata about breakpoints
not being recognised correctly.

Under VT-x, there is a still-unfixed vmexit microcode bug which loses
all breakpoint information if DR_STEP is set.  %dr6 handling works
fine when #DB isn't intercepted, but then you get to keep the pipeline
livelock vulnerability as a consequence.

Instruction breakpoints on the following instruction will be delivered
as a second #DB, because fault style #DBs are raised at a separate
position in the instruction execution cycle.

~Andrew


Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-31 Thread Andrew Cooper
On 31/01/2021 02:59, Andy Lutomirski wrote:
 diff --git a/arch/x86/include/asm/tlbflush.h 
 b/arch/x86/include/asm/tlbflush.h
 index 8c87a2e0b660..a617dc0a9b06 100644
 --- a/arch/x86/include/asm/tlbflush.h
 +++ b/arch/x86/include/asm/tlbflush.h
 @@ -255,6 +255,50 @@ static inline void arch_tlbbatch_add_mm(struct 
 arch_tlbflush_unmap_batch *batch,

 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);

 +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
 +{
 +   const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
 +_PAGE_SOFTW3 | _PAGE_ACCESSED;
>>> Why is accessed ignored?  Surely clearing the accessed bit needs a
>>> flush if the old PTE is present.
>> I am just following the current scheme in the kernel (x86):
>>
>> int ptep_clear_flush_young(struct vm_area_struct *vma,
>>unsigned long address, pte_t *ptep)
>> {
>> /*
>>  * On x86 CPUs, clearing the accessed bit without a TLB flush
>>  * doesn't cause data corruption. [ It could cause incorrect
>>  * page aging and the (mistaken) reclaim of hot pages, but the
>>  * chance of that should be relatively low. ]
>>  *
> If anyone ever implements the optimization of skipping the flush when
> unmapping a !accessed page, then this will cause data corruption.
>
> If nothing else, this deserves a nice comment in the new code.

Architecturally, access bits get set as part of a pagewalk, and before
the TLB entry is made.  Once an access bit has been set, you know for
certain that a mapping for the address is, or was, present in a TLB
somewhere.

I can't help but feel that this "optimisation" is shooting itself in the
foot.  It does cause incorrect page aging, and TLBs are getting bigger,
so the chance of reclaiming hot pages is getting worse and worse.

~Andrew


Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-31 Thread Andrew Cooper
On 31/01/2021 01:07, Andy Lutomirski wrote:
> Adding Andrew Cooper, who has a distressingly extensive understanding
> of the x86 PTE magic.

Pretty sure it is all learning things the hard way...

> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 632d5a677d3f..b7473d2c9a1f 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct mmu_gather 
>> *tlb,
>> ptent = pte_mkwrite(ptent);
>> }
>> ptep_modify_prot_commit(vma, addr, pte, oldpte, 
>> ptent);
>> -   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> +   if (pte_may_need_flush(oldpte, ptent))
>> +   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);

You're choosing to avoid the flush, based on A/D bits read ahead of the
actual modification of the PTE.

In this example, another thread can write into the range (sets A and D),
and get a suitable TLB entry which goes unflushed while the rest of the
kernel thinks the memory is write-protected and clean.

The only safe way to do this is to use XCHG/etc to modify the PTE, and
base flush calculations on the results.  Atomic operations are ordered
with A/D updates from pagewalks on other CPUs, even on AMD where A
updates are explicitly not ordered with regular memory reads, for
performance reasons.
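
A minimal sketch of that pattern, mirroring what native_ptep_get_and_clear()
already does on SMP x86-64 (illustration, not the actual patch):

static inline pte_t pte_get_and_clear_atomic(pte_t *ptep)
{
	/*
	 * xchg is a locked operation on x86, so it is ordered against A/D
	 * updates performed by pagewalks on other CPUs.  The value it
	 * returns is therefore safe to base flush decisions on.
	 */
	return native_make_pte(xchg(&ptep->pte, 0));
}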

~Andrew


Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-02-01 Thread Andrew Cooper
On 01/02/2021 05:58, Nadav Amit wrote:
>> On Jan 31, 2021, at 4:10 AM, Andrew Cooper  wrote:
>>
>> On 31/01/2021 01:07, Andy Lutomirski wrote:
>>> Adding Andrew Cooper, who has a distressingly extensive understanding
>>> of the x86 PTE magic.
>> Pretty sure it is all learning things the hard way...
>>
>>> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>> index 632d5a677d3f..b7473d2c9a1f 100644
>>>> --- a/mm/mprotect.c
>>>> +++ b/mm/mprotect.c
>>>> @@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct 
>>>> mmu_gather *tlb,
>>>>ptent = pte_mkwrite(ptent);
>>>>}
>>>>ptep_modify_prot_commit(vma, addr, pte, oldpte, 
>>>> ptent);
>>>> -   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>>> +   if (pte_may_need_flush(oldpte, ptent))
>>>> +   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> You're choosing to avoid the flush, based on A/D bits read ahead of the
>> actual modification of the PTE.
>>
>> In this example, another thread can write into the range (sets A and D),
>> and get a suitable TLB entry which goes unflushed while the rest of the
>> kernel thinks the memory is write-protected and clean.
>>
>> The only safe way to do this is to use XCHG/etc to modify the PTE, and
>> base flush calculations on the results.  Atomic operations are ordered
>> with A/D updates from pagewalks on other CPUs, even on AMD where A
>> updates are explicitly not ordered with regular memory reads, for
>> performance reasons.
> Thanks Andrew for the feedback, but I think the patch does it exactly in
> this safe manner that you describe (at least on native x86, but I see a
> similar path elsewhere as well):
>
> oldpte = ptep_modify_prot_start()
> -> __ptep_modify_prot_start()
> -> ptep_get_and_clear
> -> native_ptep_get_and_clear()
> -> xchg()
>
> Note that the xchg() will clear the PTE (i.e., making it non-present), and
> no further updates of A/D are possible until ptep_modify_prot_commit() is
> called.
>
> On non-SMP setups this is not atomic (no xchg), but since we hold the lock,
> we should be safe.
>
> I guess you are right and a pte_may_need_flush() deserves a comment to
> clarify that oldpte must be obtained by an atomic operation to ensure no A/D
> bits are lost (as you say).
>
> Yet, I do not see a correctness problem. Am I missing something?

No(ish) - I failed to spot that path.

But native_ptep_get_and_clear() is broken on !SMP builds.  It needs to
be an XCHG even in that case, to spot A/D updates from prefetch or
shared-virtual-memory DMA.

~Andrew


Re: [RFC][PATCH 2/2] x86: add extra serialization for non-serializing MSRs

2021-02-04 Thread Andrew Cooper
On 05/03/2020 17:47, Dave Hansen wrote:
> Jan Kiszka reported that the x2apic_wrmsr_fence() function uses a
> plain "mfence" while the Intel SDM (10.12.3 MSR Access in x2APIC
> Mode) calls for "mfence;lfence".
>
> Short summary: we have special MSRs that have weaker ordering
> than all the rest.  Add fencing consistent with current SDM
> recommendations.
>
> This is not known to cause any issues in practice, only in
> theory.

So, I accept that Intel have their own reasons for what is written in
the SDM, but "not ordered with stores" is at best misleading.

The x2APIC (and other) MSRs aren't serialising.  That's fine, as is the
fact that the WRMSR to trigger them doesn't have memory operands, and is
therefore not explicitly ordered with other loads and stores.

Consider:
    xor %edi, %edi
    movb (%rdi), %dl
    wrmsr

It is fine for a non-serialising wrmsr here to execute speculative in
terms of internal calculations, but nothing it does can escape the local
core until the movb has fully retired, and is therefore globally visible.

Otherwise, I can send IPIs from non-architectural paths (in this case,
behind a page fault), and causality is broken.

IPIs are (at minimum) a write-like-thing leaving the core, even if they
don't interact with the regular memory path, and their effects cannot
become visible until the effects of older instructions are visible.

What the SDM is trying to say is that this potentially matters for
writes queued in the WC buffers.

If some code is using WC memory, and wants to send an IPI, and wants to
have the remote IPI handler read said data, then yes - there is a
problem - but the problem is the lack of SFENCE required to make the WC
buffer visible in the first place.
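
A sketch of that WC scenario (write_wc() and send_x2apic_ipi() are
hypothetical stand-ins):

	write_wc(buf, data);	/* WC stores can linger in the WC buffers */

	/* SFENCE is what makes the WC data globally visible ... */
	asm volatile("sfence" ::: "memory");

	/* ... so the remote handler triggered by the IPI can read it. */
	send_x2apic_ipi(cpu, vector);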

WC code is already responsible for its own memory ordering, and the
x2APIC IPIs can't execute early even in the absence of architectural
ordering guarantees.

~Andrew


Re: [RFC][PATCH 2/2] x86: add extra serialization for non-serializing MSRs

2021-02-05 Thread Andrew Cooper
On 05/02/2021 10:02, Peter Zijlstra wrote:
> On Thu, Feb 04, 2021 at 04:11:12PM -0800, Andy Lutomirski wrote:
>> I'm wondering if a more mild violation is possible:
>>
>> Initialize *addr = 0.
>>
>> mov $1, (addr)
>> wrmsr
>>
>> remote cpu's IDT vector:
>>
>> mov (addr), %rax
>> %rax == 0!
>>
>> There's no speculative-execution-becoming-visible-even-if-it-doesn't-retire
>> here -- there's just an ordering violation.  For Linux, this would
>> presumably only manifest as a potential deadlock or confusion if the
>> IPI vector code looks at the list of pending work and doesn't find the
>> expected work in it.
>>
>> Dave?  hpa?  What is the SDM trying to tell us?
> [ Big caveat, I've not spoken to any hardware people about this. The
> below is purely my own understanding. ]
>
> This is my interpretation as well. Without the MFENCE+LFENCE there is no
> guarantee the store is out of the store-buffer and the remote load isn't
> guaranteed to observe it.
>
> What I think the SDM is trying to tell us, is that the IPI, even if it
> goes on the same regular coherency fabric as memory transfers, is not
> subject to the regular memory ordering rules.
>
> Normal TSO rules tells us that when:
>
> P1() {
>   x = 1;
>   y = 1;
> }
>
> P2() {
>   r1 = y;
>   r2 = x;
> }
>
> r2 must not be 0 when r1 is 1. Because if we see store to y, we must
> also see store to x. But the IPI thing doesn't behave like a store. The
> (fast) wrmsr isn't even considered a memop.
>
> The thing is, the above ordering does not guarantee we have r2 != 0.
> r2==0 is allowed when r1==0. And that's an entirely sane outcome even if
> we run the instructions like:
>
>   CPU1CPU2
>
> cycle-1   mov $1, ([x])
> cycle-2   mov $1, ([y])
> cycle-3   mov ([y]), rax
> cycle-4   mov ([x]), rbx
>
> There is no guarantee _any_ of the stores will have made it out. And
> that's exactly the issue. The IPI might make it out of the core before
> any of the stores will.
>
> Furthermore, since there is no dependency between:
>
>   mov $1, ([x])
>   wrmsr
>
> The CPU is allowed to reorder the execution and retire the wrmsr before
> the store. Very much like it would for normal non-dependent
> instructions.

Execution, sure (for details which don't escape the core, just like any
other speculative execution).  Retirement, surely not - it is inherently
tied to the program order of things.

Causality would also be broken if the WRMSR retired ahead of the MOV,
and an interrupt were to hit the boundary between them.

> And presumably it is still allowed to do that when we write it like:
>
>   mov $1, ([x])
>   mfence
>   wrmsr
>
> because, mfence only has dependencies to memops and (fast) wrmsr is not
> a memop.
>
> Which then brings us to:
>
>   mov $1, ([x])
>   mfence
>   lfence
>   wrmsr
>
> In this case, the lfence acts like the newly minted ifence (see
> spectre), and will block execution of (any) later instructions until
> completion of all prior instructions. This, and only this ensures the
> wrmsr happens after the mfence, which in turn ensures the store to x is
> globally visible.

I understand that "what the architecture guarantees" differs from "how
parts behave in practice".

And I also understand the reasoning behind declaring that MFENCE;LFENCE
the only architecturally guaranteed way of ensuring all stores are
globally visible before the WRMSR starts.


However, what is missing is an explanation of how it is possible to build
a causality-preserving part where those fences (for plain stores) are
necessary in practice.

That sequence is a large set of pipeline stalls in practice for what
appears to be a problem in theory only.

~Andrew


Re: [PATCH 02/14] x86/hw_breakpoint: Prevent data breakpoints on direct GDT

2020-05-30 Thread Andrew Cooper
On 29/05/2020 22:27, Peter Zijlstra wrote:
> From: Lai Jiangshan 
>
> A data breakpoint on the GDT is terrifying and should be avoided.
> The GDT on CPU entry area is already protected. The direct GDT
> should be also protected, although it is seldom used and only
> used for short time.

While I agree with the sentiment...

>
> Signed-off-by: Lai Jiangshan 
> Signed-off-by: Peter Zijlstra (Intel) 
> Link: https://lkml.kernel.org/r/20200526014221.2119-3-la...@linux.alibaba.com
> ---
>  arch/x86/kernel/hw_breakpoint.c |   30 ++
>  1 file changed, 22 insertions(+), 8 deletions(-)
>
> --- a/arch/x86/kernel/hw_breakpoint.c
> +++ b/arch/x86/kernel/hw_breakpoint.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* Per cpu debug control register value */
>  DEFINE_PER_CPU(unsigned long, cpu_dr7);
> @@ -237,13 +238,26 @@ static inline bool within_area(unsigned
>  }
>  
>  /*
> - * Checks whether the range from addr to end, inclusive, overlaps the CPU
> - * entry area range.
> + * Checks whether the range from addr to end, inclusive, overlaps the fixed
> + * mapped CPU entry area range or other ranges used for CPU entry.
>   */
> -static inline bool within_cpu_entry_area(unsigned long addr, unsigned long 
> end)
> +static inline bool within_cpu_entry(unsigned long addr, unsigned long end)
>  {
> - return within_area(addr, end, CPU_ENTRY_AREA_BASE,
> -CPU_ENTRY_AREA_TOTAL_SIZE);
> + int cpu;
> +
> + /* CPU entry area is always used for CPU entry */
> + if (within_area(addr, end, CPU_ENTRY_AREA_BASE,
> + CPU_ENTRY_AREA_TOTAL_SIZE))
> + return true;
> +
> + for_each_possible_cpu(cpu) {
> + /* The original rw GDT is being used after load_direct_gdt() */
> + if (within_area(addr, end, (unsigned long)get_cpu_gdt_rw(cpu),
> + GDT_SIZE))

... why the O(n) loop over the system?

It is only GDTs which might ever be active on this local CPU(/thread)
which are a problem, because the breakpoint registers are similarly local.

Nothing is going to go wrong if I put a breakpoint on someone else's
live GDT, because they won't interact in the "fun" ways we're trying to
avoid.
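
A sketch of the local-only check being suggested, reusing within_area()
from the patch context (illustrative only):

static inline bool within_local_gdt(unsigned long addr, unsigned long end)
{
	/*
	 * Breakpoint registers are per-CPU, so only a GDT which can be
	 * live on this CPU needs excluding.
	 */
	return within_area(addr, end,
			   (unsigned long)get_cpu_gdt_rw(smp_processor_id()),
			   GDT_SIZE);
}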

~Andrew


Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY

2020-05-20 Thread Andrew Cooper
On 20/05/2020 09:06, Jürgen Groß wrote:
> On 19.05.20 21:44, Andy Lutomirski wrote:
>> On Tue, May 19, 2020 at 11:58 AM Thomas Gleixner 
>> wrote:
>>>
>>> Andy Lutomirski  writes:
 B: Turn this thing around.  Specifically, in the one and only case we
 care about, we know pretty much exactly what context we got this entry
 in: we're running in a schedulable context doing an explicitly
 preemptible hypercall, and we have RIP pointing at a SYSCALL
 instruction (presumably, but we shouldn't bet on it) in the hypercall
 page.  Ideally we would change the Xen PV ABI so the hypercall would
 return something like EAGAIN instead of auto-restarting and we could
 ditch this mess entirely.  But the ABI seems to be set in stone or at
 least in molasses, so how about just:

 idt_entry(exit(regs));
 if (inhcall && need_resched())
    schedule();
>>>
>>> Which brings you into the situation that you call schedule() from the
>>> point where we just moved it out. If we would go there we'd need to
>>> ensure that RCU is watching as well. idtentry_exit() might have it
>>> turned off 
>>
>> I don't think this is possible.  Once you untangle all the wrappers,
>> the call sites are effectively:
>>
>> __this_cpu_write(xen_in_preemptible_hcall, true);
>> CALL_NOSPEC to the hypercall page
>> __this_cpu_write(xen_in_preemptible_hcall, false);
>>
>> I think IF=1 when this happens, but I won't swear to it.  RCU had
>> better be watching.
>
> Preemptible hypercalls are never done with interrupts off. To be more
> precise: they are only ever done during ioctl() processing.
>
> I can add an ASSERT() to xen_preemptible_hcall_begin() if you want.
>
>>
>> As I understand it, the one and only situation Xen wants to handle is
>> that an interrupt gets delivered during the hypercall.  The hypervisor
>> is too clever for its own good and deals with this by rewinding RIP to
>> the beginning of whatever instruction did the hypercall and delivers
>> the interrupt, and we end up in this handler.  So, if this happens,
>> the idea is to not only handle the interrupt but to schedule if
>> scheduling would be useful.
>
> Correct. More precise: the hypercalls in question can last very long
> (up to several seconds) and so they need to be interruptible. As said
> before: the interface how this is done is horrible. :-(

Forget seconds.  DOMCTL_domain_kill gets to ~14 minutes for a 2TB domain.

The reason for the existing logic is to be able voluntarily preempt.

It doesn't need to remain the way it is, but some adequate form of
pre-emption does need to stay.

~Andrew


Re: [PATCH] xen/privcmd: allow fetching resource sizes

2021-01-11 Thread Andrew Cooper
On 11/01/2021 22:09, boris.ostrov...@oracle.com wrote:
> On 1/11/21 10:29 AM, Roger Pau Monne wrote:
>>  
>> +xdata.domid = kdata.dom;
>> +xdata.type = kdata.type;
>> +xdata.id = kdata.id;
>> +
>> +if (!kdata.addr && !kdata.num) {
>
> I think we should not allow only one of them to be zero. If it's only 
> kdata.num then we will end up with pfns array set to ZERO_SIZE_PTR (which is 
> 0x10). We seem to be OK in that we are not derefencing pfns (either in kernel 
> or in hypervisor) if number of frames is zero but IMO we shouldn't be 
> tempting the fate.
>
>
> (And if it's only kdata.addr then we will get a vma but I am not sure it will 
> do what we want.)

Passing addr == 0 without num being 0 is already an error in Xen, and
passing num == 0 without addr being 0 is bogus and will be an error by
the time I'm finished fixing this.

FWIW, the common usecase for non-trivial examples will be:

xenforeignmem_resource_size(domid, type, id, &size);
xenforeignmem_map_resource(domid, type, id, NULL, size, ...);

which translates into:

ioctl(MAP_RESOURCE, NULL, 0) => size
mmap(NULL, size, ...) => ptr
ioctl(MAP_RESOURCE, ptr, size)

from the kernel's point of view, and two hypercalls from Xen's point of
view.  The NULLs above are expected to be the common case for letting
the kernel choose the vma, but ought to be filled in by the time the
second ioctl() occurs.
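
Put together, that userspace pattern might look like the below (hedged:
the struct layout follows privcmd.h, but exactly how the size comes back
for the addr == 0 / num == 0 query is an assumption here, as is the 4k
page size):

	struct privcmd_mmap_resource res = {
		.dom = domid, .type = type, .id = id,
		.addr = 0, .num = 0,		/* query the size only */
	};

	ioctl(fd, IOCTL_PRIVCMD_MMAP_RESOURCE, &res);	/* fills in the size */

	void *ptr = mmap(NULL, res.num * 4096UL, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);

	res.addr = (uintptr_t)ptr;
	ioctl(fd, IOCTL_PRIVCMD_MMAP_RESOURCE, &res);	/* actually map */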

See
https://lore.kernel.org/xen-devel/20200922182444.12350-1-andrew.coop...@citrix.com/T/#u
for all the gory details.

~Andrew


Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts

2020-12-11 Thread Andrew Cooper
On 11/12/2020 21:27, Thomas Gleixner wrote:
> On Fri, Dec 11 2020 at 09:29, boris ostrovsky wrote:
>
>> On 12/11/20 7:37 AM, Thomas Gleixner wrote:
>>> On Fri, Dec 11 2020 at 13:10, Jürgen Groß wrote:
 On 11.12.20 00:20, boris.ostrov...@oracle.com wrote:
> On 12/10/20 2:26 PM, Thomas Gleixner wrote:
>> Change the implementation so that the channel is bound to CPU0 at the XEN
>> level and leave the affinity mask alone. At startup of the interrupt
>> affinity will be assigned out of the affinity mask and the XEN binding 
>> will
>> be updated.
> If that's the case then I wonder whether we need this call at all and 
> instead bind at startup time.
 After some discussion with Thomas on IRC and xen-devel archaeology the
 result is: this will be needed especially for systems running on a
 single vcpu (e.g. small guests), as the .irq_set_affinity() callback
 won't be called in this case when starting the irq.
>> On UP are we not then going to end up with an empty affinity mask? Or
>> are we guaranteed to have it set to 1 by interrupt generic code?
> An UP kernel does not ever look on the affinity mask. The
> chip::irq_set_affinity() callback is not invoked so the mask is
> irrelevant.
>
> A SMP kernel on a UP machine sets CPU0 in the mask so all is good.
>
>> This is actually why I brought this up in the first place --- a
>> potential mismatch between the affinity mask and Xen-specific data
>> (e.g. info->cpu and then protocol-specific data in event channel
>> code). Even if they are re-synchronized later, at startup time (for
>> SMP).
> Which is not a problem either. The affinity mask is only relevant for
> setting the affinity, but it's not relevant for delivery and never can
> be.
>
>> I don't see anything that would cause a problem right now but I worry
>> that this inconsistency may come up at some point.
> As long as the affinity mask becomes not part of the event channel magic
> this should never matter.
>
> Look at it from hardware:
>
> interrupt is affine to CPU0
>
>  CPU0 runs:
>  
>  set_affinity(CPU0 -> CPU1)
> local_irq_disable()
> 
>  --> interrupt is raised in hardware and pending on CPU0
>
> irq hardware is reconfigured to be affine to CPU1
>
> local_irq_enable()
>
>  --> interrupt is handled on CPU0
>
> the next interrupt will be raised on CPU1
>
> So info->cpu which is registered via the hypercall binds the 'hardware
> delivery' and whenever the new affinity is written it is rebound to some
> other CPU and the next interrupt is then raised on this other CPU.
>
> It's not any different from the hardware example at least not as far as
> I understood the code.

Xen's event channels do have a couple of quirks.

Binding an event channel always results in one spurious event being
delivered.  This is to cover notifications which can get lost during the
bidirectional setup, or re-setups in certain configurations.

Binding an interdomain or pirq event channel always defaults to vCPU0. 
There is no way to atomically set the affinity while binding.  I believe
the API predates SMP guest support in Xen, and no one has fixed it up since.

As a consequence, the guest will observe the event raised on vCPU0 as
part of setting up the event, even if it attempts to set a different
affinity immediately afterwards.  A little bit of care needs to be taken
when binding an event channel on vCPUs other than 0, to ensure that the
callback is safe with respect to any remaining state needing initialisation.

Beyond this, there is nothing magic I'm aware of.

We have seen soft lockups before in certain scenarios, simply due to the
quantity of events hitting vCPU0 before irqbalance gets around to
spreading the load.  This is why there is an attempt to round-robin the
userspace event channel affinities by default, but I still don't see why
this would need custom affinity logic itself.

Thanks,

~Andrew


Re: [PATCH v2] xen/privcmd: allow fetching resource sizes

2021-01-12 Thread Andrew Cooper
On 12/01/2021 12:17, Jürgen Groß wrote:
> On 12.01.21 12:53, Roger Pau Monne wrote:
>> Allow issuing an IOCTL_PRIVCMD_MMAP_RESOURCE ioctl with num = 0 and
>> addr = 0 in order to fetch the size of a specific resource.
>>
>> Add a shortcut to the default map resource path, since fetching the
>> size requires no address to be passed in, and thus no VMA to setup.
>>
>> This is missing from the initial implementation, and causes issues
>> when mapping resources that don't have fixed or known sizes.
>>
>> Signed-off-by: Roger Pau Monné 
>> Cc: sta...@vger.kernel.org # >= 4.18
>
> Reviewed-by: Juergen Gross 

Tested-by: Andrew Cooper 


Re: [PATCH 17/21] x86/acpi: Convert indirect jump to retpoline

2021-01-14 Thread Andrew Cooper
On 14/01/2021 19:40, Josh Poimboeuf wrote:
> It's kernel policy to not have (unannotated) indirect jumps because of
> Spectre v2.  This one's probably harmless, but better safe than sorry.
> Convert it to a retpoline.
>
> Cc: "Rafael J. Wysocki" 
> Cc: Len Brown 
> Cc: Pavel Machek 
> Signed-off-by: Josh Poimboeuf 
> ---
>  arch/x86/kernel/acpi/wakeup_64.S | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/acpi/wakeup_64.S 
> b/arch/x86/kernel/acpi/wakeup_64.S
> index 5d3a0b8fd379..0b371580e620 100644
> --- a/arch/x86/kernel/acpi/wakeup_64.S
> +++ b/arch/x86/kernel/acpi/wakeup_64.S
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  # Copyright 2003 Pavel Machek   
> @@ -39,7 +40,7 @@ SYM_FUNC_START(wakeup_long64)
>   movqsaved_rbp, %rbp
>  
>   movqsaved_rip, %rax
> - jmp *%rax
> + JMP_NOSPEC rax
>  SYM_FUNC_END(wakeup_long64)

I suspect this won't work as you intend.

wakeup_long64() still executes on the low mappings, not the high
mappings, so the `jmp __x86_indirect_thunk_rax` under this JMP_NOSPEC
will wander off into the weeds.

This is why none of the startup "jmps from weird contexts onto the high
mappings" have been retpolined-up.

~Andrew


Re: [PATCH 17/21] x86/acpi: Convert indirect jump to retpoline

2021-01-14 Thread Andrew Cooper
On 14/01/2021 23:47, Josh Poimboeuf wrote:
> On Thu, Jan 14, 2021 at 10:59:39PM +0000, Andrew Cooper wrote:
>> On 14/01/2021 19:40, Josh Poimboeuf wrote:
>>> It's kernel policy to not have (unannotated) indirect jumps because of
>>> Spectre v2.  This one's probably harmless, but better safe than sorry.
>>> Convert it to a retpoline.
>>>
>>> Cc: "Rafael J. Wysocki" 
>>> Cc: Len Brown 
>>> Cc: Pavel Machek 
>>> Signed-off-by: Josh Poimboeuf 
>>> ---
>>>  arch/x86/kernel/acpi/wakeup_64.S | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kernel/acpi/wakeup_64.S 
>>> b/arch/x86/kernel/acpi/wakeup_64.S
>>> index 5d3a0b8fd379..0b371580e620 100644
>>> --- a/arch/x86/kernel/acpi/wakeup_64.S
>>> +++ b/arch/x86/kernel/acpi/wakeup_64.S
>>> @@ -7,6 +7,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  # Copyright 2003 Pavel Machek
>>> @@ -39,7 +40,7 @@ SYM_FUNC_START(wakeup_long64)
>>> movqsaved_rbp, %rbp
>>>  
>>> movqsaved_rip, %rax
>>> -   jmp *%rax
>>> +   JMP_NOSPEC rax
>>>  SYM_FUNC_END(wakeup_long64)
>> I suspect this won't work as you intend.
>>
>> wakeup_long64() still executes on the low mappings, not the high
>> mappings, so the `jmp __x86_indirect_thunk_rax` under this JMP_NOSPEC
>> will wander off into the weeds.
>>
>> This is why none of the startup "jmps from weird contexts onto the high
>> mappings" have been retpolined-up.
> D'oh.  Of course it wouldn't be that easy.  I suppose the other two
> retpoline patches (15 and 21) are bogus as well.

If by 21 you mean 19, then most likely, yes.  They're all low=>high
jumps in weird contexts.

~Andrew


Re: Why does glibc use AVX-512?

2021-03-26 Thread Andrew Cooper
On 26/03/2021 19:47, Andy Lutomirski wrote:
> On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer  wrote:
>> * Andy Lutomirski:
>>
> AVX-512 cleared, and programs need to explicitly request enablement.
> This would allow programs to opt into not saving/restoring across
> signals or to save/restore in buffers supplied when the feature is
> enabled.
 Isn't XSAVEOPT already able to handle that?

>>> Yes, but we need a place to put the data, and we need to acknowledge
>>> that, with the current save-everything-on-signal model, the amount of
>>> time and memory used is essentially unbounded.  This isn't great.
>> The size has to have a known upper bound, but the save amount can be
>> dynamic, right?
>>
>> How was the old lazy FPU initialization support for i386 implemented?
>>
 Assuming you can make XSAVEOPT work for you on the kernel side, my
 instincts tell me that we should have markup for RTM, not for AVX-512.
 This way, we could avoid use of the AVX-512 registers and keep using
 VZEROUPPER, without run-time transaction checks, and deal with other
 idiosyncrasies needed for transaction support that users might
 encounter once this feature sees more use.  But the VZEROUPPER vs RTM
 issues is currently stuck in some internal process issue on my end (or
 two, come to think of it), which I hope to untangle next month.
>>> Can you elaborate on the issue?
>> This is the bug:
>>
>>   vzeroupper use in AVX2 multiarch string functions cause HTM aborts
>>   
>>
>> Unfortunately we have a bug (outside of glibc) that makes me wonder if
>> we can actually roll out RTM transaction checks (or any RTM
>> instruction) on a large scale:
>>
>>   x86: Sporadic failures in tst-cpu-features-cpuinfo
>>   
> It's worth noting that recent microcode updates have made RTM
> considerably less likely to actually work on many parts.  It's
> possible you should just disable it. :(

For a variety of errata and speculative security reasons, hypervisors
now have the ability to hide/show the HLE/RTM CPUID bits, independently
of letting TSX actually work or not.

For migration compatibility reasons, you might quite possibly find
yourself in a VM which advertises the HLE/RTM bits but will
unconditionally abort any transaction.

Honestly, if I were you, I'd just leave it to the user to explicitly opt
in if they want transactions.

~Andrew



[PATCH] x86/cpu: Comment Skylake server stepping too

2021-04-09 Thread Andrew Cooper
Further to commit 53375a5a218e ("x86/cpu: Resort and comment Intel
models"), CascadeLake and CooperLake are steppings of Skylake, and make up
the 1st to 3rd generation "Xeon Scalable Processor" line.

Signed-off-by: Andrew Cooper 
---
CC: Peter Zijlstra 
CC: Tony Luck 
CC: x...@kernel.org
CC: linux-kernel@vger.kernel.org

It is left as an exercise to the reader to ponder why the 3rd generation Xeon
Scalable brand is formed of both CooperLake and IceLake parts.
---
 arch/x86/include/asm/intel-family.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/intel-family.h 
b/arch/x86/include/asm/intel-family.h
index b15262f1f645..955b06d6325a 100644
--- a/arch/x86/include/asm/intel-family.h
+++ b/arch/x86/include/asm/intel-family.h
@@ -33,7 +33,7 @@
  * _EX - 4+ socket server parts
  *
  * The #define line may optionally include a comment including platform or core
- * names. An exception is made for kabylake where steppings seem to have gotten
+ * names. An exception is made for skylake/kabylake where steppings seem to have gotten
  * their own names :-(
  */
 
@@ -74,6 +74,8 @@
 #define INTEL_FAM6_SKYLAKE_L	0x4E	/* Sky Lake */
 #define INTEL_FAM6_SKYLAKE	0x5E	/* Sky Lake */
 #define INTEL_FAM6_SKYLAKE_X	0x55	/* Sky Lake */
+/* CASCADELAKE_X   0x55   Sky Lake -- s: 7 */
+/* COOPERLAKE_X    0x55   Sky Lake -- s: 11 */
 
 #define INTEL_FAM6_KABYLAKE_L	0x8E	/* Sky Lake */
 /* AMBERLAKE_L 0x8E   Sky Lake -- s: 9 */
-- 
2.11.0



Re: [PATCH v2] xen/xenbus: make xs_talkv() interruptible

2020-12-17 Thread Andrew Cooper
On 16/12/2020 08:21, Jürgen Groß wrote:
> On 15.12.20 21:59, Andrew Cooper wrote:
>> On 15/12/2020 11:10, Juergen Gross wrote:
>>> In case a process waits for any Xenstore action in the xenbus driver
>>> it should be interruptible by signals.
>>>
>>> Signed-off-by: Juergen Gross 
>>> ---
>>> V2:
>>> - don't special case SIGKILL as libxenstore is handling -EINTR fine
>>> ---
>>>   drivers/xen/xenbus/xenbus_xs.c | 9 -
>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/xen/xenbus/xenbus_xs.c
>>> b/drivers/xen/xenbus/xenbus_xs.c
>>> index 3a06eb699f33..17c8f8a155fd 100644
>>> --- a/drivers/xen/xenbus/xenbus_xs.c
>>> +++ b/drivers/xen/xenbus/xenbus_xs.c
>>> @@ -205,8 +205,15 @@ static bool test_reply(struct xb_req_data *req)
>>>     static void *read_reply(struct xb_req_data *req)
>>>   {
>>> +    int ret;
>>> +
>>>   do {
>>> -    wait_event(req->wq, test_reply(req));
>>> +    ret = wait_event_interruptible(req->wq, test_reply(req));
>>> +
>>> +    if (ret == -ERESTARTSYS && signal_pending(current)) {
>>> +    req->msg.type = XS_ERROR;
>>> +    return ERR_PTR(-EINTR);
>>> +    }
>>
>> So now I can talk fully about the situations which lead to this, I think
>> there is a bit more complexity.
>>
>> It turns out there are a number of issues related to running a Xen
>> system with no xenstored.
>>
>> 1) If a xenstore-write occurs during startup before init-xenstore-domain
>> runs, the former blocks on /dev/xen/xenbus waiting for xenstored to
>> reply, while the latter blocks on /dev/xen/xenbus_backend when trying to
>> tell the dom0 kernel that xenstored is in dom1.  This effectively
>> deadlocks the system.
>
> This should be easy to solve: any request to /dev/xen/xenbus should
> block upfront in case xenstored isn't up yet (could e.g. wait
> interruptible until xenstored_ready is non-zero).

I'm not sure that that would fix the problem.  The problem is that
setting the ring details via /dev/xen/xenbus_backend blocks, which
prevents us launching the xenstored stubdomain, which prevents the
earlier xenbus write being completed.

So long as /dev/xen/xenbus_backend doesn't block, there's no problem
with other /dev/xen/xenbus activity being pending briefly.


Looking at the current logic, I'm not completely convinced.  Even
finding a filled-in evtchn/gfn doesn't mean that xenstored is actually
ready.

There are 3 possible cases.

1) PV guest, and details in start_info
2) HVM guest, and details in HVM_PARAMs
3) No details (expected for dom0).  Something in userspace must provide
details at a later point.

So the setup phases go from nothing, to having ring details, to finding
the ring working.

I think it would be prudent to try reading a key between having details
and declaring the xenstored_ready.  Any activity, even XS_ERROR,
indicates that the other end of the ring is listening.

>
>> 2) If xenstore-watch is running when xenstored dies, it spins at 100%
>> cpu usage making no system calls at all.  This is caused by bad error
>> handling from xs_watch(), and attempting to debug found:
>
> Can you expand on "bad error handling from xs_watch()", please?

do_watch() has

    for ( ... ) { // defaults to an infinite loop
        vec = xs_read_watch();
        if (vec == NULL)
            continue;
        ...
    }


My next plan was to experiment with break instead of continue, which
I'll get to at some point.

>
>>
>> 3) (this issue).  If anyone starts xenstore-watch with no xenstored
>> running at all, it blocks in D in the kernel.
>
> Should be handled with solution for 1).
>
>>
>> The cause is the special handling for watch/unwatch commands which,
>> instead of just queuing up the data for xenstore, explicitly waits for
>> an OK for registering the watch.  This causes a write() system call to
>> block waiting for a non-existent entity to reply.
>>
>> So while this patch does resolve the major usability issue I found (I
>> can't even SIGINT and get my terminal back), I think there are issues.
>>
>> The reason why XS_WATCH/XS_UNWATCH are special cased is because they do
>> require special handling.  The main kernel thread for processing
>> incoming data from xenstored does need to know how to associate each
>> async XS_WATCH_EVENT to the caller who watched the path.
>>
>> Therefore, depending on when this cancellation hits, we might be in any
>> of the following states:

Re: [PATCH 6/7] xen/evtch: use smp barriers for user event ring

2021-02-08 Thread Andrew Cooper
On 06/02/2021 10:49, Juergen Gross wrote:
> The ring buffer for user events is used in the local system only, so
> smp barriers are fine for ensuring consistency.
>
> Reported-by: Andrew Cooper 
> Signed-off-by: Juergen Gross 

These need to be virt_* to not break in UP builds (on non-x86).

~Andrew


Re: [PATCH 6/7] xen/evtch: use smp barriers for user event ring

2021-02-08 Thread Andrew Cooper
On 08/02/2021 09:50, Jan Beulich wrote:
> On 08.02.2021 10:44, Andrew Cooper wrote:
>> On 06/02/2021 10:49, Juergen Gross wrote:
>>> The ring buffer for user events is used in the local system only, so
>>> smp barriers are fine for ensuring consistency.
>>>
>>> Reported-by: Andrew Cooper 
>>> Signed-off-by: Juergen Gross 
>> These need to be virt_* to not break in UP builds (on non-x86).
> Initially I though so, too, but isn't the sole vCPU of such a
> VM getting re-scheduled to a different pCPU in the hypervisor
> an implied barrier anyway?

Yes, but that isn't relevant to why UP builds break.

smp_*() degrade to compiler barriers in UP builds, and while that's
mostly fine for x86 read/write, its not fine for ARM barriers.

virt_*() exist specifically to be smp_*() which don't degrade to broken
in UP builds.

~Andrew


Re: [PATCH 6/7] xen/evtch: use smp barriers for user event ring

2021-02-08 Thread Andrew Cooper
On 08/02/2021 10:25, Jürgen Groß wrote:
> On 08.02.21 11:23, Andrew Cooper wrote:
>> On 08/02/2021 09:50, Jan Beulich wrote:
>>> On 08.02.2021 10:44, Andrew Cooper wrote:
>>>> On 06/02/2021 10:49, Juergen Gross wrote:
>>>>> The ring buffer for user events is used in the local system only, so
>>>>> smp barriers are fine for ensuring consistency.
>>>>>
>>>>> Reported-by: Andrew Cooper 
>>>>> Signed-off-by: Juergen Gross 
>>>> These need to be virt_* to not break in UP builds (on non-x86).
>>> Initially I though so, too, but isn't the sole vCPU of such a
>>> VM getting re-scheduled to a different pCPU in the hypervisor
>>> an implied barrier anyway?
>>
>> Yes, but that isn't relevant to why UP builds break.
>>
>> smp_*() degrade to compiler barriers in UP builds, and while that's
>> mostly fine for x86 read/write, its not fine for ARM barriers.
>>
>> virt_*() exist specifically to be smp_*() which don't degrade to broken
>> in UP builds.
>
> But the barrier is really only necessary to serialize accesses within
> the guest against each other. There is no guest outside party involved.
>
> In case you are right this would mean that UP guests are all broken on
> Arm.

Oh - right.  This is a ring between the interrupt handler and a task. 
Not a ring between the guest and something else.

In which case smp_*() are correct.  Sorry for the noise.

~Andrew


Re: [PATCH 6/7] xen/evtch: use smp barriers for user event ring

2021-02-08 Thread Andrew Cooper
On 08/02/2021 10:36, Jan Beulich wrote:
> On 08.02.2021 11:23, Andrew Cooper wrote:
>> On 08/02/2021 09:50, Jan Beulich wrote:
>>> On 08.02.2021 10:44, Andrew Cooper wrote:
>>>> On 06/02/2021 10:49, Juergen Gross wrote:
>>>>> The ring buffer for user events is used in the local system only, so
>>>>> smp barriers are fine for ensuring consistency.
>>>>>
>>>>> Reported-by: Andrew Cooper 
>>>>> Signed-off-by: Juergen Gross 
>>>> These need to be virt_* to not break in UP builds (on non-x86).
>>> Initially I though so, too, but isn't the sole vCPU of such a
>>> VM getting re-scheduled to a different pCPU in the hypervisor
>>> an implied barrier anyway?
>> Yes, but that isn't relevant to why UP builds break.
>>
>> smp_*() degrade to compiler barriers in UP builds, and while that's
>> mostly fine for x86 read/write, its not fine for ARM barriers.
> Hmm, I may not know enough of Arm's memory model - are you saying
> Arm CPUs aren't even self-coherent, i.e. later operations (e.g.
> the consuming of ring contents) won't observe earlier ones (the
> updating of ring contents) when only a single physical CPU is
> involved in all of this? (I did mention the hypervisor level
> context switch simply because that's the only way multiple CPUs
> can get involved.)

In this case, no - see my later reply.  I'd mistaken the two ends of
this ring.  As they're both inside the same guest, its fine.

For cases such as the xenstore/console ring, the semantics required
really are SMP, even if the guest is built UP.  These cases really will
break when smp_rmb() etc degrade to just a compiler barrier on
architectures with weaker semantics than x86.
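
As a sketch of such an inter-domain consumer (field names per xs_wire.h;
the copy loop is elided):

static void xs_ring_read(struct xenstore_domain_interface *intf,
			 char *dst, XENSTORE_RING_IDX len)
{
	XENSTORE_RING_IDX cons = intf->rsp_cons;

	/*
	 * Order reading the producer/consumer indices against reading the
	 * ring contents.  virt_rmb() stays a real fence even in UP builds.
	 */
	virt_rmb();

	/* ... copy len bytes out, starting at MASK_XENSTORE_IDX(cons) ... */

	/* Ensure the reads complete before the other side sees the space. */
	virt_mb();
	intf->rsp_cons = cons + len;
}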

~Andrew


Re: [PATCH] x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled

2021-01-25 Thread Andrew Cooper
On 25/01/2021 14:00, Juergen Gross wrote:
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index 4409306364dc..82948251f57b 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -583,6 +583,14 @@ DEFINE_IDTENTRY_RAW(xenpv_exc_debug)
>   exc_debug(regs);
>  }
>  
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +DEFINE_IDTENTRY_RAW(xenpv_exc_vmm_communication)
> +{
> + /* This should never happen and there is no way to handle it. */
> + panic("X86_TRAP_VC in Xen PV mode.");

Honestly, exactly the same is true of #VE, #HV and #SX.

What we do in the hypervisor is wire up one handler for all unknown
exceptions (to avoid potential future #DF issues) leading to a panic. 
Wouldn't it be better to do this unconditionally, especially as #GP/#NP
doesn't work for PV guests for unregistered callbacks, rather than
fixing up piecewise like this?

~Andrew


Re: [PATCH v2] x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled

2021-01-27 Thread Andrew Cooper
On 27/01/2021 09:38, Juergen Gross wrote:
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index 4409306364dc..ca5ac10fcbf7 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -583,6 +583,12 @@ DEFINE_IDTENTRY_RAW(xenpv_exc_debug)
>   exc_debug(regs);
>  }
>  
> +DEFINE_IDTENTRY_RAW(exc_xen_unknown_trap)
> +{
> + /* This should never happen and there is no way to handle it. */
> + panic("Unknown trap in Xen PV mode.");

Looks much better.  How about including regs->entry_vector here, just to
short circuit the inevitable swearing which will accompany encountering
this panic() ?

~Andrew


Re: [PATCH v2] x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled

2021-01-27 Thread Andrew Cooper
On 27/01/2021 10:26, Jürgen Groß wrote:
> On 27.01.21 10:43, Andrew Cooper wrote:
>> On 27/01/2021 09:38, Juergen Gross wrote:
>>> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
>>> index 4409306364dc..ca5ac10fcbf7 100644
>>> --- a/arch/x86/xen/enlighten_pv.c
>>> +++ b/arch/x86/xen/enlighten_pv.c
>>> @@ -583,6 +583,12 @@ DEFINE_IDTENTRY_RAW(xenpv_exc_debug)
>>>   exc_debug(regs);
>>>   }
>>>   +DEFINE_IDTENTRY_RAW(exc_xen_unknown_trap)
>>> +{
>>> +    /* This should never happen and there is no way to handle it. */
>>> +    panic("Unknown trap in Xen PV mode.");
>>
>> Looks much better.  How about including regs->entry_vector here, just to
>> short circuit the inevitable swearing which will accompany encountering
>> this panic() ?
>
> You are aware the regs parameter is struct pt_regs *, not the Xen
> struct cpu_user_regs *?

Yes, but I was assuming that they both contained the same information.

>
> So I have no idea how I should get this information without creating
> a per-vector handler.

Oh - that's dull.

Fine then.  Reviewed-by: Andrew Cooper 


Re: [PATCH v2] x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled

2021-01-27 Thread Andrew Cooper
On 27/01/2021 13:59, Jürgen Groß wrote:
> On 27.01.21 12:23, Andrew Cooper wrote:
>> On 27/01/2021 10:26, Jürgen Groß wrote:
>>> On 27.01.21 10:43, Andrew Cooper wrote:
>>>> On 27/01/2021 09:38, Juergen Gross wrote:
>>>>> diff --git a/arch/x86/xen/enlighten_pv.c
>>>>> b/arch/x86/xen/enlighten_pv.c
>>>>> index 4409306364dc..ca5ac10fcbf7 100644
>>>>> --- a/arch/x86/xen/enlighten_pv.c
>>>>> +++ b/arch/x86/xen/enlighten_pv.c
>>>>> @@ -583,6 +583,12 @@ DEFINE_IDTENTRY_RAW(xenpv_exc_debug)
>>>>>    exc_debug(regs);
>>>>>    }
>>>>>    +DEFINE_IDTENTRY_RAW(exc_xen_unknown_trap)
>>>>> +{
>>>>> +    /* This should never happen and there is no way to handle it. */
>>>>> +    panic("Unknown trap in Xen PV mode.");
>>>>
>>>> Looks much better.  How about including regs->entry_vector here,
>>>> just to
>>>> short circuit the inevitable swearing which will accompany
>>>> encountering
>>>> this panic() ?
>>>
>>> You are aware the regs parameter is struct pt_regs *, not the Xen
>>> struct cpu_user_regs *?
>>
>> Yes, but I was assuming that they both contained the same information.
>>
>>>
>>> So I have no idea how I should get this information without creating
>>> a per-vector handler.
>>
>> Oh - that's dull.
>>
>> Fine then.  Reviewed-by: Andrew Cooper 
>>
>
> I think I'll switch the panic() to printk(); BUG(); in order to have
> more diagnostic data. Can I keep your R-b: with that modification?

Yeah.  Sounds good.
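
For reference, that modification would presumably end up looking something
like this (assumed shape; the final commit may differ):

DEFINE_IDTENTRY_RAW(exc_xen_unknown_trap)
{
	/* This should never happen and there is no way to handle it. */
	pr_emerg("Unknown trap in Xen PV mode\n");

	/*
	 * BUG() rather than panic(): the resulting Oops dumps registers,
	 * giving at least some diagnostic data about the trap taken.
	 */
	BUG();
}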

~Andrew


Re: [PATCH v2] xen/xenbus: make xs_talkv() interruptible

2020-12-15 Thread Andrew Cooper
On 15/12/2020 11:10, Juergen Gross wrote:
> In case a process waits for any Xenstore action in the xenbus driver
> it should be interruptible by signals.
>
> Signed-off-by: Juergen Gross 
> ---
> V2:
> - don't special case SIGKILL as libxenstore is handling -EINTR fine
> ---
>  drivers/xen/xenbus/xenbus_xs.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
> index 3a06eb699f33..17c8f8a155fd 100644
> --- a/drivers/xen/xenbus/xenbus_xs.c
> +++ b/drivers/xen/xenbus/xenbus_xs.c
> @@ -205,8 +205,15 @@ static bool test_reply(struct xb_req_data *req)
>  
>  static void *read_reply(struct xb_req_data *req)
>  {
> + int ret;
> +
>   do {
> - wait_event(req->wq, test_reply(req));
> + ret = wait_event_interruptible(req->wq, test_reply(req));
> +
> + if (ret == -ERESTARTSYS && signal_pending(current)) {
> + req->msg.type = XS_ERROR;
> + return ERR_PTR(-EINTR);
> + }

So now I can talk fully about the situations which lead to this, I think
there is a bit more complexity.

It turns out there are a number of issues related to running a Xen
system with no xenstored.

1) If a xenstore-write occurs during startup before init-xenstore-domain
runs, the former blocks on /dev/xen/xenbus waiting for xenstored to
reply, while the latter blocks on /dev/xen/xenbus_backend when trying to
tell the dom0 kernel that xenstored is in dom1.  This effectively
deadlocks the system.

2) If xenstore-watch is running when xenstored dies, it spins at 100%
cpu usage making no system calls at all.  This is caused by bad error
handling from xs_watch(), and attempting to debug found:

3) (this issue).  If anyone starts xenstore-watch with no xenstored
running at all, it blocks in D in the kernel.

The cause is the special handling for watch/unwatch commands which,
instead of just queuing up the data for xenstore, explicitly waits for
an OK for registering the watch.  This causes a write() system call to
block waiting for a non-existent entity to reply.

So while this patch does resolve the major usability issue I found (I
can't even SIGINT and get my terminal back), I think there are still issues.

The reason why XS_WATCH/XS_UNWATCH are special cased is because they do
require special handling.  The main kernel thread for processing
incoming data from xenstored does need to know how to associate each
async XS_WATCH_EVENT to the caller who watched the path.

Therefore, depending on when this cancellation hits, we might be in any
of the following states:

1) the watch is queued in the kernel, but not even sent to xenstored yet
2) the watch is queued in the xenstored ring, but not acted upon
3) the watch is queued in the xenstored ring, and the xenstored has seen
it but not replied yet
4) the watch has been processed, but the XS_WATCH reply hasn't been
received yet
5) the watch has been processed, and the XS_WATCH reply received

State 5 (and, transiently, state 4) is the normal success path when xenstored has
acted upon the request, and the internal kernel infrastructure is set up
appropriately to handle XS_WATCH_EVENTs.

States 1 and 2 can be very common if there is no xenstored (or at least,
it hasn't started up yet).  In reality, there is either no xenstored, or
it is up and running (and for a period of time during system startup,
these cases occur in sequence).

As soon as the XS_WATCH event has been written into the xenstored ring,
it is not safe to cancel.  You've committed to xenstored processing the
request (if it is up).

If xenstored is actually up and running, it's fine and necessary to
block.  The request will be processed in due course (timing subject to
the client and server load).  If xenstored isn't up, blocking isn't ok.

Therefore, I think we need to distinguish "not yet on the ring" from "on
the ring", as our distinction as to whether cancelling is safe, and
ensure we don't queue anything on the ring before we're sure xenstored
has started up.

Does this make sense?

~Andrew
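
A hypothetical sketch of that "not yet on the ring" rule (all names below are invented for illustration and do not match the real xenbus internals): give each request an explicit lifecycle state, updated by whichever thread copies requests onto the ring, and honour a signal only while xenstored cannot yet have seen the message.

/* Illustrative only: assumes a 'state' field is added to xb_req_data. */
enum xb_req_state {
	XB_REQ_QUEUED,		/* on the kernel-internal list; safe to cancel */
	XB_REQ_ON_RING,		/* copied to the xenstored ring; committed */
	XB_REQ_DONE,		/* reply received */
};

static void *read_reply(struct xb_req_data *req)
{
	int ret;

	do {
		ret = wait_event_interruptible(req->wq, test_reply(req));

		/*
		 * Only abandon the request while xenstored cannot have seen
		 * it; once the message is on the ring, we are committed to
		 * waiting for the reply.
		 */
		if (ret == -ERESTARTSYS && signal_pending(current) &&
		    req->state == XB_REQ_QUEUED) {
			req->msg.type = XS_ERROR;
			return ERR_PTR(-EINTR);
		}
	} while (!test_reply(req));

	/* ... reply handling continues as before ... */
}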


Re: [PATCH] x86/cpu: correct values for GDT_ENTRY_INIT

2020-11-26 Thread Andrew Cooper
On 26/11/2020 11:54, Lukas Bulwahn wrote:
> Commit 1e5de18278e6 ("x86: Introduce GDT_ENTRY_INIT()") unintentionally
> transformed a few 0xffff values to 0xfffff (note: five times "f" instead of
> four) as part of the refactoring.

The transformation in that change is correct.

Segment bases are 20 bits wide in x86, but the top nibble is folded into
the middle of the attributes, which is why the transformation also has
xfxx => x0xx for the attributes field.

>
> A quick check with:
>
>   git show 1e5de18278e6 | grep "fffff"
>
> reveals all those 14 occurrences:
>
> 12 in ./arch/x86/kernel/cpu/common.c, and
> 2  in ./arch/x86/include/asm/lguest.h.
>
> The two occurrences in ./arch/x86/include/asm/lguest.h were deleted with
> commit ecda85e70277 ("x86/lguest: Remove lguest support").
> Correct the remaining twelve occurrences in ./arch/x86/kernel/cpu/common.c
> back to the original values in the source code before the refactoring.
>
> Commit 866b556efa12 ("x86/head/64: Install startup GDT") probably simply
> copied the required startup gdt information from
> ./arch/x86/kernel/cpu/common.c to ./arch/x86/kernel/head64.c.
> So, correct those three occurrences in ./arch/x86/kernel/head64.c as well.
>
> As this value is truncated anyway, the object code did not change when
> the mistake was introduced, and does not change with this correction either.
>
> This was discovered with sparse, which warns with:
>
>   warning: cast truncates bits from constant value (fffff becomes ffff)

Does:

diff --git a/arch/x86/include/asm/desc_defs.h
b/arch/x86/include/asm/desc_defs.h
index f7e7099af595..9561f3c66e9e 100644
--- a/arch/x86/include/asm/desc_defs.h
+++ b/arch/x86/include/asm/desc_defs.h
@@ -22,7 +22,7 @@ struct desc_struct {
 
 #define GDT_ENTRY_INIT(flags, base, limit) \
    {   \
-   .limit0 = (u16) (limit),    \
+   .limit0 = (u16) (limit) & 0xffff,   \
    .limit1 = ((limit) >> 16) & 0x0F,   \
    .base0  = (u16) (base), \
    .base1  = ((base) >> 16) & 0xFF,    \

fix the warning?

Changing the limit from 4G to 128M isn't going to be compatible with a
32bit kernel trying to boot :).

~Andrew
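
To make the limit folding concrete, here is a small user-space illustration (plain C for demonstration, not kernel code) of how a 20-bit limit such as 0xfffff is split across the descriptor fields, matching the GDT_ENTRY_INIT() arithmetic quoted above:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t limit  = 0xfffff;              /* 20-bit limit: 4 GiB with 4 KiB granularity */
	uint16_t limit0 = (uint16_t)limit;      /* low 16 bits -> 0xffff */
	uint8_t  limit1 = (limit >> 16) & 0x0F; /* top nibble  -> 0xf, folded into the attribute byte */

	printf("limit0 = 0x%04x, limit1 = 0x%x\n", limit0, limit1);
	return 0;
}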


Re: [PATCH] x86/cpu: correct values for GDT_ENTRY_INIT

2020-11-26 Thread Andrew Cooper
On 26/11/2020 19:15, Andy Lutomirski wrote:
> On Thu, Nov 26, 2020 at 11:07 AM Lukas Bulwahn  
> wrote:
>> On Thu, Nov 26, 2020 at 6:16 PM Andrew Cooper  
>> wrote:
>>> On 26/11/2020 11:54, Lukas Bulwahn wrote:
>>>> Commit 1e5de18278e6 ("x86: Introduce GDT_ENTRY_INIT()") unintentionally
>>>> transformed a few 0xffff values to 0xfffff (note: five times "f" instead of
>>>> four) as part of the refactoring.
>>> The transformation in that change is correct.
>>>
>>> Segment bases are 20 bits wide in x86,

I of course meant segment limits here, rather than bases.

>>> Does:
>>>
>>> diff --git a/arch/x86/include/asm/desc_defs.h
>>> b/arch/x86/include/asm/desc_defs.h
>>> index f7e7099af595..9561f3c66e9e 100644
>>> --- a/arch/x86/include/asm/desc_defs.h
>>> +++ b/arch/x86/include/asm/desc_defs.h
>>> @@ -22,7 +22,7 @@ struct desc_struct {
>>>
>>>  #define GDT_ENTRY_INIT(flags, base, limit) \
>>> {   \
>>> -   .limit0 = (u16) (limit),\
>>> +   .limit0 = (u16) (limit) & 0xffff,   \
>>> .limit1 = ((limit) >> 16) & 0x0F,   \
>>> .base0  = (u16) (base), \
>>> .base1  = ((base) >> 16) & 0xFF,\
>>>
>>> fix the warning?
>>>
>> Thanks, I will try that out, and try compiling a 32-bit kernel as well.
> You should also try comparing the objdump output before and after your
> patch.  objdump -D will produce bizarre output but should work.

Expanding on this a little, if that does indeed fix the sparse warning,
then I'd make an argument for this being a bug in sparse.  Explicitly
casting to u16 is semantically and intentionally identical to & 0xffff.

~Andrew
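
And a trivial user-space check of that cast/mask equivalence, for anyone who wants to convince themselves (again plain C, outside the kernel):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t limit = 0xfffff;

	/* A cast to u16 keeps exactly the low 16 bits, the same as masking. */
	assert((uint16_t)limit == (limit & 0xffff));
	return 0;
}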


Re: [Xen-devel] [PATCH v4 3/5] xen: Put EFI machinery in place

2014-05-19 Thread Andrew Cooper
On 16/05/14 21:41, Daniel Kiper wrote:
> @@ -0,0 +1,374 @@
> +/*
> + * EFI support for Xen.
> + *
> + * Copyright (C) 1999 VA Linux Systems
> + * Copyright (C) 1999 Walt Drummond 
> + * Copyright (C) 1999-2002 Hewlett-Packard Co.
> + *   David Mosberger-Tang 
> + *   Stephane Eranian 
> + * Copyright (C) 2005-2008 Intel Co.
> + *   Fenghua Yu 
> + *   Bibo Mao 
> + *   Chandramouli Narayanan 
> + *   Huang Ying 
> + * Copyright (C) 2011 Novell Co.
> + *   Jan Beulich 
> + * Copyright (C) 2011-2012 Oracle Co.
> + *   Liang Tang 
> + * Copyright (c) 2014 Daniel Kiper, Oracle Corporation
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +
> +#include 
> +
> +#define call (op.u.efi_runtime_call)

I think not.

> +#define DECLARE_CALL(what) \
> + struct xen_platform_op op; \
> + op.cmd = XENPF_efi_runtime_call; \
> + call.function = XEN_EFI_##what; \
> + call.misc = 0

Macros like this, which create local variables whose names are implicit
at the call site, are evil when reading code.

If you want to do initialisation like this, then at least do something like:

#define INIT_EFI_CALL(what) \
{ .cmd = XENPF_efi_runtime_call, \
   .u.efi_runtime_call.function = XEN_EFI_##what, \
   .u.efi_runtime_call.misc = 0 }

And use it as:

struct xen_platform_op op = INIT_EFI_CALL(foo);

~Andrew


Re: [Xen-devel] [PATCH] x86/xen: Fix 32-bit PV guests's usage of kernel_stack

2014-04-09 Thread Andrew Cooper
On 09/04/14 15:29, David Vrabel wrote:
> On 09/04/14 15:21, Jan Beulich wrote:
> On 09.04.14 at 16:06,  wrote:
>>> --- a/arch/x86/xen/xen-asm_32.S
>>> +++ b/arch/x86/xen/xen-asm_32.S
>>> @@ -88,7 +88,11 @@ ENTRY(xen_iret)
>>>  * avoid having to reload %fs
>>>  */
>>>  #ifdef CONFIG_SMP
>>> +   pushw %fs
>>> +   movl $(__KERNEL_PERCPU), %eax
>>> +   movl %eax, %fs
>>> GET_THREAD_INFO(%eax)
>>> +   popw %fs
>> I don't think it's guaranteed that this can't fault.
> If loading %fs faults when it is restored previously, the fixup zeros
> the value.  However, this later load could still fault even if the first
> succeeded.
>
> Suggest copying the fixup section from the RESTORE_REGS macros in
> arch/x86/kernel/entry_32.S
>
> David

If loading __KERNEL_PERCPU into %fs faults, the kernel has bigger
problems to worry about.

The latter load, however, can easily fault; the arguments for %ds in
XSA-42 / CVE-2013-0228 apply to %{e,f,g}s as well.

Furthermore, I am a little concerned about the performance impact of
this.  I would have thought that in most cases, %fs will already be
correct, at which point reloading it twice is a waste of time.

~Andrew


Re: [Xen-devel] [PATCH v3 3/5] x86: Call efi_memblock_x86_reserve_range() on native EFI platform only

2014-03-26 Thread Andrew Cooper
On 26/03/2014 22:01, Daniel Kiper wrote:
> On Wed, Mar 26, 2014 at 01:57:23PM +, Matt Fleming wrote:
>> On Wed, 26 Mar, at 02:48:45PM, Daniel Kiper wrote:
>>> On my machine this function crashes on Xen so that is why I have changed
>>> condition. However, if you say that this issue could be solved in
>>> another way I will investigate it further.
>> Daniel, could you paste the crash? Do you get a stack trace?
> Here it is:
>
> [...]
>
> mapping kernel into physical memory
> about to get started...
> (XEN) traps.c:458:d0v0 Unhandled divide error fault/trap [#0] on VCPU 0 
> [ec=]
> (XEN) domain_crash_sync called from entry.S: fault at ffff82d080229d30 
> int80_direct_trap+0x200/0x210
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> (XEN) [ Xen-4.5-unstable  x86_64  debug=y  Tainted:C ]
> (XEN) CPU:0
> (XEN) RIP:e033:[]
> (XEN) RFLAGS: 0246   EM: 1   CONTEXT: pv guest
> (XEN) rax:    rbx: 0100   rcx: 
> (XEN) rdx:    rsi: 814d7e7e   rdi: 
> (XEN) rbp: 81601e88   rsp: 81601e48   r8:  
> (XEN) r9:  7ff0   r10: deadbeef   r11: 81601df8
> (XEN) r12: 816fe010   r13: 81601f00   r14: 
> (XEN) r15:    cr0: 80050033   cr4: 26f0
> (XEN) cr3: 7b60d000   cr2: 
> (XEN) ds:    es:    fs:    gs:    ss: e02b   cs: e033
> (XEN) Guest stack trace from rsp=81601e48:
> (XEN) 81601df8 816987c5 0001e030
> (XEN)00010046 81601e88 e02b 81601f00
> (XEN)81601ed8 8168a1a4 81601ee8 81601ea8
> (XEN)880001e16490  816fe010 
> (XEN)  81601f28 81685a3e
> (XEN)   
> (XEN)880001e16000   
> (XEN)81601f38 816854c8 81601ff8 81687442
> (XEN)0001 08000623 0789cbf580802001 03010032
> (XEN)0005   
> (XEN)   
> (XEN)   
> (XEN)   
> (XEN)   0f0060c0c748
> (XEN)c305   
> (XEN)   
> (XEN)   
> (XEN)   
> (XEN)   
> (XEN)   
> (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
>
> Some addresses resolve to:
> 0xffffffff816987c5: efi_memblock_x86_reserve_range at arch/x86/platform/efi/efi.c:393
> 0xffffffff8168a1a4: setup_arch at arch/x86/kernel/setup.c:940
> 0xffffffff81685a3e: setup_command_line at init/main.c:353
> 0xffffffff816854c8: x86_64_start_reservations at arch/x86/kernel/head64.c:194
> 0xffffffff81687442: xen_start_kernel at arch/x86/xen/enlighten.c:1733
>
> I am using Linus tree with latest commit 
> b098d6726bbfb94c06d6e1097466187afddae61f
> (Linux 3.14-rc8) with my patches applied excluding patch 3.
>
> Daniel

Then all you need to do is look up ffffffff816987c5 in your Linux
symbols, and whichever variable is being divided on that line of source
either has the value 0 or -1.

~Andrew


Re: [Xen-devel] [PATCH v3 4/7] xen/pciback: Implement PCI reset slot or bus with 'do_flr' SysFS attribute

2014-07-08 Thread Andrew Cooper
On 08/07/14 19:58, kon...@kernel.org wrote:
> From: Konrad Rzeszutek Wilk 
>
> The life-cycle of a PCI device in Xen pciback is complex
> and is constrained by the PCI generic locking mechanism.
>
> It starts with the device being binded to us - for which

"being bound to us"

~Andrew


Re: [Xen-devel] [PATCH v3 1/7] xen-pciback: Document the various parameters and attributes in SysFS

2014-07-08 Thread Andrew Cooper
On 08/07/14 19:58, kon...@kernel.org wrote:
> From: Konrad Rzeszutek Wilk 
>
> Which hadn't been done with the initial commit.
>
> Signed-off-by: Konrad Rzeszutek Wilk 
> ---
>  Documentation/ABI/testing/sysfs-driver-pciback |   84 
> 
>  1 files changed, 84 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-driver-pciback
>
> diff --git a/Documentation/ABI/testing/sysfs-driver-pciback 
> b/Documentation/ABI/testing/sysfs-driver-pciback
> new file mode 100644
> index 000..4d2860c
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-driver-pciback
> @@ -0,0 +1,84 @@
> +What:   /sys//module/xen_pciback/parameters/verbose_request
>

s_//_/_ ?

~Andrew


Re: [Xen-devel] [PATCH v3 1/7] xen-pciback: Document the various parameters and attributes in SysFS

2014-07-09 Thread Andrew Cooper
On 09/07/14 14:59, Konrad Rzeszutek Wilk wrote:
>
>>> +What:   /sys/bus/pci/drivers/pciback/irq_handler_state
>>> +Date:   Oct 2011
>>> +KernelVersion:  3.1
>>> +Contact:xen-de...@lists.xenproject.org
>>> +Description:
>>> +An option to toggle Xen PCI back to acknowledge (or stop)
>>> +interrupts for the specific device regardless of whether 
>>> the
>>> +device is shared, enabled, or on a level interrupt line.
>>> +Writing a string of DDDD:BB:DD.F will toggle the state.
>>> +This is Domain:Bus:Device.Function where domain is 
>>> optional.
>> I do not understand under what circumstances this should be used in.
> So that dom0 does not disable the IRQ line as it would be getting the IRQs
> for the guest as well (because the IRQ line is level and another guest
> uses a PCI device that is using the same line).

Why is this relevant?  Xen (and Xen alone) actually controls this aspect
of interrupts.  Xen manages passing line level interrupts to any domain
which might have a device hanging off a particular line, and has to wait
until all domains have EOI'd the line before it can clear the interrupt
at the IO-APIC.

~Andrew


Re: [Xen-devel] [PATCH v3 1/7] xen-pciback: Document the various parameters and attributes in SysFS

2014-07-09 Thread Andrew Cooper
On 09/07/14 15:13, Konrad Rzeszutek Wilk wrote:
> On Wed, Jul 09, 2014 at 03:05:56PM +0100, Andrew Cooper wrote:
>> On 09/07/14 14:59, Konrad Rzeszutek Wilk wrote:
>>>>> +What:   /sys/bus/pci/drivers/pciback/irq_handler_state
>>>>> +Date:   Oct 2011
>>>>> +KernelVersion:  3.1
>>>>> +Contact:xen-de...@lists.xenproject.org
>>>>> +Description:
>>>>> +An option to toggle Xen PCI back to acknowledge (or stop)
>>>>> +interrupts for the specific device regardless of whether 
>>>>> the
>>>>> +device is shared, enabled, or on a level interrupt line.
>>>>> +Writing a string of DDDD:BB:DD.F will toggle the state.
>>>>> +This is Domain:Bus:Device.Function where domain is 
>>>>> optional.
>>>> I do not understand under what circumstances this should be used in.
>>> So that dom0 does not disable the IRQ line as it would be getting the IRQs
>>> for the guest as well (because the IRQ line is level and another guest
>>> uses a PCI device that is using the same line).
>> Why is this relevant?  Xen (and Xen alone) actually controls this aspect
>> of interrupts.  Xen manages passing line level interrupts to any domain
>> which might have a device hanging off a particular line, and has to wait
>> until all domains have EOI'd the line before it can clear the interrupt
>> at the IO-APIC.
> Because Linux will think there is an IRQ storm as the event->IRQ points
> to the default one. And then it will mask the event, which means dom0
> will mask the PIRQ, and Xen will then also mask the IRQ.

Xen will (and by this I mean 'should', and this was the behaviour last
time I delved in there) only mask the IRQ if dom0 is the only consumer
of these interrupts.

For any PCIPassthrough, dom0 will get line interrupts for passed-through
devices, but in this case pci-back should always handle the line
interrupts so Linux doesn't block them as an IRQ storm.

~Andrew

