[Xen-devel] [BUG] assertion failure in do_grant_table_op()

2017-11-28 Thread Jann Horn
A fuzzer based on AFL and TriforceAFL discovered an assertion
violation in Xen 4.9.1.

The issue is that, when `opaque_out` is non-zero,
do_grant_table_op() assumes that the hypercall was preempted and
generates a continuation. However, `opaque_out` also ends up
non-zero if the guest calls GNTTABOP_cache_flush with
`opaque_in != 0` and `count == 0`, in which case there is no more
work to do.
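
For context, the relevant control flow, paraphrased from grant_table.c
(a simplified sketch, not the verbatim Xen 4.9 source):

=
long do_grant_table_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) uop,
                       unsigned int count)
{
    long rc;
    /* the upper bits of cmd carry the continuation cursor */
    unsigned int opaque_in = cmd & GNTTABOP_ARG_MASK, opaque_out = 0;

    switch ( cmd &= GNTTABOP_CMD_MASK )
    {
    /* [...] */
    case GNTTABOP_cache_flush:
        rc = gnttab_cache_flush(/* ... */, &opaque_in, count);
        /* with count == 0 there is nothing to do and rc stays 0, but a
         * guest-chosen nonzero opaque_in still propagates */
        opaque_out = opaque_in;
        break;
    /* [...] */
    }

 out:
    if ( rc > 0 || opaque_out != 0 )
    {
        /* assumes the hypercall was preempted; violated when
         * rc == count == 0 and opaque_out != 0 */
        ASSERT(rc < count);
        rc = hypercall_create_continuation(__HYPERVISOR_grant_table_op,
                                           "ihi", opaque_out | cmd, uop,
                                           count - rc);
    }

    return rc;
}
=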

In release builds, this is not an issue: a guest that performs such
a nonsensical hypercall merely spins in an endless loop of re-issued
hypercalls, which the guest can detect as a soft kernel lockup. This
does not interfere with the normal operation of the hypervisor, and
if the guest kernel supports preemption, it does not even interfere
with other tasks running in the guest.


Reproducer:



root@pv-guest:~/borkmod2# cat borker.c
#include <linux/kernel.h>
#include <linux/module.h>

static int __init init_mod(void) {
  asm volatile (
"mov $20, %%rax\n\t" /*__HYPERVISOR_grant_table_op*/
"mov $0x800c, %%rdi\n\t" /*GNTTABOP_cache_flush|0x8000*/
"mov $0, %%rsi\n\t"
"mov $0, %%rdx\n\t"
"syscall\n\t"
  : //out
  : //in
  : //clobber
"cc","memory","rax","rdi","rsi","rdx","rcx","r11"
  );
  return -EINVAL;
}

module_init(init_mod);
root@pv-guest:~/borkmod2# cat Makefile
obj-m := borker.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)

all:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean

root@pv-guest:~/borkmod2# make
make -C /lib/modules/4.9.0-4-amd64/build M=/root/borkmod2 modules
make[1]: Entering directory '/usr/src/linux-headers-4.9.0-4-amd64'
  Building modules, stage 2.
  MODPOST 1 modules
make[1]: Leaving directory '/usr/src/linux-headers-4.9.0-4-amd64'
root@pv-guest:~/borkmod2# insmod borker.ko



Resulting panic on a debug build:



(XEN) Assertion 'rc < count' failed at grant_table.c:3273
(XEN) [ Xen-4.9.1  x86_64  debug=y   Not tainted ]
(XEN) CPU:0
(XEN) RIP:e008:[] do_grant_table_op+0x1e2c/0x2272
(XEN) RFLAGS: 00010246   CONTEXT: hypervisor (d1v0)
(XEN) rax:    rbx: 8300bfc57f18   rcx: 82d080378680
(XEN) rdx: 07ff   rsi:    rdi: 000c
(XEN) rbp: 8300bfc57e68   rsp: 8300bfc57d88   r8:  
(XEN) r9:  deadbeefdeadf00d   r10: 7ff0   r11: 0246
(XEN) r12: 8000   r13: 0014   r14: 
(XEN) r15: 000c   cr0: 80050033   cr4: 001506e4
(XEN) cr3: 00012282d000   cr2: 880014786918
(XEN) fsb: 7fd847e48700   gsb: 880018c0   gss: 
(XEN) ds:    es:    fs:    gs:    ss: e010   cs: e008
(XEN) Xen code around  (do_grant_table_op+0x1e2c/0x2272):
(XEN)  ff ff ff e9 3c f1 ff ff <0f> 0b 0f 0b 48 c7 c0 ea ff ff ff e9 75 f1 ff ff
(XEN) Xen stack trace from rsp=8300bfc57d88:
(XEN)8300bfc57e68 82d08028fb8f 006e 880018c0b8e0
(XEN)000d 81059d42 e033 00011002
(XEN)c9004029fd70 8300bfc57de8 82d0 8058fdd8
(XEN) 7ff0ffea 880018c0c160 0002
(XEN)81059d40 33d8 00011002 000f
(XEN)00122831 880018c182a8 00011002 8300bfc57f18
(XEN)8300bfc22000 0014 82d08021396f deadbeefdeadf00d
(XEN)8300bfc57f08 82d08035d14f 0300 800c
(XEN)  deadbeefdeadf00d deadbeefdeadf00d
(XEN)   
(XEN) 0001  8300bfc22000
(XEN)88001417b000 880013d46300 c007 c0070050
(XEN)7cff403a80c7 82d080360ff6 880013a15100 880013a156b8
(XEN)880013a15100 880018c18d10 c0096000 
(XEN)0246 7ff0 0013 0001f958
(XEN)0014 c009601e  
(XEN)800c 00010100 c009601e e033
(XEN)0246 c90040817cd8 e02b 0009dfa3
(XEN)003bb8b80001 003bbd140009dfa3 0009dfb0 003bbcec
(XEN)8300bfc22000  001506e4
(XEN) Xen call trace:
(XEN)[] do_grant_table_op+0x1e2c/0x2272
(XEN)[] pv_hypercall+0x150/0x460
(XEN)[] entry.o#test_all_events+0/0x30
(XEN)
(XEN)
(XEN) 
(XEN) Panic on CPU 0:
(XEN) Assertion 'rc < count' failed at grant_table.c:3273
(XEN) 
(XEN)
(XEN) Reboot in five seconds...


[Xen-devel] [BUG] incorrect goto in gnttab_setup_table overdecrements the preemption counter

2017-11-29 Thread Jann Horn
gnttab_setup_table() has the following code:

=
static long
gnttab_setup_table(
XEN_GUEST_HANDLE_PARAM(gnttab_setup_table_t) uop, unsigned int count)
{
struct gnttab_setup_table op;
struct domain *d;
struct grant_table *gt;
int            i;
xen_pfn_t      gmfn;

[...]

d = rcu_lock_domain_by_any_id(op.dom);
if ( d == NULL )
{
gdprintk(XENLOG_INFO, "Bad domid %d.\n", op.dom);
op.status = GNTST_bad_domain;
goto out2;
}

[...]
 out2:
rcu_unlock_domain(d);
 out1:
if ( unlikely(__copy_field_to_guest(uop, &op, status)) )
return -EFAULT;

return 0;
}
=

In the case where a bad domain ID is supplied (`d == NULL`),
`rcu_unlock_domain(d)` is still executed even though no domain was
locked. `rcu_unlock_domain()` is defined as follows:

=
static inline void rcu_unlock_domain(struct domain *d)
{
if ( d != current->domain )
rcu_read_unlock(d);
}
#define rcu_read_unlock(x) ({ ((void)(x)); preempt_enable(); })
#define preempt_enable() do {   \
barrier();  \
preempt_count()--;  \
} while (0)
=

This means that the preemption counter is decremented without a
corresponding increment, leaving it at UINT_MAX. In debug builds, this
causes an assertion failure in ASSERT_NOT_IN_ATOMIC on hypercall
return; however, in release builds, it's harmless because the
preemption counter isn't used anywhere outside preempt_enable() and
preempt_disable(), which only increment and decrement it. (There are
some uses of in_atomic() in hvm.c, but those have been disabled with
"#if 0" since at least Xen 4.5.5, the oldest supported release.)


The following code can be used in a 64-bit PV guest to reproduce the bug:

=
root@pv-guest:~/borkmod4# cat borker.c
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <xen/xen.h>
#include <xen/interface/xen.h>
#include <xen/interface/grant_table.h>
#include <asm/xen/hypercall.h>

static int __init init_mod(void) {
  struct gnttab_setup_table args = {
.dom = 0xeeee /* invalid domain ID */
  };
  HYPERVISOR_grant_table_op(GNTTABOP_setup_table, &args, 1);

  return -EINVAL;
}

module_init(init_mod);
MODULE_LICENSE("GPL v2");
root@pv-guest:~/borkmod4# cat Makefile
obj-m := borker.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)

all:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean

root@pv-guest:~/borkmod4# make
make -C /lib/modules/4.9.0-4-amd64/build M=/root/borkmod4 modules
make[1]: Entering directory '/usr/src/linux-headers-4.9.0-4-amd64'
  Building modules, stage 2.
  MODPOST 1 modules
make[1]: Leaving directory '/usr/src/linux-headers-4.9.0-4-amd64'
root@pv-guest:~/borkmod4# insmod borker.ko
=

This results in the following crash in a debug build of Xen 4.9.1:

=
(XEN) grant_table.c:1646:d1v0 Bad domid 61166.
(XEN) Assertion '!preempt_count()' failed at preempt.c:36
(XEN) [ Xen-4.9.1  x86_64  debug=y   Not tainted ]
(XEN) CPU:0
(XEN) RIP:e008:[] ASSERT_NOT_IN_ATOMIC+0x46/0x4c
(XEN) RFLAGS: 00010286   CONTEXT: hypervisor (d1v0)
(XEN) rax: 82d08058ed28   rbx: 8300bfc22000   rcx: 
(XEN) rdx:    rsi: 000a   rdi: c90040c07cc0
(XEN) rbp: 8300bfc57f08   rsp: 8300bfc57f08   r8:  83022d3bc000
(XEN) r9:  0004   r10: 0004   r11: 0005
(XEN) r12: 88001512d780   r13: 880014c98980   r14: c006b000
(XEN) r15: c006b050   cr0: 80050033   cr4: 001506e4
(XEN) cr3: 05e51000   cr2: 880005587058
(XEN) fsb: 7f471797b700   gsb: 880018c0   gss: 
(XEN) ds:    es:    fs:    gs:    ss: e010   cs: e008
(XEN) Xen code around  (ASSERT_NOT_IN_ATOMIC+0x46/0x4c):
(XEN)  58 f6 c4 02 74 06 5d c3 <0f> 0b 0f 0b 0f 0b 55 48 89 e5 48 8d 05 e4 a5 36
(XEN) Xen stack trace from rsp=8300bfc57f08:
(XEN)7cff403a80c7 82d080360ffc 880013a17100 880013a176b8
(XEN)880013a17100 880018c18d10 c0096000 
(XEN)0246 7ff0 003a 0001f958
(XEN) 8100128a deadbeefdeadf00d deadbeefdeadf00d
(XEN)deadbeefdeadf00d 00010100 8100128a e033
(XEN)0246 c90040c07ca0 e02b 0009dfa3
(XEN)003bb8b80001 003bbd140009dfa3 0009dfb0 003bbcec
(XEN)8300bfc22000  001506e4
(XEN) Xen call trace:
(XEN)[] ASSERT_NOT_IN_ATOMIC+0x46/0x4c
(XEN)[] entry.o#test_all_events+0x6/0x30
(XEN)
(XEN)
(XEN) 
(XEN) Panic on CPU 0:
(XEN) Assertion '!preem

Re: [Xen-devel] [BUG] incorrect goto in gnttab_setup_table overdecrements the preemption counter

2017-11-29 Thread Jann Horn
On Wed, Nov 29, 2017 at 3:32 PM, Andrew Cooper wrote:
> On 29/11/17 14:23, Jann Horn wrote:
>> gnttab_setup_table() has the following code:
>>
>> =
>> static long
>> gnttab_setup_table(
>> XEN_GUEST_HANDLE_PARAM(gnttab_setup_table_t) uop, unsigned int count)
>> {
>> struct gnttab_setup_table op;
>> struct domain *d;
>> struct grant_table *gt;
>> int            i;
>> xen_pfn_t      gmfn;
>>
>> [...]
>>
>> d = rcu_lock_domain_by_any_id(op.dom);
>> if ( d == NULL )
>> {
>> gdprintk(XENLOG_INFO, "Bad domid %d.\n", op.dom);
>> op.status = GNTST_bad_domain;
>> goto out2;
>> }
>>
>> [...]
>>  out2:
>> rcu_unlock_domain(d);
>>  out1:
>> if ( unlikely(__copy_field_to_guest(uop, &op, status)) )
>> return -EFAULT;
>>
>> return 0;
>> }
>> =
>> 
>>
>> This results in the following crash in a debug build of Xen 4.9.1:
>
> Thanks for the report.
>
> This was fixed in master by
> http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=5e436e7a45082ea2cadc176c19e1df46c178448f
> but it looks like it's not been backported to older releases.

Urgh. I guess I really ought to fuzz master, not releases.


Re: [Xen-devel] [PATCH 1/3] treewide: Lift switch variables out of switches

2019-01-23 Thread Jann Horn
On Wed, Jan 23, 2019 at 1:04 PM Greg KH wrote:
> On Wed, Jan 23, 2019 at 03:03:47AM -0800, Kees Cook wrote:
> > Variables declared in a switch statement before any case statements
> > cannot be initialized, so move all instances out of the switches.
> > After this, future always-initialized stack variables will work
> > and not throw warnings like this:
> >
> > fs/fcntl.c: In function ‘send_sigio_to_task’:
> > fs/fcntl.c:738:13: warning: statement will never be executed 
> > [-Wswitch-unreachable]
> >siginfo_t si;
> >  ^~
>
> That's a pain, so this means we can't have any new variables in { }
> scope except for at the top of a function?

AFAICS this only applies to switch statements (because they jump
straight to a case label and don't execute anything at the start of the
block), not to the blocks following if/while/etc.
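
A minimal illustration of why (hypothetical code; `expensive_init()` is
made up for the example):

=
int expensive_init(void);

int lookup(int key)
{
    switch (key) {
        int cached = expensive_init(); /* in scope below, but the switch
                                        * jumps straight to a case label,
                                        * so the initializer never runs */
    case 1:
        return cached;                 /* reads an indeterminate value */
    default:
        return -1;
    }
}
=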

> That's going to be a hard thing to keep from happening over time, as
> this is valid C :(


Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

2025-02-10 Thread Jann Horn
On Mon, Feb 10, 2025 at 7:36 PM Valentin Schneider wrote:
> What if isolated CPUs unconditionally did a TLBi as late as possible in
> the stack right before returning to userspace? This would mean that upon
> re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel
> range translation - with the exception of whatever lies between the
> last-minute flush and the actual userspace entry, which should be feasible
> to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI
> could be entirely silenced if it targets an isolated CPU.

Two issues with that:

1. I think the "Common not Private" feature Will Deacon referred to is
incompatible with this idea: the Arm documentation
says "When the CnP bit is set, the software promises to use the ASIDs
and VMIDs in the same way on all processors, which allows the TLB
entries that are created by one processor to be used by another"

2. It's wrong to assume that TLB entries are only populated for
addresses you access - thanks to speculative execution, you have to
assume that the CPU might be populating random TLB entries all over
the place.



Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

2025-01-14 Thread Jann Horn
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider wrote:
> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> flush_tlb_kernel_range() IPIs.
>
> Given that CPUs executing in userspace do not access data in the vmalloc
> range, these IPIs could be deferred until their next kernel entry.
>
> Deferral vs early entry danger zone
> ===
>
> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> and then accessed in early entry code.

In other words, it needs a guarantee that no vmalloc allocations that
have been created in the vmalloc region while the CPU was idle can
then be accessed during early entry, right?



Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

2025-01-17 Thread Jann Horn
On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider wrote:
> On 14/01/25 19:16, Jann Horn wrote:
> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider wrote:
> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> flush_tlb_kernel_range() IPIs.
> >>
> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> range, these IPIs could be deferred until their next kernel entry.
> >>
> >> Deferral vs early entry danger zone
> >> ===
> >>
> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> and then accessed in early entry code.
> >
> > In other words, it needs a guarantee that no vmalloc allocations that
> > have been created in the vmalloc region while the CPU was idle can
> > then be accessed during early entry, right?
>
> I'm not sure if that would be a problem (not an mm expert, please do
> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> deferred anyway.

flush_cache_vmap() is about stuff like flushing data caches on
architectures with virtually indexed caches; that doesn't do TLB
maintenance. When you look for its definition on x86 or arm64, you'll
see that they use the generic implementation which is simply an empty
inline function.
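
Roughly, that generic fallback amounts to this no-op (paraphrasing
include/asm-generic/cacheflush.h):

=
#ifndef flush_cache_vmap
static inline void flush_cache_vmap(unsigned long start, unsigned long end)
{
}
#endif
=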

> So after vmapping something, I wouldn't expect isolated CPUs to have
> invalid TLB entries for the newly vmapped page.
>
> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> stale TLB entries can and will remain on isolated CPUs, up until they
> execute the deferred flush themselves (IOW for the entire duration of the
> "danger zone").
>
> Does that make sense?

The design idea wrt TLB flushes in the vmap code is that you don't do
TLB flushes when you unmap stuff or when you map stuff, because doing
TLB flushes across the entire system on every vmap/vunmap would be a
bit costly; instead you just do batched TLB flushes in between, in
__purge_vmap_area_lazy().

In other words, the basic idea is that you can keep calling vmap() and
vunmap() a bunch of times without ever doing TLB flushes until you run
out of virtual memory in the vmap region; then you do one big TLB
flush, and afterwards you can reuse the free virtual address space for
new allocations again.
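
In rough pseudocode (simplified; the helper names here are made up, the
real logic lives in mm/vmalloc.c):

=
void vunmap_like(struct vmap_area *va)
{
    unmap_kernel_page_tables(va);   /* page tables updated, no TLB flush */
    add_to_lazy_purge_list(va);     /* VA range not yet reusable */

    if (lazy_pages_outstanding() > lazy_max_pages())
        purge_vmap_area_lazy();     /* one flush_tlb_kernel_range() for the
                                     * whole batch; only afterwards can the
                                     * VA space be reallocated */
}
=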

So if you "defer" that batched TLB flush for CPUs that are not
currently running in the kernel, I think the consequence is that those
CPUs may end up with incoherent TLB state after a reallocation of the
virtual address space.

Actually, I think this would mean that your optimization is disallowed
at least on arm64 - I'm not sure about the exact wording, but arm64
has a "break before make" rule that forbids conflicting writable
address translations or something like that.

(I said "until you run out of virtual memory in the vmap region", but
that's not actually true - see the comment above lazy_max_pages() for
an explanation of the actual heuristic. You might be able to tune that
a bit if you'd be significantly happier with less frequent
interruptions, or something along those lines.)
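
For reference, the heuristic looks roughly like this (paraphrased from
mm/vmalloc.c; the details vary between kernel versions):

=
static unsigned long lazy_max_pages(void)
{
    unsigned int log = fls(num_online_cpus());

    /* tolerate ~32MB of lazily-unmapped VA per log2 step of the CPU
     * count before forcing a global flush */
    return log * (32UL * 1024 * 1024 / PAGE_SIZE);
}
=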



Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

2025-03-25 Thread Jann Horn
On Tue, Mar 25, 2025 at 6:52 PM Valentin Schneider wrote:
> On 20/02/25 09:38, Dave Hansen wrote:
> > But, honestly, I'm still not sure this is worth all the trouble. If
> > folks want to avoid IPIs for TLB flushes, there are hardware features
> > that *DO* that. Just get new hardware instead of adding this complicated
> > pile of software that we have to maintain forever. In 10 years, we'll
> > still have this software *and* 95% of our hardware has the hardware
> > feature too.
>
> Sorry, you're going to have to deal with my ignorance a little bit longer...
>
> Were you thinking x86 hardware specifically, or something else?
> AIUI things like arm64's TLBIVMALLE1IS can do what is required without any
> IPI:
>
> C5.5.78
> """
> The invalidation applies to all PEs in the same Inner Shareable shareability 
> domain as the PE that
> executes this System instruction.
> """
>
> But for (at least) these architectures:
>
>   alpha
>   x86
>   loongarch
>   mips
>   (non-freescale 8xx) powerpc
>   riscv
>   xtensa
>
> flush_tlb_kernel_range() has a path with a hardcoded use of on_each_cpu(),
> so AFAICT for these the IPIs will be sent no matter the hardware.

On X86, both AMD and Intel have some fairly recently introduced CPU
features that can shoot down TLBs remotely.

The patch series

adds support for the AMD flavor; that series landed in the current
merge window (it's present in the mainline git repository now and should
be part of 6.15). I think support for the Intel flavor has not yet
been implemented, but the linked patch series mentions a plan to look
at the Intel flavor next.