On Tue, 29 Apr 2025 19:43:05 -0700
Richard Henderson <richard.hender...@linaro.org> wrote:

> On 4/29/25 14:35, Alistair Francis wrote:
> > On Sat, Apr 26, 2025 at 3:36 AM Jonathan Cameron via
> > <qemu-devel@nongnu.org> wrote:  
> >>
> >> On Tue, 22 Apr 2025 12:26:55 -0700
> >> Richard Henderson <richard.hender...@linaro.org> wrote:
> >>  
> >>> Recover two bits from the inline flags.  
> >>
> >>
> >> Hi Richard,
> >>
> >> Early days but something (I'm fairly sure in this patch) is tripping up my
> >> favourite TCG corner case of running code out of MMIO memory (interleaved
> >> CXL memory).
> >>
> >> Only seeing it on arm64 tests so far, which aren't upstream yet...
> >> (guess what I was getting ready to post today)
> >>
> >> Back trace is:
> >>
> >> #0  0x0000555555fd4296 in cpu_atomic_fetch_andq_le_mmu 
> >> (env=0x555557ee19b0, addr=18442241572520067072, val=18446744073701163007, 
> >> oi=8244, retaddr=<optimized out>) at ../../accel/tcg/atomic_template.h:140
> >> #1  0x00007fffb6894125 in code_gen_buffer ()
> >> #2  0x0000555555fc4c46 in cpu_tb_exec (cpu=cpu@entry=0x555557ededf0, 
> >> itb=itb@entry=0x7fffb6894000 <code_gen_buffer+200511443>, 
> >> tb_exit=tb_exit@entry=0x7ffff4bfb744) at ../../accel/tcg/cpu-exec.c:455
> >> #3  0x0000555555fc51c2 in cpu_loop_exec_tb (tb_exit=0x7ffff4bfb744, 
> >> last_tb=<synthetic pointer>, pc=<optimized out>, tb=0x7fffb6894000 
> >> <code_gen_buffer+200511443>, cpu=0x555557ededf0) at 
> >> ../../accel/tcg/cpu-exec.c:904
> >> #4  cpu_exec_loop (cpu=cpu@entry=0x555557ededf0, 
> >> sc=sc@entry=0x7ffff4bfb7f0) at ../../accel/tcg/cpu-exec.c:1018
> >> #5  0x0000555555fc58f1 in cpu_exec_setjmp (cpu=cpu@entry=0x555557ededf0, 
> >> sc=sc@entry=0x7ffff4bfb7f0) at ../../accel/tcg/cpu-exec.c:1035
> >> #6  0x0000555555fc5f6c in cpu_exec (cpu=cpu@entry=0x555557ededf0) at 
> >> ../../accel/tcg/cpu-exec.c:1061
> >> #7  0x0000555556146ac3 in tcg_cpu_exec (cpu=cpu@entry=0x555557ededf0) at 
> >> ../../accel/tcg/tcg-accel-ops.c:81
> >> #8  0x0000555556146ee3 in mttcg_cpu_thread_fn 
> >> (arg=arg@entry=0x555557ededf0) at ../../accel/tcg/tcg-accel-ops-mttcg.c:94
> >> #9  0x00005555561f6450 in qemu_thread_start (args=0x555557f8f430) at 
> >> ../../util/qemu-thread-posix.c:541
> >> #10 0x00007ffff7750aa4 in start_thread (arg=<optimized out>) at 
> >> ./nptl/pthread_create.c:447
> >> #11 0x00007ffff77ddc3c in clone3 () at 
> >> ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
> >>
> >> I haven't pushed out the rebased tree yet, making this a truly awful bug
> >> report.
> >>
> >> The pull request you sent with this in wasn't bisectable, so this was a
> >> bit of a guessing game. I see the segfault only after this patch.  
> > 
> > I see the same thing with some RISC-V tests. I can provide the test
> > images as well if you want.  
> 
> 
> Yes please.
> 
> 
> r~

I'm guessing Alistair is busy.

I got around to testing this on x86, and indeed the blow-up is the same.

0x0000555555e3dd77 in cpu_atomic_add_fetchl_le_mmu (env=0x55555736bef0, 
addr=140271756837240, val=1, oi=34, retaddr=<optimized out>) at 
../../accel/tcg/atomic_template.h:143
143     GEN_ATOMIC_HELPER(add_fetch)
(gdb) bt
#0  0x0000555555e3dd77 in cpu_atomic_add_fetchl_le_mmu (env=0x55555736bef0, 
addr=140271756837240, val=1, oi=34, retaddr=<optimized out>) at 
../../accel/tcg/atomic_template.h:143
#1  0x00007fffbc31c6f0 in code_gen_buffer ()
#2  0x0000555555e23aa6 in cpu_tb_exec (cpu=cpu@entry=0x555557369330, 
itb=itb@entry=0x7fffbc31c600 <code_gen_buffer+295441875>, 
tb_exit=tb_exit@entry=0x7ffff4bfd6ec) at ../../accel/tcg/cpu-exec.c:438
#3  0x0000555555e24025 in cpu_loop_exec_tb (tb_exit=0x7ffff4bfd6ec, 
last_tb=<synthetic pointer>, pc=<optimized out>, tb=0x7fffbc31c600 
<code_gen_buffer+295441875>, cpu=0x555557369330) at 
../../accel/tcg/cpu-exec.c:872
#4  cpu_exec_loop (cpu=cpu@entry=0x555557369330, sc=sc@entry=0x7ffff4bfd7b0) at 
../../accel/tcg/cpu-exec.c:982
#5  0x0000555555e247a1 in cpu_exec_setjmp (cpu=cpu@entry=0x555557369330, 
sc=sc@entry=0x7ffff4bfd7b0) at ../../accel/tcg/cpu-exec.c:999
#6  0x0000555555e24e2c in cpu_exec (cpu=cpu@entry=0x555557369330) at 
../../accel/tcg/cpu-exec.c:1025
#7  0x0000555555e42c73 in tcg_cpu_exec (cpu=cpu@entry=0x555557369330) at 
../../accel/tcg/tcg-accel-ops.c:81
#8  0x0000555555e43093 in mttcg_cpu_thread_fn (arg=arg@entry=0x555557369330) at 
../../accel/tcg/tcg-accel-ops-mttcg.c:94
#9  0x0000555555ef2250 in qemu_thread_start (args=0x5555573e6e20) at 
../../util/qemu-thread-posix.c:541
#10 0x00007ffff7750aa4 in start_thread (arg=<optimized out>) at 
./nptl/pthread_create.c:447
#11 0x00007ffff77ddc3c in clone3 () at 
../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

I need one patch for my particular setup, to work around some DMA bounce-buffer
issues in virtio (similar to a patch for PCI space last year). I've been meaning
to post an RFC to get feedback on how to handle this, but haven't gotten to it
yet!

From 801e47897c5959a22ed050d7e7feebbbd3a12588 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <jonathan.came...@huawei.com>
Date: Mon, 22 Apr 2024 13:54:37 +0100
Subject: [PATCH] physmem: Increase bounce buffers for "memory" address space.

Doesn't need to be this big and should be configurable.

Signed-off-by: Jonathan Cameron <jonathan.came...@huawei.com>
---
 system/physmem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/system/physmem.c b/system/physmem.c
index 3f4fd69d9a..651b875827 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2798,6 +2798,7 @@ static void memory_map_init(void)
     memory_region_init(system_memory, NULL, "system", UINT64_MAX);
     address_space_init(&address_space_memory, system_memory, "memory");
 
+    address_space_memory.max_bounce_buffer_size = 1024 * 1024 * 1024;
     system_io = g_malloc(sizeof(*system_io));
     memory_region_init_io(system_io, NULL, &unassigned_io_ops, NULL, "io",
                           65536);
-- 
2.43.0


Anyhow, other than that you need any random distro image (I tend to use Debian
nocloud images) and a recent kernel build (mainline is fine).

Then a QEMU command line along the lines of the following (obviously this isn't
minimal):

qemu-system-x86_64 -M q35,cxl=on,sata=off,smbus=off -m 4g,maxmem=8G,slots=4 -cpu max -smp 4 \
 -kernel bzImage \
 -bios bios \
 -drive if=none,file=/mnt/d/images/x86-full-big.qcow2,format=qcow2,id=hd \
 -device ioh3420,id=root_port1 -device virtio-blk-pci,drive=hd,bus=root_port1 \
 -netdev user,id=mynet,hostfwd=tcp::5553-:22 -device virtio-net-pci,netdev=mynet,id=bob \
 -nographic -no-reboot -append 'earlycon console=ttyS0 root=/dev/vda3 fsck.mode=skip tp_printk maxcpus=4' \
 -monitor telnet:127.0.0.1:1235,server,nowait \
 -object memory-backend-ram,size=4G,id=mem0 \
 -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
 -numa node,nodeid=1 \
 -serial mon:stdio \
 -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest.raw,size=256M,align=256M \
 -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/cxltest2.raw,size=256M,align=256M \
 -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=1M,align=1M \
 -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/cxltest3.raw,size=256M,align=256M \
 -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/cxltest4.raw,size=256M,align=256M \
 -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/lsa2.raw,size=1M,align=1M \
 -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
 -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=2 \
 -device cxl-rp,port=1,bus=cxl.1,id=root_port2,chassis=0,slot=3 \
 -device cxl-type3,bus=root_port0,volatile-memdev=cxl-mem1,id=cxl-pmem0,lsa=cxl-lsa1,sn=3 \
 -device cxl-type3,bus=root_port2,volatile-memdev=cxl-mem3,id=cxl-pmem1,lsa=cxl-lsa2,sn=4 \
 -machine cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=1k

Then, after booting into Linux, bring up a CXL region with:

    cd /sys/bus/cxl/devices/decoder0.0/
    cat create_ram_region
    echo region0 > create_ram_region

    echo ram > /sys/bus/cxl/devices/decoder2.0/mode
    echo $((256 << 20)) > /sys/bus/cxl/devices/decoder2.0/dpa_size

    cd /sys/bus/cxl/devices/region0/
    echo 256 > interleave_granularity
    echo 1 > interleave_ways
    echo $((256 << 20)) > size 
    echo decoder2.0 > target0
    echo 1 > commit
    echo region0 > /sys/bus/cxl/drivers/cxl_region/bind

That should bring up a small amount of memory in node 2. Interleaving isn't
actually in use here, but we haven't upstreamed the bypass optimizations, so
this is still MMIO space to QEMU.

Then bind some memory to that node and run anything:

    numactl -m 2 ls

Boom.

A few relevant bits of kernel config (also not minimal):

# dax stuff to ensure we get the memory as normal RAM
CONFIG_DAX=y
CONFIG_DEV_DAX=m
CONFIG_DEV_DAX_PMEM=m
CONFIG_DEV_DAX_HMEM=y
CONFIG_DEV_DAX_CXL=m
CONFIG_DEV_DAX_HMEM_DEVICES=y
CONFIG_DEV_DAX_KMEM=m
# memory hotplug
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE=y

Any hints welcome!  Also happy to provide any additional info as necessary.

Jonathan


