On 11/4/24 9:48 AM, Richard Henderson wrote:
On 10/30/24 15:25, Paolo Savini wrote:
Thanks for the review Richard.

On 10/30/24 11:40, Richard Henderson wrote:
On 10/29/24 19:43, Paolo Savini wrote:
This patch optimizes the emulation of unit-stride load/store RVV instructions
when the data being loaded/stored per iteration amounts to 16 bytes or more.
The optimization consists of calling __builtin_memcpy on chunks of data of 16
bytes between the memory address of the simulated vector register and the
destination memory address and vice versa.
This is done only if we have direct access to the RAM of the host machine,
if the host is little endian and if it supports atomic 128-bit memory
operations.

Signed-off-by: Paolo Savini <paolo.sav...@embecosm.com>
---
  target/riscv/vector_helper.c    | 17 ++++++++++++++++-
  target/riscv/vector_internals.h | 12 ++++++++++++
  2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 75c24653f0..e1c100e907 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -488,7 +488,22 @@ vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
     }
 
     fn = fns[is_load][group_size];
-    fn(vd, byte_offset, host + byte_offset);
+
+    /* __builtin_memcpy uses host 16 bytes vector loads and stores if supported.
+     * We need to make sure that these instructions have guarantees of atomicity.
+     * E.g. x86 processors provide strong guarantees of atomicity for 16-byte
+     * memory operations if the memory operands are 16-byte aligned */
+    if (!HOST_BIG_ENDIAN && (byte_offset + 16 < byte_end) &&
+            ((byte_offset % 16) == 0) && HOST_128_ATOMIC_MEM_OP) {
+      group_size = MO_128;
+      if (is_load) {
+        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
+      } else {
+        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
+      }

I said this last time and I'll say it again:

    __builtin_memcpy DOES NOT equal VMOVDQA
I am aware of this. I took __builtin_memcpy as a generic enough way to emulate
loads and stores that should allow several hosts to generate the widest
load/store instructions they can. On x86 I see this generates vmovdqu/movdqu
instructions, which are not always guaranteed to be atomic; x86 though
guarantees them to be atomic if the memory address is aligned to 16 bytes.

No, AMD guarantees MOVDQU is atomic if aligned, Intel does not.
See the comment in util/cpuinfo-i386.c, and the two CPUINFO_ATOMIC_VMOVDQ[AU] 
bits.

See also host/include/*/host/atomic128-ldst.h, HAVE_ATOMIC128_RO, and 
atomic16_read_ro.
Not that I think you should use that here; it's complicated, and I think you're 
better off relying on the code in accel/tcg/ when more than byte atomicity is 
required.


Not sure if that's what you meant, but I didn't find any clear example of
multi-byte atomicity using qatomic_read() and friends that would be closer
to what memcpy() is doing here. I found one example in bdrv_graph_co_rdlock()
that seems to use a memory barrier via smp_mb() and qatomic_read() inside a
loop, but I don't understand that code well enough to say.
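For what it's worth, my reading is that qatomic_read()/qatomic_set()
(include/qemu/atomic.h) only handle operands up to the host word size, so the
closest equivalent of the memcpy above would be something like the sketch
below, which only gives per-doubleword atomicity, not a single 16-byte access
(names of the helper and locals are mine):

#include "qemu/atomic.h"

/* Sketch: copy 16 bytes with word-sized atomic accesses.  Each 8-byte
 * element is atomic on its own, but the 16-byte group as a whole is not. */
static void copy16_by_doublewords(void *dst, const void *src)
{
    uint64_t *d = dst;
    const uint64_t *s = src;

    for (int i = 0; i < 2; i++) {
        qatomic_set(&d[i], qatomic_read(&s[i]));
    }
}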

I'm also wondering if a common pthread mutex wrapping these memcpy() calls
would suffice in this case. Even if we can't guarantee that __builtin_memcpy()
will use arch-specific vector insns on the host, it would already be a faster
path than falling back to fn(...). Roughly what I have in mind is sketched below.
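Sketch only; the lock and wrapper names are invented, and of course this only
helps if every other access to the same guest memory takes the same lock:

#include "qemu/thread.h"

/* Sketch of the mutex idea: serialize the wide copies behind one lock.
 * The lock would need qemu_mutex_init() somewhere during setup. */
static QemuMutex vector_ldst_lock;

static void copy16_locked(void *dst, const void *src)
{
    qemu_mutex_lock(&vector_ldst_lock);
    __builtin_memcpy(dst, src, 16);
    qemu_mutex_unlock(&vector_ldst_lock);
}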

In a quick detour, I'm not sure we've really considered how ARM SVE implements
these helpers. E.g. gen_sve_str():

https://gitlab.com/qemu-project/qemu/-/blob/master/target/arm/tcg/translate-sve.c#L4182

I'm curious to see if this form of front-end optimization, i.e. using TCG ops
instead of a for() loop with ldst_elem() like we're doing today, would yield
more performance with fewer complications around backend specifics, atomicity
and whatnot. In fact I have a feeling this is not the first time we've talked
about borrowing ideas from SVE.
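To illustrate the shape of what I mean, very loosely translated from the
gen_sve_ldr()/gen_sve_str() approach to a RISC-V unit-stride load at translate
time. This is a sketch only: vreg_ofs(), s->mem_idx and the len/addr plumbing
are assumptions about the surrounding translator code, and host-endianness of
the vreg layout plus the MO_ATOM_* atomicity flags are glossed over:

/* Sketch: emit TCG ops that load guest data 8 bytes at a time and store it
 * straight into the vector register file, instead of calling a per-element
 * helper at run time.  Advances addr in place. */
static void gen_unit_stride_ld_sketch(DisasContext *s, int vd, TCGv addr,
                                      int len)
{
    TCGv_i64 t0 = tcg_temp_new_i64();

    for (int i = 0; i < len; i += 8) {
        tcg_gen_qemu_ld_i64(t0, addr, s->mem_idx, MO_LEUQ);
        tcg_gen_st_i64(t0, tcg_env, vreg_ofs(s, vd) + i);
        tcg_gen_addi_tl(addr, addr, 8);
    }
}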


Thanks,

Daniel

r~
