On 2024/9/5 14:03, Richard Henderson wrote:
On 9/4/24 07:27, LIU Zhiwei wrote:
From: TANG Tiancheng <tangtiancheng....@alibaba-inc.com>

In RISC-V, vector operations require initial configuration using
the vset{i}vl{i} instruction.

This instruction:
   1. Sets the vector length (vl), counted in elements
   2. Configures the vtype register, which includes:
     SEW (Selected Element Width)
     LMUL (vector register group multiplier)
     Other vector operation parameters

This configuration is crucial for defining subsequent vector
operation behavior. To optimize performance, the configuration
process is managed dynamically:
   1. Reconfiguration using vset{i}vl{i} is necessary when SEW
      or vector register group width changes.
   2. The vset instruction can be omitted when configuration
      remains unchanged.

This optimization is only effective within a single TB.
Each TB requires reconfiguration at its start, as the current
state cannot be obtained from hardware.

Signed-off-by: TANG Tiancheng <tangtiancheng....@alibaba-inc.com>
Signed-off-by: Weiwei Li <liwei1...@gmail.com>
Reviewed-by: Liu Zhiwei <zhiwei_...@linux.alibaba.com>
---
  include/tcg/tcg.h          |   3 +
  tcg/riscv/tcg-target.c.inc | 128 +++++++++++++++++++++++++++++++++++++
  2 files changed, 131 insertions(+)

diff --git a/include/tcg/tcg.h b/include/tcg/tcg.h
index 21d5884741..267e6ff95c 100644
--- a/include/tcg/tcg.h
+++ b/include/tcg/tcg.h
@@ -566,6 +566,9 @@ struct TCGContext {
        /* Exit to translator on overflow. */
      sigjmp_buf jmp_trans;
+
+    /* For host-specific values. */
+    int riscv_host_vtype;
  };

(1) At minimum this needs #ifdef __riscv.
    I planned to think of a cleaner way to do this,
    but haven't gotten there yet.
    I had also planned to place it higher in the structure, before
    the large temp arrays, so that the structure offset would be smaller.

(2) I have determined through experimentation that vtype alone is insufficient.
    While vtype + avl would be sufficient, it is inefficient.
    Best to store the original inputs: TCGType and SEW, since that way
    there's no effort required when querying the current SEW for use in
    load/store/logicals.

    The bug here appears as TCG swaps between TCGTypes for different
    operations.  E.g. if the vtype computed for (V64, E8) is the same
    as the vtype computed for (V128, E8), with AVL differing, then we
    will incorrectly omit the vsetvl instruction.

    My test case was tcg/tests/aarch64-linux-user/sha1-vector
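
To make the failure mode in (2) concrete, here is a minimal sketch, not the
patch's code. It assumes a hypothetical fixed-LMUL policy (vtype_for), so that
vtype depends only on SEW; the TYPE_*/SEW_* constants, VsetState and the
update_* helpers are illustrative stand-ins as well. The vtype bit layout
(vlmul[2:0], vsew[5:3], vta[6], vma[7]) follows the RVV 1.0 spec.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for TCGType / MemOp values; TYPE_V64 == 3 as in the thread. */
enum { TYPE_V64 = 3, TYPE_V128 = 4 };
enum { SEW_8 = 0, SEW_64 = 3 };

typedef struct {
    unsigned cur_vtype;  /* key of the vtype-only cache */
    int cur_type;        /* keys of the (type, vsew) cache */
    int cur_vsew;
} VsetState;

/* vtype layout per RVV 1.0: vlmul[2:0], vsew[5:3], vta[6], vma[7]. */
static unsigned encode_vtype(bool vta, bool vma, unsigned vsew, unsigned vlmul)
{
    return ((unsigned)vma << 7) | ((unsigned)vta << 6) | (vsew << 3) | (vlmul & 7);
}

/* Vector size in bytes: 8 for V64, 16 for V128, ... */
static unsigned type_size(int type)
{
    assert(type >= TYPE_V64);
    return 8u << (type - TYPE_V64);
}

/* ASSUMPTION: hypothetical policy that always picks LMUL=1 (vlmul == 0). */
static unsigned vtype_for(int type, int vsew)
{
    (void)type;
    return encode_vtype(true, true, (unsigned)vsew, 0);
}

/* Element count requested for this type at this element width. */
static unsigned avl_for(int type, int vsew)
{
    return type_size(type) >> vsew;
}

/* Cache keyed on vtype alone; returns true when a vset would be emitted. */
static bool update_by_vtype(VsetState *s, int type, int vsew)
{
    unsigned vtype = vtype_for(type, vsew);
    if (vtype == s->cur_vtype) {
        return false;
    }
    s->cur_vtype = vtype;
    return true;
}

/* Cache keyed on the original inputs; returns true when a vset would be emitted. */
static bool update_by_type_sew(VsetState *s, int type, int vsew)
{
    if (type == s->cur_type && vsew == s->cur_vsew) {
        return false;
    }
    s->cur_type = type;
    s->cur_vsew = vsew;
    return true;
}
```

Under that assumed policy, switching from (V64, E8) to (V128, E8) leaves vtype
unchanged while avl_for() goes from 8 to 16, so update_by_vtype() reports that
no vset is needed, whereas update_by_type_sew() still requests one.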

Agree.

The naming of these functions is varied and inconsistent.
I suggest the following:


static void set_vtype(TCGContext *s, TCGType type, MemOp vsew)
{
    unsigned vtype, insn, avl;
    int lmul;
    RISCVVlmul vlmul;
    bool lmul_eq_avl;

    s->riscv_cur_type = type;
    s->riscv_cur_vsew = vsew;

    /* Match riscv_lg2_vlenb to TCG_TYPE_V64. */
    QEMU_BUILD_BUG_ON(TCG_TYPE_V64 != 3);

    lmul = type - riscv_lg2_vlenb;
    if (lmul < -3) {
        /* Host VLEN >= 1024 bits. */
        vlmul = VLMUL_M1;
I am not sure if we should use VLMUL_MF8 here.
        lmul_eq_avl = false;
    } else if (lmul < 3) {
        /* 1/8 ... 1 ... 8 */
        vlmul = lmul & 7;
        lmul_eq_avl = true;
    } else {
        /* Guaranteed by Zve64x. */
        g_assert_not_reached();
    }

    avl = tcg_type_size(type) >> vsew;
    vtype = encode_vtype(true, true, vsew, vlmul);

    if (avl < 32) {
        insn = encode_i(OPC_VSETIVLI, TCG_REG_ZERO, avl, vtype);
Which may benefit here? We usually use the smallest lmul we can, to limit macro-op splitting.
    } else if (lmul_eq_avl) {
        /* rd != 0 and rs1 == 0 uses vlmax */
        insn = encode_i(OPC_VSETVLI, TCG_REG_TMP0, TCG_REG_ZERO, vtype);
    } else {
        tcg_out_opc_imm(s, OPC_ADDI, TCG_REG_TMP0, TCG_REG_ZERO, avl);
        insn = encode_i(OPC_VSETVLI, TCG_REG_ZERO, TCG_REG_TMP0, vtype);
And perhaps here.
    }
    tcg_out32(s, insn);
}

static MemOp set_vtype_len(TCGContext *s, TCGType type)
{
    if (type != s->riscv_cur_type) {
        set_type(s, type, MO_64);
I think you mean set_vtype here.
    }
    return s->riscv_cur_vsew;
}

static void set_vtype_len_sew(TCGContext *s, TCGType type, MemOp vsew)
{
    if (type != s->riscv_cur_type || vsew != s->riscv_cur_vsew) {
        set_type(s, type, vsew);

And set_vtype here.

Thanks,
Zhiwei

    }
}


(1) The storing of lg2(vlenb) means we can convert all of the division into subtraction.
(2) get_vec_type_bytes() already exists as tcg_type_size().
(3) Make use of the signed 3-bit encoding of vlmul.
(4) Make use of rd != 0, rs1 = 0 for the relatively common case of AVL = 32.
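
As a rough worked illustration of points (1), (3) and (4), the following
standalone sketch mirrors the selection logic of the suggested set_vtype()
without emitting any instructions. The names choose_lmul, choose_form,
LmulChoice, VsetForm and the TYPE_* constants are hypothetical stand-ins, and
lg2_vlenb is passed as a parameter rather than read from riscv_lg2_vlenb.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for TCGType values; TYPE_V64 == 3 as asserted in the thread. */
enum { TYPE_V64 = 3, TYPE_V128 = 4, TYPE_V256 = 5 };

typedef struct {
    unsigned vlmul;     /* 3-bit vlmul field of vtype */
    bool lmul_eq_avl;   /* true when VLMAX for this LMUL equals the AVL */
} LmulChoice;

typedef enum {
    FORM_VSETIVLI,       /* AVL fits the 5-bit immediate (0..31) */
    FORM_VSETVLI_VLMAX,  /* rd != 0, rs1 == x0: vl = VLMAX */
    FORM_VSETVLI_REG     /* AVL loaded into a scratch register first */
} VsetForm;

static LmulChoice choose_lmul(int type, int lg2_vlenb)
{
    /* log2(vector bytes / VLENB): a subtraction instead of a division. */
    int lmul = type - lg2_vlenb;
    LmulChoice c;

    /* Larger groups are assumed unreachable, as in the suggested code. */
    assert(lmul < 3);
    if (lmul < -3) {
        /* Vector is smaller than VLEN/8: fall back to LMUL=1. */
        c.vlmul = 0;
        c.lmul_eq_avl = false;
    } else {
        /* Signed 3-bit encoding: -3 -> 5 (mf8), -1 -> 7 (mf2), 0 -> 0 (m1). */
        c.vlmul = (unsigned)lmul & 7;
        c.lmul_eq_avl = true;
    }
    return c;
}

static VsetForm choose_form(unsigned avl, bool lmul_eq_avl)
{
    if (avl < 32) {
        return FORM_VSETIVLI;
    }
    return lmul_eq_avl ? FORM_VSETVLI_VLMAX : FORM_VSETVLI_REG;
}
```

For example, with VLEN = 128 (lg2_vlenb = 4), V64 maps to vlmul 7 (mf2), V128
to 0 (m1) and V256 to 1 (m2). V256 with E8 gives AVL = 32, which no longer fits
the vsetivli immediate, so the VLMAX form is selected.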


r~
