On 12/6/22 13:29, Ilya Leoshkevich wrote:
> On Thu, Dec 01, 2022 at 10:51:49PM -0800, Richard Henderson wrote:
>> This reverts 829e1376d940 ("tcg/s390: Introduce TCG_REG_TB"), and
>> several follow-up patches. The primary motivation is to reduce the
>> less-tested code paths, pre-z10. Secondarily, this allows the
>> unconditional use of TCG_TARGET_HAS_direct_jump, which might be more
>> important for performance than any slight increase in code size.
>>
>> Signed-off-by: Richard Henderson <richard.hender...@linaro.org>
>> ---
>>  tcg/s390x/tcg-target.h     |   2 +-
>>  tcg/s390x/tcg-target.c.inc | 176 +++++--------------------------------
>>  2 files changed, 23 insertions(+), 155 deletions(-)
> Reviewed-by: Ilya Leoshkevich <i...@linux.ibm.com>
>
> I have a few questions/ideas for the future below.
>> diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
>> index 22d70d431b..645f522058 100644
>> --- a/tcg/s390x/tcg-target.h
>> +++ b/tcg/s390x/tcg-target.h
>> @@ -103,7 +103,7 @@ extern uint64_t s390_facilities[3];
>>  #define TCG_TARGET_HAS_mulsh_i32      0
>>  #define TCG_TARGET_HAS_extrl_i64_i32  0
>>  #define TCG_TARGET_HAS_extrh_i64_i32  0
>> -#define TCG_TARGET_HAS_direct_jump    HAVE_FACILITY(GEN_INST_EXT)
>> +#define TCG_TARGET_HAS_direct_jump    1
> This change doesn't seem to affect that, but what is the minimum
> supported s390x qemu host? z900?
Possibly z990, if I'm reading the gcc processor_flags_table[] correctly;
long-displacement-facility is definitely a minimum.

We probably should revisit what the minimum for TCG should be, assert
those features at startup, and drop the corresponding runtime tests.
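
A minimal sketch of the kind of startup assertion I mean (hypothetical
helper; HAVE_FACILITY() and the facility names are the existing ones from
tcg-target.h, error_report() is from qemu/error-report.h, and the baseline
asserted here is only an example, not a decision):

    /*
     * Hypothetical: run once at startup (e.g. from query_s390_facilities)
     * and refuse to start rather than emit insns the host lacks.  Once
     * this holds, the matching HAVE_FACILITY() tests in the generators
     * become dead code.
     */
    static void tcg_s390_assert_baseline(void)
    {
        if (!HAVE_FACILITY(LONG_DISP) || !HAVE_FACILITY(EXT_IMM)) {
            error_report("s390x host lacks facilities required by TCG "
                         "(long-displacement, extended-immediate)");
            exit(EXIT_FAILURE);
        }
    }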
> I did some benchmarking of various ways to load constants in the context
> of GCC in the past, and it turned out that LLIHF+OILF is more efficient
> than the literal pool [1].
Interesting. If we include extended-immediate-facility (base_GEN9_GA1,
z9-109?) in the bare minimum, that would definitely simplify a few things.
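
For the record, the sequence from [1] sketched with the existing RIL
emitters (the helper name is made up, and where it would slot into
tcg_out_movi is glossed over):

    /*
     * Sketch: materialize a 64-bit constant in two insns instead of a
     * literal-pool load.  LLIHF sets bits 0-31 and zeroes bits 32-63;
     * OILF then ORs in the low 32 bits.  Both are 6-byte RIL insns from
     * the extended-immediate facility.
     */
    static void tcg_out_movi_llihf_oilf(TCGContext *s, TCGReg ret,
                                        uint64_t val)
    {
        tcg_out_insn(s, RIL, LLIHF, ret, val >> 32);
        if ((uint32_t)val != 0) {
            tcg_out_insn(s, RIL, OILF, ret, (uint32_t)val);
        }
    }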
>> -    /* Use the constant pool if USE_REG_TB, but not for small constants. */
>> -    if (maybe_out_small_movi(s, type, TCG_TMP0, val)) {
>> -        if (type == TCG_TYPE_I32) {
>> -            tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
>> -        } else {
>> -            tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
>> -        }
>> -    } else if (USE_REG_TB) {
>> -        tcg_out_insn(s, RXY, XG, dest, TCG_REG_TB, TCG_REG_NONE, 0);
>> -        new_pool_label(s, val, R_390_20, s->code_ptr - 2,
>> -                       tcg_tbrel_diff(s, NULL));
>> +    tcg_out_movi(s, type, TCG_TMP0, val);
>> +    if (type == TCG_TYPE_I32) {
>> +        tcg_out_insn(s, RR, XR, dest, TCG_TMP0);
>>      } else {
>> -        /* Perform the xor by parts. */
>> -        tcg_debug_assert(HAVE_FACILITY(EXT_IMM));
>> -        if (val & 0xffffffff) {
>> -            tcg_out_insn(s, RIL, XILF, dest, val);
>> -        }
>> -        if (val > 0xffffffff) {
>> -            tcg_out_insn(s, RIL, XIHF, dest, val >> 32);
>> -        }
>> +        tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);
>>      }
>>  }
> Wouldn't it be worth keeping XILF/XIHF here?
I don't know. It's difficult for me to guess whether a dependency chain like

  val -> xor -> xor

(3 insns with serial dependencies) is better than

  val --> xor
  load -/

(3 insns, but only one serial dependency). But there may also be
instruction fusion going on at the micro-architectural level, so that
there's really only one xor.

If you have suggestions, I'm all ears.
> I don't have any numbers right now, but it looks more compact/efficient
> than a load + XGR.
If we assume general-instruction-extension-facility (z10?), LGRL + XGR is
smaller than XILF + XIHF, ignoring the constant pool entry, which might be
shared, and modulo the µarch questions above.
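
Spelled out, the two sequences under discussion (a sketch; the emitters
and the pool relocation are the ones tcg_out_movi_int already uses):

    /* (a) xor by parts: two 6-byte RIL insns, the second dependent on
     * the result of the first. */
    if (val & 0xffffffffull) {
        tcg_out_insn(s, RIL, XILF, dest, val);
    }
    if (val >> 32) {
        tcg_out_insn(s, RIL, XIHF, dest, val >> 32);
    }

    /* (b) pool load + register xor: 6 + 4 bytes, plus an 8-byte pool
     * entry that might be shared; the load has no dependency on dest.
     * LGRL requires general-instruction-extension. */
    tcg_out_insn(s, RIL, LGRL, TCG_TMP0, 0);
    new_pool_label(s, val, R_390_PC32DBL, s->code_ptr - 2, 2);
    tcg_out_insn(s, RRE, XGR, dest, TCG_TMP0);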
> Same for OGR above; I even wonder if both implementations could be
> unified.
Sadly not, because of OILL et al. There are no 16-bit xor immediate insns.
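
Concretely: the ori path can use the 16-bit RI immediates, for which xor
has no equivalent, so a shared by-parts helper would pessimize the OR
side (sketch, mirroring what the existing ori path can do):

    /* OR can touch exactly the nonzero halfwords with 4-byte RI insns
     * (OILL/OILH/OIHL/OIHH); XOR only has the 32-bit XILF/XIHF. */
    if (val & 0xffffull) {
        tcg_out_insn(s, RI, OILL, dest, val);
    }
    if ((val >> 16) & 0xffffull) {
        tcg_out_insn(s, RI, OILH, dest, val >> 16);
    }
    /* ... and OIHL/OIHH likewise for the high halfwords ... */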
>> +        /*
>> +         * branch displacement must be aligned for atomic patching;
>> +         * see if we need to add extra nop before branch
>> +         */
>> +        if (!QEMU_PTR_IS_ALIGNED(s->code_ptr + 1, 4)) {
>> +            tcg_out16(s, NOP);
>>          }
>> +        tcg_out16(s, RIL_BRCL | (S390_CC_ALWAYS << 4));
>> +        s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
>> +        tcg_out32(s, 0);
>>          set_jmp_reset_offset(s, a0);
> This seems to work in practice, but I don't think patching branch
> offsets is allowed by the PoP (in a multi-threaded environment). For
> example, I had to do [2] in order to work around this limitation in
> ftrace.
Really? How does the processor distinguish between overwriting the
opcode/condition vs overwriting the immediate operand when invalidating
cached instructions?

If overwriting the operand truly isn't correct, then I think we have to
always use an indirect branch for goto_tb.
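
For reference, the patching side is a single aligned 4-byte store over
BRCL's displacement field; that alignment is exactly what the nop padding
above guarantees. Roughly, mirroring tb_target_set_jmp_target in
tcg-target.h:

    /*
     * Rewrite only the 32-bit displacement of BRCL.  The displacement
     * starts 2 bytes into the 6-byte insn, which is why the emitter pads
     * with a NOP until code_ptr + 1 (a halfword pointer) is 4-byte
     * aligned.  Displacements are halfword-scaled, relative to the start
     * of the insn.
     */
    static inline void tb_target_set_jmp_target(uintptr_t tc_ptr,
                                                uintptr_t jmp_rx,
                                                uintptr_t jmp_rw,
                                                uintptr_t addr)
    {
        intptr_t disp = addr - (jmp_rx - 2);
        qatomic_set((int32_t *)jmp_rw, disp / 2);
        /* no explicit icache flush needed */
    }

Whether a CPU concurrently executing the BRCL is guaranteed to see either
the old or the new displacement, and nothing else, is precisely the PoP
question here.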
> A third benefit seems to be that we now have one more register to
> allocate.
Yes. It's call-clobbered, so it isn't live that often, but sometimes.
r~