On Wed, Jun 9, 2021 at 12:44 AM Uros Bizjak <ubiz...@gmail.com> wrote:
>
> [For some reason this message didn't reach my gmail account]
>
> > 1. Update move expanders to convert the CONST_WIDE_INT and CONST_VECTO
> > operands to vector broadcast from an integer with AVX2.
> > 2. Add ix86_gen_scratch_sse_rtx to return a scratch SSE register which
> > won't increase stack alignment requirement and blocks transformation by
> > the combine pass.
> > 3. Update PR 87767 tests to expect integer broadcast instead of broadcast
> > from memory.
> > 4. Update avx512f_cond_move.c to expect integer broadcast.
>
> +  else if (TARGET_64BIT
> +   && ix86_broadcast (val, GET_MODE_BITSIZE (DImode),
> +      val_broadcast))
> +    {
> +      /* NB: MOVQ takes a 32-bit signed immediate operand.  */
> +      if (trunc_int_for_mode (val_broadcast, SImode) != val_broadcast)
> + return nullptr;
> +      broadcast_mode = DImode;
> +    }
> +  else
> +    return nullptr;
>
> We have MOVABS insn and movdi_internal knows when to switch between
> MOVQ and MOVABS.

Fixed in the v4 patch.

> +  if (!ix86_expand_vector_init_duplicate (false, vector_mode, target,
> +  GEN_INT (val_broadcast)))
> +    gcc_unreachable ();
>
> We are using:
>
> bool ok = ix86_expand_vector_init_duplicate (...);
> gcc_assert (ok);
>
> idiom throughout i386/. Let's keep it this way.

Fixed in the v4 patch.

> +  if (REGNO (target) < FIRST_PSEUDO_REGISTER)
> +    target = gen_rtx_REG (mode, REGNO (target));
> +  else
> +    target = convert_to_mode (mode, target, 1);
> +
>
> This is not needed. lowpart_subreg should do the trick when changing
> mode of hard regs (also see comment for ix86_gen_scratch_sse_rtx).

Fixed in the v4 patch.

> +  rtx first;
> +
> +  if (can_create_pseudo_p ()
> +      && GET_MODE_SIZE (mode) >= 16
> +      && GET_MODE_CLASS (mode) == MODE_VECTOR_INT
> +      && (MEM_P (op1)
> +  && SYMBOL_REF_P (XEXP (op1, 0))
> +  && CONSTANT_POOL_ADDRESS_P (XEXP (op1, 0)))
> +      && (first = ix86_broadcast_from_integer_constant (mode, op1)))
> +    {
> +      /* Broadcast to XMM/YMM/ZMM register from an integer constant.  */
> +      op1 = ix86_gen_scratch_sse_rtx (mode, false);
> +      if (!ix86_expand_vector_init_duplicate (false, mode, op1, first))
> + gcc_unreachable ();
> +      emit_move_insn (op0, op1);
> +      return;
>
> Please try to avoid assignment inside the condition. And also use
> "gcc_assert (ok)" here.

Fixed in the v4 patch.

> +/* Return a scratch register in MODE for vector load and store.  If
> +   CONSTANT_INT_BROADCAST is true, it is used to hold constant integer
> +   broadcast result.  */
> +
> +rtx
> +ix86_gen_scratch_sse_rtx (machine_mode mode,
> +  bool constant_int_broadcast)
>
> This function should always return hard reg, simply:
>
> return gen_rtx_REG (mode, (TARGET_64BIT
>   ? LAST_REX_SSE_REG : LAST_SSE_REG));
>
> The complications with pseudo does not bring us anything (at the end
> we need a hard reg anyway, and I guess reload knows quite well how to
> avoid used temporary).

Fixed in the v4 patch.

> The function can then be renamed to ix86_gen_scratch_sse_reg.

I'd like to keep the ix86_gen_scratch_sse_rtx name:

rtx
ix86_gen_scratch_sse_rtx (machine_mode mode)
{
  if (TARGET_SSE)
    return gen_rtx_REG (mode, (TARGET_64BIT
                               ? LAST_REX_SSE_REG
                               : LAST_SSE_REG));
  else
    return gen_reg_rtx (mode);
}

so that it can be used to implement the target hook for memset later.

> * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Expect integer
> broadcast.
> * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
> * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
> * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
> * gcc.target/i386/avx512f_cond_move.c: Also pass
> -mprefer-vector-width=512 and expect integer broadcast.
>
> No review for the above changes for AVX512 tests, someone else should
> check if the new code is better here.

A small benchmark:

https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/vpaddd/broadcast

shows that integer broadcast is faster than embedded memory broadcast:

$ make
gcc -g -I. -O2 -march=skylake-avx512   -c -o test.o test.c
gcc -g   -c -o memory.o memory.S
gcc -g   -c -o broadcast.o broadcast.S
gcc -o test test.o memory.o broadcast.o
./test
memory      : 425538
broadcast   : 375260
$

Thanks.

-- 
H.J.

Reply via email to