On Fri, Dec 21, 2018 at 11:57:55AM -0600, Tamar Christina wrote:
> Hi All,
>
> This updated patch adds NEON intrinsics and tests for the Armv8.3-a complex
> multiplication and add instructions with a rotate along the Argand plane.
>
> The instructions are documented in the ArmARM [1] and the intrinsics
> specification will be published on the Arm website [2].
>
> The lane versions of these instructions are special in that they always
> select a pair: using index 0 means selecting lanes 0 and 1.  Because of
> this, the range check for the intrinsics requires special handling.
>
> There are a few complexities with the intrinsics for the laneq variants on
> AArch64:
>
> 1) The architecture does not have a version for V2SF.  However, since the
> instructions always select a pair of values, the only valid index for V2SF
> would have been 0.  As such, the lane versions for V2SF are all mapped to
> the 3SAME variant of the instructions and not the by-element variant.
>
> 2) Because of (1) above, the laneq versions of the instruction become
> tricky.  The valid indices are 0 and 1.  For index 0 we treat it the same
> as the lane version of this instruction and just pass the lower half of
> the register to the 3SAME instruction.  When the index is 1 we extract the
> upper half of the register and pass that to the 3SAME version of the
> instruction.
>
> 3) The architecture forbids the laneq version of the V4HF instruction from
> having an index greater than 1.  For indices 0-1 we do no extra work.  For
> indices 2-3 we extract the upper half of the register, pass that to the
> instruction it would normally have used, and re-map the index into the
> range 0-1.
>
> [1] https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
> [2] https://developer.arm.com/docs/101028/latest
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> Additional runtime checks were done but are not posted with the patch.
>
> Ok for trunk?
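
As an illustration of the pair-wise lane selection described above, here is
a minimal sketch (not part of the patch) using the ACLE intrinsic names from
[2]; the build options are an assumption, e.g. -march=armv8.3-a on a GCC
that carries this patch.

#include <arm_neon.h>

/* For the laneq form on V2SF the only valid indices are 0 and 1:
   index 1 names the complex pair held in elements 2 and 3 of B,
   i.e. the upper 64 bits of the 128-bit register, which is exactly
   the case the patch handles by extracting the high half.  */
float32x2_t
acc_upper_pair (float32x2_t acc, float32x2_t a, float32x4_t b)
{
  return vcmla_laneq_f32 (acc, a, b, 1);
}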
OK with a refactor.

This isn't a great fit for Stage 4, but it is also completely self-contained.
I hope we can slow down new content in the AArch64 back-end and start
stabilising the port for release.

Thanks,
James

> @@ -1395,6 +1494,80 @@ aarch64_expand_builtin (tree exp,
>         }
>
>       return target;
> +
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ0_V2SF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ90_V2SF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ180_V2SF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ270_V2SF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ0_V4HF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ90_V4HF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ180_V4HF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ270_V4HF:

Pull all of this out to a new function please.

> +      int bcode = fcode - AARCH64_SIMD_FCMLA_LANEQ_BUILTIN_BASE - 1;
> +      aarch64_fcmla_laneq_builtin_datum* d
> +        = &aarch64_fcmla_lane_builtin_data[bcode];
> +      machine_mode quadmode = GET_MODE_2XWIDER_MODE (d->mode).require ();
> +      op0 = force_reg (d->mode, expand_normal (CALL_EXPR_ARG (exp, 0)));
> +      rtx op1 = force_reg (d->mode, expand_normal (CALL_EXPR_ARG (exp, 1)));
> +      rtx op2 = force_reg (quadmode, expand_normal (CALL_EXPR_ARG (exp, 2)));
> +      tree tmp = CALL_EXPR_ARG (exp, 3);
> +      rtx lane_idx = expand_expr (tmp, NULL_RTX, VOIDmode,
> +                                  EXPAND_INITIALIZER);
> +
> +      /* Validate that the lane index is a constant.  */
> +      if (!CONST_INT_P (lane_idx))
> +        {
> +          error ("%Kargument %d must be a constant immediate", exp, 4);
> +          return const0_rtx;
> +        }
> +
> +      /* Validate that the index is within the expected range.  */
> +      int nunits = GET_MODE_NUNITS (quadmode).to_constant ();
> +      aarch64_simd_lane_bounds (lane_idx, 0, nunits / 2, exp);
> +
> +      /* Keep to GCC-vector-extension lane indices in the RTL.  */
> +      lane_idx = aarch64_endian_lane_rtx (quadmode, INTVAL (lane_idx));
> +
> +      /* Generate the correct register and mode.  */
> +      int lane = INTVAL (lane_idx);
> +
> +      if (lane < nunits / 4)
> +        op2 = simplify_gen_subreg (d->mode, op2, quadmode, 0);
> +      else
> +        {
> +          /* Select the upper 64 bits, either a V2SF or V4HF, this however
> +             is quite messy, as the operation required even though simple
> +             doesn't have a simple RTL pattern, and seems it's quite hard to
> +             define using a single RTL pattern.  The target generic version
> +             gen_highpart_mode generates code that isn't optimal.  */
> +          rtx temp1 = gen_reg_rtx (d->mode);
> +          rtx temp2 = gen_reg_rtx (DImode);
> +          temp1 = simplify_gen_subreg (d->mode, op2, quadmode, 0);
> +          temp1 = simplify_gen_subreg (V2DImode, temp1, d->mode, 0);
> +          emit_insn (gen_aarch64_get_lanev2di (temp2, temp1 , const1_rtx));
> +          op2 = simplify_gen_subreg (d->mode, temp2, GET_MODE (temp2), 0);
> +
> +          /* And recalculate the index.  */
> +          lane -= nunits / 4;
> +        }
> +
> +      if (!target)
> +        target = gen_reg_rtx (d->mode);
> +      else
> +        target = force_reg (d->mode, target);
> +
> +      rtx pat = NULL_RTX;
> +
> +      if (d->lane)
> +        pat = GEN_FCN (d->icode) (target, op0, op1, op2,
> +                                  gen_int_mode (lane, SImode));
> +      else
> +        pat = GEN_FCN (d->icode) (target, op0, op1, op2);
> +
> +      if (!pat)
> +        return NULL_RTX;
> +
> +      emit_insn (pat);
> +      return target;
>      }
>
>    if (fcode >= AARCH64_SIMD_BUILTIN_BASE && fcode <= AARCH64_SIMD_BUILTIN_MAX)
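
For concreteness, one possible shape for the requested refactor follows.
This is a sketch only: the helper name aarch64_expand_fcmla_builtin and its
exact signature are assumptions, not something taken from the patch; the body
is simply the code from the case arm above moved into a static helper, with
the comments reworded.

static rtx
aarch64_expand_fcmla_builtin (tree exp, rtx target, int fcode)
{
  int bcode = fcode - AARCH64_SIMD_FCMLA_LANEQ_BUILTIN_BASE - 1;
  aarch64_fcmla_laneq_builtin_datum *d
    = &aarch64_fcmla_lane_builtin_data[bcode];
  machine_mode quadmode = GET_MODE_2XWIDER_MODE (d->mode).require ();
  rtx op0 = force_reg (d->mode, expand_normal (CALL_EXPR_ARG (exp, 0)));
  rtx op1 = force_reg (d->mode, expand_normal (CALL_EXPR_ARG (exp, 1)));
  rtx op2 = force_reg (quadmode, expand_normal (CALL_EXPR_ARG (exp, 2)));
  rtx lane_idx = expand_expr (CALL_EXPR_ARG (exp, 3), NULL_RTX, VOIDmode,
                              EXPAND_INITIALIZER);

  /* The lane argument must be a constant immediate.  */
  if (!CONST_INT_P (lane_idx))
    {
      error ("%Kargument %d must be a constant immediate", exp, 4);
      return const0_rtx;
    }

  /* The index counts complex pairs, so the valid range is half the number
     of elements of the quad-width mode.  */
  int nunits = GET_MODE_NUNITS (quadmode).to_constant ();
  aarch64_simd_lane_bounds (lane_idx, 0, nunits / 2, exp);

  /* Keep to GCC-vector-extension lane indices in the RTL.  */
  lane_idx = aarch64_endian_lane_rtx (quadmode, INTVAL (lane_idx));
  int lane = INTVAL (lane_idx);

  if (lane < nunits / 4)
    /* The selected pair sits in the low 64 bits: take the low subreg.  */
    op2 = simplify_gen_subreg (d->mode, op2, quadmode, 0);
  else
    {
      /* The selected pair sits in the high 64 bits: extract it via
         gen_aarch64_get_lanev2di and re-base the index, as the patch does.  */
      rtx temp1 = gen_reg_rtx (d->mode);
      rtx temp2 = gen_reg_rtx (DImode);
      temp1 = simplify_gen_subreg (d->mode, op2, quadmode, 0);
      temp1 = simplify_gen_subreg (V2DImode, temp1, d->mode, 0);
      emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const1_rtx));
      op2 = simplify_gen_subreg (d->mode, temp2, GET_MODE (temp2), 0);
      lane -= nunits / 4;
    }

  if (!target)
    target = gen_reg_rtx (d->mode);
  else
    target = force_reg (d->mode, target);

  rtx pat;
  if (d->lane)
    pat = GEN_FCN (d->icode) (target, op0, op1, op2,
                              gen_int_mode (lane, SImode));
  else
    pat = GEN_FCN (d->icode) (target, op0, op1, op2);

  if (!pat)
    return NULL_RTX;

  emit_insn (pat);
  return target;
}

The switch arm in aarch64_expand_builtin would then reduce to a single call:

    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ0_V2SF:
    /* ... the remaining FCMLA_LANEQ cases ...  */
    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ270_V4HF:
      return aarch64_expand_fcmla_builtin (exp, target, fcode);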