On Fri, Dec 21, 2018 at 11:57:55AM -0600, Tamar Christina wrote:
> Hi All,
>
> This updated patch adds NEON intrinsics and tests for the Armv8.3-a complex
> multiplication and add instructions with a rotate along the Argand plane.
>
> The instructions are documented in the ArmARM [1] and the intrinsics
> specification will be published on the Arm website [2].
>
> The lane versions of these instructions are special in that they always
> select a pair: using index 0 means selecting lanes 0 and 1.  Because of
> this, the range check for the intrinsics requires special handling.
>
> There are a few complexities with the intrinsics for the laneq variants on
> AArch64:
>
> 1) The architecture does not have a version for V2SF.  However, since the
> instructions always select a pair of values, the only valid index for V2SF
> would have been 0.  As such, the lane versions for V2SF are all mapped to
> the 3SAME variant of the instructions and not the by-element variant.
>
> 2) Because of (1) above, the laneq versions of the instruction become
> tricky.  The valid indices are 0 and 1.  For index 0 we treat it the same
> as the lane version of this instruction and just pass the lower half of
> the register to the 3SAME instruction.  When the index is 1 we extract the
> upper half of the register and pass that to the 3SAME version of the
> instruction.
>
> 3) The architecture forbids the laneq version of the V4HF instruction from
> having an index greater than 1.  For indices 0-1 we do no extra work.  For
> indices 2-3 we extract the upper half of the register, pass that to the
> instruction it would normally have used, and re-map the index into the
> range 0-1.
>
> [1] https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
> [2] https://developer.arm.com/docs/101028/latest
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> Additional runtime checks were done but are not posted with the patch.
>
> Ok for trunk?
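
As an illustration of the pair-wise lane selection described above, here is
a minimal sketch (not part of the patch) using the ACLE intrinsic names from
[2]; the build options are an assumption, e.g. -march=armv8.3-a on a GCC
that carries this patch.

#include <arm_neon.h>

/* For the laneq form on V2SF the only valid indices are 0 and 1:
   index 1 names the complex pair held in elements 2 and 3 of B,
   i.e. the upper 64 bits of the 128-bit register, which is exactly
   the case the patch handles by extracting the high half.  */
float32x2_t
acc_upper_pair (float32x2_t acc, float32x2_t a, float32x4_t b)
{
  return vcmla_laneq_f32 (acc, a, b, 1);
}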
OK with a refactor.

This isn't a great fit for Stage 4, but it is also completely self-contained.
I hope we can slow down new content in the AArch64 back-end and start
stabilising the port for release.

Thanks,
James

> @@ -1395,6 +1494,80 @@ aarch64_expand_builtin (tree exp,
>         }
>
>       return target;
> +
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ0_V2SF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ90_V2SF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ180_V2SF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ270_V2SF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ0_V4HF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ90_V4HF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ180_V4HF:
> +    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ270_V4HF:

Pull all of this out to a new function please.

> +      int bcode = fcode - AARCH64_SIMD_FCMLA_LANEQ_BUILTIN_BASE - 1;
> +      aarch64_fcmla_laneq_builtin_datum* d
> +        = &aarch64_fcmla_lane_builtin_data[bcode];
> +      machine_mode quadmode = GET_MODE_2XWIDER_MODE (d->mode).require ();
> +      op0 = force_reg (d->mode, expand_normal (CALL_EXPR_ARG (exp, 0)));
> +      rtx op1 = force_reg (d->mode, expand_normal (CALL_EXPR_ARG (exp, 1)));
> +      rtx op2 = force_reg (quadmode, expand_normal (CALL_EXPR_ARG (exp, 2)));
> +      tree tmp = CALL_EXPR_ARG (exp, 3);
> +      rtx lane_idx = expand_expr (tmp, NULL_RTX, VOIDmode,
> +                                  EXPAND_INITIALIZER);
> +
> +      /* Validate that the lane index is a constant.  */
> +      if (!CONST_INT_P (lane_idx))
> +        {
> +          error ("%Kargument %d must be a constant immediate", exp, 4);
> +          return const0_rtx;
> +        }
> +
> +      /* Validate that the index is within the expected range.  */
> +      int nunits = GET_MODE_NUNITS (quadmode).to_constant ();
> +      aarch64_simd_lane_bounds (lane_idx, 0, nunits / 2, exp);
> +
> +      /* Keep to GCC-vector-extension lane indices in the RTL.  */
> +      lane_idx = aarch64_endian_lane_rtx (quadmode, INTVAL (lane_idx));
> +
> +      /* Generate the correct register and mode.  */
> +      int lane = INTVAL (lane_idx);
> +
> +      if (lane < nunits / 4)
> +        op2 = simplify_gen_subreg (d->mode, op2, quadmode, 0);
> +      else
> +        {
> +          /* Select the upper 64 bits, either a V2SF or V4HF, this however
> +             is quite messy, as the operation required even though simple
> +             doesn't have a simple RTL pattern, and seems it's quite hard to
> +             define using a single RTL pattern.  The target generic version
> +             gen_highpart_mode generates code that isn't optimal.  */
> +          rtx temp1 = gen_reg_rtx (d->mode);
> +          rtx temp2 = gen_reg_rtx (DImode);
> +          temp1 = simplify_gen_subreg (d->mode, op2, quadmode, 0);
> +          temp1 = simplify_gen_subreg (V2DImode, temp1, d->mode, 0);
> +          emit_insn (gen_aarch64_get_lanev2di (temp2, temp1 , const1_rtx));
> +          op2 = simplify_gen_subreg (d->mode, temp2, GET_MODE (temp2), 0);
> +
> +          /* And recalculate the index.  */
> +          lane -= nunits / 4;
> +        }
> +
> +      if (!target)
> +        target = gen_reg_rtx (d->mode);
> +      else
> +        target = force_reg (d->mode, target);
> +
> +      rtx pat = NULL_RTX;
> +
> +      if (d->lane)
> +        pat = GEN_FCN (d->icode) (target, op0, op1, op2,
> +                                  gen_int_mode (lane, SImode));
> +      else
> +        pat = GEN_FCN (d->icode) (target, op0, op1, op2);
> +
> +      if (!pat)
> +        return NULL_RTX;
> +
> +      emit_insn (pat);
> +      return target;
>      }
>
>    if (fcode >= AARCH64_SIMD_BUILTIN_BASE && fcode <= AARCH64_SIMD_BUILTIN_MAX)
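
For concreteness, one possible shape for the requested refactor follows.
This is a sketch only: the helper name aarch64_expand_fcmla_builtin and its
exact signature are assumptions, not something taken from the patch; the body
is simply the code from the case arm above moved into a static helper, with
the comments reworded.

static rtx
aarch64_expand_fcmla_builtin (tree exp, rtx target, int fcode)
{
  int bcode = fcode - AARCH64_SIMD_FCMLA_LANEQ_BUILTIN_BASE - 1;
  aarch64_fcmla_laneq_builtin_datum *d
    = &aarch64_fcmla_lane_builtin_data[bcode];
  machine_mode quadmode = GET_MODE_2XWIDER_MODE (d->mode).require ();
  rtx op0 = force_reg (d->mode, expand_normal (CALL_EXPR_ARG (exp, 0)));
  rtx op1 = force_reg (d->mode, expand_normal (CALL_EXPR_ARG (exp, 1)));
  rtx op2 = force_reg (quadmode, expand_normal (CALL_EXPR_ARG (exp, 2)));
  rtx lane_idx = expand_expr (CALL_EXPR_ARG (exp, 3), NULL_RTX, VOIDmode,
                              EXPAND_INITIALIZER);

  /* The lane argument must be a constant immediate.  */
  if (!CONST_INT_P (lane_idx))
    {
      error ("%Kargument %d must be a constant immediate", exp, 4);
      return const0_rtx;
    }

  /* The index counts complex pairs, so the valid range is half the number
     of elements of the quad-width mode.  */
  int nunits = GET_MODE_NUNITS (quadmode).to_constant ();
  aarch64_simd_lane_bounds (lane_idx, 0, nunits / 2, exp);

  /* Keep to GCC-vector-extension lane indices in the RTL.  */
  lane_idx = aarch64_endian_lane_rtx (quadmode, INTVAL (lane_idx));
  int lane = INTVAL (lane_idx);

  if (lane < nunits / 4)
    /* The selected pair sits in the low 64 bits: take the low subreg.  */
    op2 = simplify_gen_subreg (d->mode, op2, quadmode, 0);
  else
    {
      /* The selected pair sits in the high 64 bits: extract it via
         gen_aarch64_get_lanev2di and re-base the index, as the patch does.  */
      rtx temp1 = gen_reg_rtx (d->mode);
      rtx temp2 = gen_reg_rtx (DImode);
      temp1 = simplify_gen_subreg (d->mode, op2, quadmode, 0);
      temp1 = simplify_gen_subreg (V2DImode, temp1, d->mode, 0);
      emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const1_rtx));
      op2 = simplify_gen_subreg (d->mode, temp2, GET_MODE (temp2), 0);
      lane -= nunits / 4;
    }

  if (!target)
    target = gen_reg_rtx (d->mode);
  else
    target = force_reg (d->mode, target);

  rtx pat;
  if (d->lane)
    pat = GEN_FCN (d->icode) (target, op0, op1, op2,
                              gen_int_mode (lane, SImode));
  else
    pat = GEN_FCN (d->icode) (target, op0, op1, op2);

  if (!pat)
    return NULL_RTX;

  emit_insn (pat);
  return target;
}

The switch arm in aarch64_expand_builtin would then reduce to a single call:

    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ0_V2SF:
    /* ... the remaining FCMLA_LANEQ cases ...  */
    case AARCH64_SIMD_BUILTIN_FCMLA_LANEQ270_V4HF:
      return aarch64_expand_fcmla_builtin (exp, target, fcode);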