Hi All,
I'm trying to implement DImode shifts using ARM NEON instructions. This
wouldn't be difficult in itself, but making it play nice with the
existing implementation is causing me problems. I'd like a few
suggestions/pointers/comments to help me get this right, please.
The existing shift mechanisms must be kept, partly because the NEON unit
is optional, and partly because it does not permit the full range of
DImode operations, so sometimes it's more efficient to do 64-bit
operations in core registers rather than copy all the values over to
NEON, do the operation, and move the result back. Which set of patterns
is used is determined by the register allocator and its costs mechanism.
Because the decision is made so late, the patterns can only be split by
the post-reload splitter, and so cannot rely on many of the usual passes
to clean up inefficiencies. In particular, the lack of a combine pass
after this point makes it hard to detect and optimize extend-and-copy
sequences.
So, I've attached two patches. The first is neon-shifts.patch, and does
most of the work. The second is extendsidi2_neon.patch, and is intended
to aid moving the shift amount from SImode registers, but doesn't go as
far as I'd like.
I've not actually tested any of the output code just yet, so there may
be logic errors, but those are easily fixed later, and what I'm trying
to get right here is the GCC machine description.
Given this testcase:
void
f (long long *a, int b)
{
  *a = *a << b;
}
Without any patches, GCC gives this output, using only ARM core
registers (in thumb2 mode):
f:
        ldr     r2, [r0, #0]
        ldr     r3, [r0, #4]
        push    {r4, r5, r6}
        rsb     r6, r1, #32
        sub     r4, r1, #32
        lsrs    r6, r2, r6
        lsls    r5, r2, r4
        lsls    r3, r3, r1
        lsls    r1, r2, r1
        orrs    r3, r3, r6
        str     r1, [r0, #0]
        ands    r4, r3, r4, asr #32
        it      cc
        movcc   r4, r5
        str     r4, [r0, #4]
        pop     {r4, r5, r6}
        bx      lr
With just neon-shifts.patch, we get this output, now with NEON shifts:
f:
        fldd    d17, [r0, #0]  @ int
        mov     r2, r1
        movs    r3, #0
        push    {r4, r5}
        fmdrr   d18, r2, r3    @ int
        vshl.i64  d16, d17, d18
        fstd    d16, [r0, #0]  @ int
        pop     {r4, r5}
        bx      lr
As you can see, the shift is much improved, but the shift amount is
first extended into two SImode registers, and then moved to a NEON
DImode register, which increases core-register pressure unnecessarily.
With both patches, we now get this:
f:
        fldd    d17, [r0, #0]  @ int
        vdup.32 d16, r1
        vshr.u64  d16, d16, #32   <-- still unnecessary
        vshl.i64  d16, d17, d16
        fstd    d16, [r0, #0]  @ int
        bx      lr
Now the value is copied and then extended. I have chosen to use vdup.32
instead of vmov.i32 because the latter can only target half the DImode
registers. The right shift is necessary for a general zero-extend, but
is not useful in this case as only the bottom 8 bits are interesting,
and vdup has already done the right thing.
Note that the examples I've given are for left shifts. Right shifts are
also implemented, but are a little more complicated (in the
shift-by-register case) because the shift must be implemented as a left
shift by a negative amount, so an unspec is used to prevent the
compiler from doing anything 'clever'. Apart from an extra negation, the
end result is much the same, but the patterns look different.
All this is a nice improvement, but I'm not happy:
1. The post-reload split means that I've had to add a clobber for CC to
all the patterns, even though only some of them really need it. I think
I've convinced myself that this is ok because it doesn't matter before
scheduling, and after splitting the clobbers are only retained if
they're really needed, but it still feels wrong.
2. The extend optimization is fine for general-case extends, but it
could be improved for the shift-amount case because we actually only
need the bottom 8 bits, as indicated above. The problem is that there's
no obvious way to achieve this:
- there's no combine pass after this point, so a pattern that
recognises and re-splits the extend, move and shift can't be used.
- I don't believe there can be a pattern that uses SImode for the
shift amount because the value needs to be in a DImode register
eventually, and that means one needs to have been allocated before it
gets split, and that means the extend needs to be separate.
3. The type of the shift-amount is determined by the type used in the
ashldi3 pattern, and that uses SImode. This is fine for values already
in SImode registers (probably the common case), but means that values
already in DImode registers will have to get truncated and then
re-extended, and this is not an operation that can generally be
optimized away once introduced.
- I've considered using a DImode shift-amount for the ashldi3
pattern (see the sketch after this list), and that would solve this
problem - extend and truncate *can* be optimized away - but since it
doesn't get split until post-reload, the register allocator would
already have allocated two SImode registers before we have any chance
to make the extension go away.
4. I'm not sure, but I think the general-case shift in core registers is
sufficiently long-winded that it might be worthwhile discarding that
option completely (i.e. it might be cheaper to just always use NEON
shifts, when NEON is available, of course). I'd keep the
shift-by-constant-amount variants, though. Does anybody have any
comments on that?
5. The left and right shift patterns couldn't be unified because I
couldn't find a way to do match_operand with unspecs, and anyway, the
patterns are a slightly different shape.
6. Same with the logical and arithmetic right shifts; I couldn't find a
way to unify those patterns either, even though the only difference is
the unspec index number.
Any help would be appreciated. I've probably implemented this backwards,
or something ...
Thanks a lot
Andrew
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -3441,7 +3441,13 @@
(match_operand:SI 2 "reg_or_int_operand" "")))]
"TARGET_32BIT"
"
- if (GET_CODE (operands[2]) == CONST_INT)
+ if (TARGET_NEON)
+ {
+ rtx reg = convert_to_mode (DImode, operands[2], 1);
+ emit_insn (gen_ashldi3_neon (operands[0], operands[1], reg));
+ DONE;
+ }
+ else if (GET_CODE (operands[2]) == CONST_INT)
{
if ((HOST_WIDE_INT) INTVAL (operands[2]) == 1)
{
@@ -3460,8 +3466,8 @@
)
(define_insn "arm_ashldi3_1bit"
- [(set (match_operand:DI 0 "s_register_operand" "=r,&r")
- (ashift:DI (match_operand:DI 1 "s_register_operand" "0,r")
+ [(set (match_operand:DI 0 "arm_general_register_operand" "=r,&r")
+ (ashift:DI (match_operand:DI 1 "arm_general_register_operand" "0,r")
(const_int 1)))
(clobber (reg:CC CC_REGNUM))]
"TARGET_32BIT"
@@ -3500,7 +3506,13 @@
(match_operand:SI 2 "reg_or_int_operand" "")))]
"TARGET_32BIT"
"
- if (GET_CODE (operands[2]) == CONST_INT)
+ if (TARGET_NEON)
+ {
+ rtx reg = convert_to_mode (DImode, operands[2], 1);
+ emit_insn (gen_ashrdi3_neon (operands[0], operands[1], reg));
+ DONE;
+ }
+ else if (GET_CODE (operands[2]) == CONST_INT)
{
if ((HOST_WIDE_INT) INTVAL (operands[2]) == 1)
{
@@ -3557,7 +3569,13 @@
(match_operand:SI 2 "reg_or_int_operand" "")))]
"TARGET_32BIT"
"
- if (GET_CODE (operands[2]) == CONST_INT)
+ if (TARGET_NEON)
+ {
+ rtx reg = convert_to_mode (DImode, operands[2], 1);
+ emit_insn (gen_lshrdi3_neon (operands[0], operands[1], reg));
+ DONE;
+ }
+ else if (GET_CODE (operands[2]) == CONST_INT)
{
if ((HOST_WIDE_INT) INTVAL (operands[2]) == 1)
{
--- a/gcc/config/arm/constraints.md
+++ b/gcc/config/arm/constraints.md
@@ -29,7 +29,7 @@
;; in Thumb-1 state: I, J, K, L, M, N, O
;; The following multi-letter normal constraints have been used:
-;; in ARM/Thumb-2 state: Da, Db, Dc, Dn, Dl, DL, Dv, Dy, Di, Dz
+;; in ARM/Thumb-2 state: Da, Db, Dc, Dn, Dl, DL, Dv, Dy, Di, Dz, Pe
;; in Thumb-1 state: Pa, Pb, Pc, Pd
;; in Thumb-2 state: Pj, PJ, Ps, Pt, Pu, Pv, Pw, Px, Py
@@ -172,6 +172,11 @@
(and (match_code "const_int")
(match_test "TARGET_THUMB1 && ival >= 0 && ival <= 7")))
+(define_constraint "Pe"
+ "@internal In ARM/Thumb-2 state, a constant in the range 0 to 63"
+ (and (match_code "const_int")
+ (match_test "TARGET_32BIT && ival >= 0 && ival < 64")))
+
(define_constraint "Ps"
"@internal In Thumb-2 state a constant in the range -255 to +255"
(and (match_code "const_int")
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -1090,6 +1090,279 @@
DONE;
})
+;; 64-bit shifts
+
+(define_insn "ashldi3_neon"
+ [(set (match_operand:DI 0 "s_register_operand" "=w, w,?&r,?&r,?w,?w")
+ (ashift:DI (match_operand:DI 1 "s_register_operand" " w, w, r, r, w, w")
+ (match_operand:DI 2 "shift_amount_64" " w,Pe, r, Pe, w,Pe")))
+ (clobber (reg:CC CC_REGNUM))]
+ "TARGET_NEON"
+ "@
+ vshl.u64\t%P0, %P1, %P2
+ vshl.u64\t%P0, %P1, %2
+ #
+ #
+ vshl.u64\t%P0, %P1, %P2
+ vshl.u64\t%P0, %P1, %2"
+ [(set_attr "neon_type" "neon_vshl_ddd,neon_vshl_ddd,*,*,neon_vshl_ddd,neon_vshl_ddd")
+ (set_attr "length" "*,*,28,12,*,*")
+ (set_attr "arch" "nota8,nota8,*,*,onlya8,onlya8")]
+)
+
+;; Splitter for 64-bit shifts in core-regs.
+;; Register operands only; constant shift amounts are handled below.
+(define_split
+ [(set (match_operand:DI 0 "s_register_operand" "")
+ (ashift:DI (match_operand:DI 1 "s_register_operand" "")
+ (match_operand:DI 2 "s_register_operand" "")))
+ (clobber (reg:CC CC_REGNUM))]
+ "TARGET_NEON && reload_completed && !(IS_VFP_REGNUM (REGNO (operands[0])))"
+ [(set (match_dup 5) (ashift:SI (match_dup 7) (match_dup 8)))
+ (parallel
+ [(set (reg:CC_NOOV CC_REGNUM) (compare:CC_NOOV (minus:SI (const_int 32) (match_dup 8)) (const_int 0)))
+ (set (match_dup 4) (minus:SI (const_int 32) (match_dup 8)))])
+ (cond_exec (ge:CC (reg:CC CC_REGNUM) (const_int 0))
+ (set (match_dup 4) (lshiftrt:SI (match_dup 6) (match_dup 4))))
+ (cond_exec (lt:CC (reg:CC CC_REGNUM) (const_int 0))
+ (set (match_dup 4) (neg:SI (match_dup 4))))
+ (cond_exec (lt:CC (reg:CC CC_REGNUM) (const_int 0))
+ (set (match_dup 4) (ashift:SI (match_dup 6) (match_dup 4))))
+ (set (match_dup 5) (ior:SI (match_dup 5) (match_dup 4)))
+ (set (match_dup 4) (ashift:SI (match_dup 6) (match_dup 8)))]
+ "
+ {
+ operands[4] = gen_lowpart (SImode, operands[0]);
+ operands[5] = gen_highpart (SImode, operands[0]);
+ operands[6] = gen_lowpart (SImode, operands[1]);
+ operands[7] = gen_highpart (SImode, operands[1]);
+ operands[8] = gen_lowpart (SImode, operands[2]);
+ }")
+
+(define_insn "ashrdi3_neon_imm"
+ [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?w")
+ (ashiftrt:DI (match_operand:DI 1 "s_register_operand" " w, r, w")
+ (match_operand:DI 2 "int_0_to_63" "Pe, Pe,Pe")))
+ (clobber (reg:CC CC_REGNUM))]
+ "TARGET_NEON"
+ "@
+ vshr.s64\t%P0, %P1, %2
+ #
+ vshr.s64\t%P0, %P1, %2"
+ [(set_attr "neon_type" "neon_vshl_ddd,*,neon_vshl_ddd")
+ (set_attr "length" "*,12,*")
+ (set_attr "arch" "nota8,*,onlya8")]
+)
+
+(define_insn_and_split "ashrdi3_neon_reg"
+ [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?w")
+ (unspec:DI [(match_operand:DI 1 "s_register_operand" " w, r, w")
+ (match_operand:DI 2 "s_register_operand" " w, r, w")]
+ UNSPEC_ASHIFT_SIGNED))
+ (clobber (reg:CC CC_REGNUM))]
+ "TARGET_NEON"
+ "@
+ vshl.s64\t%P0, %P1, %P2
+ #
+ vshl.s64\t%P0, %P1, %P2"
+ "TARGET_NEON && reload_completed && !(IS_VFP_REGNUM (REGNO (operands[0])))"
+ [(set (match_dup 5) (lshiftrt:SI (match_dup 7) (match_dup 8)))
+ (parallel
+ [(set (reg:CC_NOOV CC_REGNUM) (compare:CC_NOOV (minus:SI (const_int 32) (match_dup 8)) (const_int 0)))
+ (set (match_dup 4) (minus:SI (const_int 32) (match_dup 8)))])
+ (cond_exec (ge:CC (reg:CC CC_REGNUM) (const_int 0))
+ (set (match_dup 4) (ashift:SI (match_dup 6) (match_dup 4))))
+ (cond_exec (lt:CC (reg:CC CC_REGNUM) (const_int 0))
+ (set (match_dup 4) (neg:SI (match_dup 4))))
+ (cond_exec (lt:CC (reg:CC CC_REGNUM) (const_int 0))
+ (set (match_dup 4) (ashiftrt:SI (match_dup 6) (match_dup 4))))
+ (set (match_dup 5) (ior:SI (match_dup 5) (match_dup 4)))
+ (set (match_dup 4) (ashiftrt:SI (match_dup 6) (match_dup 8)))]
+ "
+ {
+ operands[4] = gen_highpart (SImode, operands[0]);
+ operands[5] = gen_lowpart (SImode, operands[0]);
+ operands[6] = gen_highpart (SImode, operands[1]);
+ operands[7] = gen_lowpart (SImode, operands[1]);
+ operands[8] = gen_lowpart (SImode, operands[2]);
+ }"
+ [(set_attr "neon_type" "neon_vshl_ddd,*,neon_vshl_ddd")
+ (set_attr "length" "*,28,*")
+ (set_attr "arch" "nota8,*,onlya8")]
+)
+
+
+(define_expand "ashrdi3_neon"
+ [(match_operand:DI 0 "s_register_operand" "")
+ (match_operand:DI 1 "s_register_operand" "")
+ (match_operand:DI 2 "shift_amount_64" "")]
+ "TARGET_NEON"
+{
+ rtx neg = gen_reg_rtx (DImode);
+ if (REG_P (operands[2]))
+ {
+ emit_insn (gen_negdi2 (neg, operands[2]));
+ emit_insn (gen_ashrdi3_neon_reg (operands[0], operands[1], neg));
+ }
+ else
+ emit_insn (gen_ashrdi3_neon_imm (operands[0], operands[1], operands[2]));
+ DONE;
+})
+
+(define_insn "lshrdi3_neon_imm"
+ [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?w")
+ (lshiftrt:DI (match_operand:DI 1 "s_register_operand" " w, r, w")
+ (match_operand:DI 2 "int_0_to_63" "Pe, Pe,Pe")))
+ (clobber (reg:CC CC_REGNUM))]
+ "TARGET_NEON"
+ "@
+ vshr.u64\t%P0, %P1, %2
+ #
+ vshr.u64\t%P0, %P1, %2"
+ [(set_attr "neon_type" "neon_vshl_ddd,*,neon_vshl_ddd")
+ (set_attr "length" "*,12,*")
+ (set_attr "arch" "nota8,*,onlya8")]
+)
+
+(define_insn_and_split "lshrdi3_neon_reg"
+ [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?w")
+ (unspec:DI [(match_operand:DI 1 "s_register_operand" " w, r, w")
+ (match_operand:DI 2 "s_register_operand" " w, r, w")]
+ UNSPEC_ASHIFT_UNSIGNED))
+ (clobber (reg:CC CC_REGNUM))]
+ "TARGET_NEON"
+ "@
+ vshl.u64\t%P0, %P1, %P2
+ #
+ vshl.u64\t%P0, %P1, %P2"
+ "TARGET_NEON && reload_completed && !(IS_VFP_REGNUM (REGNO (operands[0])))"
+ [(set (match_dup 5) (lshiftrt:SI (match_dup 7) (match_dup 8)))
+ (parallel
+ [(set (reg:CC_NOOV CC_REGNUM) (compare:CC_NOOV (minus:SI (const_int 32) (match_dup 8)) (const_int 0)))
+ (set (match_dup 4) (minus:SI (const_int 32) (match_dup 8)))])
+ (cond_exec (ge:CC (reg:CC CC_REGNUM) (const_int 0))
+ (set (match_dup 4) (ashift:SI (match_dup 6) (match_dup 4))))
+ (cond_exec (lt:CC (reg:CC CC_REGNUM) (const_int 0))
+ (set (match_dup 4) (neg:SI (match_dup 4))))
+ (cond_exec (lt:CC (reg:CC CC_REGNUM) (const_int 0))
+ (set (match_dup 4) (lshiftrt:SI (match_dup 6) (match_dup 4))))
+ (set (match_dup 5) (ior:SI (match_dup 5) (match_dup 4)))
+ (set (match_dup 4) (lshiftrt:SI (match_dup 6) (match_dup 8)))]
+ "
+ {
+ operands[4] = gen_highpart (SImode, operands[0]);
+ operands[5] = gen_lowpart (SImode, operands[0]);
+ operands[6] = gen_highpart (SImode, operands[1]);
+ operands[7] = gen_lowpart (SImode, operands[1]);
+ operands[8] = gen_lowpart (SImode, operands[2]);
+ }"
+ [(set_attr "neon_type" "neon_vshl_ddd,*,neon_vshl_ddd")
+ (set_attr "length" "*,28,*")
+ (set_attr "arch" "nota8,*,onlya8")]
+)
+
+(define_expand "lshrdi3_neon"
+ [(match_operand:DI 0 "s_register_operand" "")
+ (match_operand:DI 1 "s_register_operand" "")
+ (match_operand:DI 2 "shift_amount_64" "")]
+ "TARGET_NEON"
+{
+ rtx neg = gen_reg_rtx (DImode);
+ if (REG_P (operands[2]))
+ {
+ emit_insn (gen_negdi2 (neg, operands[2]));
+ emit_insn (gen_lshrdi3_neon_reg (operands[0], operands[1], neg));
+ }
+ else
+ emit_insn (gen_lshrdi3_neon_imm (operands[0], operands[1], operands[2]));
+ DONE;
+})
+
+;; Split all kinds of constant 64-bit shift, up to 31 bits
+(define_split
+ [(set (match_operand:DI 0 "s_register_operand" "")
+ (match_operator:DI 3 "neon_shift_operator"
+ [(match_operand:DI 1 "s_register_operand" "")
+ (match_operand:DI 2 "int_0_to_31" "")]))
+ (clobber (reg:CC CC_REGNUM))]
+ "TARGET_NEON && reload_completed && !(IS_VFP_REGNUM (REGNO (operands[0])))"
+ [(set (match_dup 4) (match_op_dup 9 [(match_dup 6) (match_dup 2)]))
+ (set (match_dup 4) (ior:SI (match_op_dup 10 [(match_dup 7) (match_dup 8)]) (match_dup 4)))
+ (set (match_dup 5) (match_op_dup 3 [(match_dup 7) (match_dup 2)]))]
+ "
+ {
+ enum rtx_code firstshift;
+ enum rtx_code reverseshift;
+ enum rtx_code lastshift = GET_CODE (operands[3]);
+
+ /* There are patterns in arm.md for 1-bit shifts. */
+ if (INTVAL (operands[2]) == 1)
+ FAIL;
+
+ switch (lastshift)
+ {
+ case ASHIFT:
+ operands[4] = gen_highpart (SImode, operands[0]);
+ operands[5] = gen_lowpart (SImode, operands[0]);
+ operands[6] = gen_highpart (SImode, operands[1]);
+ operands[7] = gen_lowpart (SImode, operands[1]);
+ firstshift = ASHIFT;
+ reverseshift = LSHIFTRT;
+ break;
+ case ASHIFTRT:
+ case LSHIFTRT:
+ operands[4] = gen_lowpart (SImode, operands[0]);
+ operands[5] = gen_highpart (SImode, operands[0]);
+ operands[6] = gen_lowpart (SImode, operands[1]);
+ operands[7] = gen_highpart (SImode, operands[1]);
+ firstshift = LSHIFTRT;
+ reverseshift = ASHIFT;
+ break;
+ default:
+ gcc_unreachable ();
+ }
+
+ operands[8] = gen_rtx_CONST_INT (VOIDmode, 32 - INTVAL (operands[2]));
+ operands[9] = gen_rtx_fmt_ee (firstshift, SImode, const0_rtx, const0_rtx);
+ operands[10] = gen_rtx_fmt_ee (reverseshift, SImode, const0_rtx, const0_rtx);
+ operands[3] = gen_rtx_fmt_ee (lastshift, SImode, const0_rtx, const0_rtx);
+ }")
+
+;; Split all kinds of constant 64-bit shift, over 31 bits
+(define_split
+ [(set (match_operand:DI 0 "s_register_operand" "")
+ (match_operator:DI 3 "neon_shift_operator"
+ [(match_operand:DI 1 "s_register_operand" "")
+ (match_operand:DI 2 "int_32_to_63" "")]))
+ (clobber (reg:CC CC_REGNUM))]
+ "TARGET_NEON && reload_completed && !(IS_VFP_REGNUM (REGNO (operands[0])))"
+ [(set (match_dup 4) (match_op_dup 3 [(match_dup 6) (match_dup 7)]))
+ (set (match_dup 5) (const_int 0))]
+ "
+ {
+ enum rtx_code code = GET_CODE (operands[3]);
+ operands[3] = gen_rtx_fmt_ee (code, SImode, const0_rtx, const0_rtx);
+
+ switch (code)
+ {
+ case ASHIFT:
+ operands[4] = gen_highpart (SImode, operands[0]);
+ operands[5] = gen_lowpart (SImode, operands[0]);
+ operands[6] = gen_lowpart (SImode, operands[1]);
+ operands[7] = gen_rtx_CONST_INT (VOIDmode, INTVAL (operands[2]) - 32);
+ break;
+ case ASHIFTRT:
+ case LSHIFTRT:
+ operands[4] = gen_lowpart (SImode, operands[0]);
+ operands[5] = gen_highpart (SImode, operands[0]);
+ operands[6] = gen_highpart (SImode, operands[1]);
+ operands[7] = gen_rtx_CONST_INT (VOIDmode, INTVAL (operands[2]) - 32);
+ break;
+ default:
+ gcc_unreachable ();
+ }
+ }")
+
;; Widening operations
(define_insn "widen_ssum<mode>3"
--- a/gcc/config/arm/predicates.md
+++ b/gcc/config/arm/predicates.md
@@ -248,6 +248,12 @@
&& ((unsigned HOST_WIDE_INT) INTVAL (XEXP (op, 1)) <= 32)")
(match_test "mode == GET_MODE (op)")))
+;; NEON 64-bit shifts are a little more limited.
+;; This is only used for constant shifts anyway.
+(define_special_predicate "neon_shift_operator"
+ (and (match_code "ashift,ashiftrt,lshiftrt")
+ (match_test "mode == GET_MODE (op)")))
+
;; True for MULT, to identify which variant of shift_operator is in use.
(define_special_predicate "mult_operator"
(match_code "mult"))
@@ -764,3 +770,19 @@
(define_special_predicate "add_operator"
(match_code "plus"))
+
+(define_predicate "int_0_to_63"
+ (and (match_code "const_int")
+ (match_test "IN_RANGE (INTVAL (op), 0, 63)")))
+
+(define_predicate "int_0_to_31"
+ (and (match_code "const_int")
+ (match_test "IN_RANGE (INTVAL (op), 0, 31)")))
+
+(define_predicate "int_32_to_63"
+ (and (match_code "const_int")
+ (match_test "IN_RANGE (INTVAL (op), 32, 63)")))
+
+(define_predicate "shift_amount_64"
+ (ior (match_operand 0 "s_register_operand")
+ (match_operand 0 "int_0_to_63")))
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -4403,33 +4403,35 @@
;; Zero and sign extension instructions.
(define_insn "zero_extend<mode>di2"
- [(set (match_operand:DI 0 "s_register_operand" "=r")
+ [(set (match_operand:DI 0 "s_register_operand" "=w, r")
(zero_extend:DI (match_operand:QHSI 1 "<qhs_zextenddi_op>"
"<qhs_zextenddi_cstr>")))]
"TARGET_32BIT <qhs_zextenddi_cond>"
"#"
- [(set_attr "length" "8")
- (set_attr "ce_count" "2")
- (set_attr "predicable" "yes")]
+ [(set_attr "length" "8,8")
+ (set_attr "ce_count" "2,2")
+ (set_attr "predicable" "yes,yes")]
)
(define_insn "extend<mode>di2"
- [(set (match_operand:DI 0 "s_register_operand" "=r")
+ [(set (match_operand:DI 0 "s_register_operand" "=w,r")
(sign_extend:DI (match_operand:QHSI 1 "<qhs_extenddi_op>"
"<qhs_extenddi_cstr>")))]
"TARGET_32BIT <qhs_sextenddi_cond>"
"#"
- [(set_attr "length" "8")
- (set_attr "ce_count" "2")
- (set_attr "shift" "1")
- (set_attr "predicable" "yes")]
+ [(set_attr "length" "8,8")
+ (set_attr "ce_count" "2,2")
+ (set_attr "shift" "1,1")
+ (set_attr "predicable" "yes,yes")]
)
;; Splits for all extensions to DImode
(define_split
[(set (match_operand:DI 0 "s_register_operand" "")
(zero_extend:DI (match_operand 1 "nonimmediate_operand" "")))]
- "TARGET_32BIT"
+ "TARGET_32BIT && (!TARGET_NEON
+ || (reload_completed
+ && !(IS_VFP_REGNUM (REGNO (operands[0])))))"
[(set (match_dup 0) (match_dup 1))]
{
rtx lo_part = gen_lowpart (SImode, operands[0]);
@@ -4455,7 +4457,9 @@
(define_split
[(set (match_operand:DI 0 "s_register_operand" "")
(sign_extend:DI (match_operand 1 "nonimmediate_operand" "")))]
- "TARGET_32BIT"
+ "TARGET_32BIT && (!TARGET_NEON
+ || (reload_completed
+ && !(IS_VFP_REGNUM (REGNO (operands[0])))))"
[(set (match_dup 0) (ashiftrt:SI (match_dup 1) (const_int 31)))]
{
rtx lo_part = gen_lowpart (SImode, operands[0]);
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -405,8 +405,8 @@
(define_mode_attr qhs_extenddi_op [(SI "s_register_operand")
(HI "nonimmediate_operand")
(QI "arm_reg_or_extendqisi_mem_op")])
-(define_mode_attr qhs_extenddi_cstr [(SI "r") (HI "rm") (QI "rUq")])
-(define_mode_attr qhs_zextenddi_cstr [(SI "r") (HI "rm") (QI "rm")])
+(define_mode_attr qhs_extenddi_cstr [(SI "r,r") (HI "r,rm") (QI "r,rUq")])
+(define_mode_attr qhs_zextenddi_cstr [(SI "r,r") (HI "r,rm") (QI "r,rm")])
;; Mode attributes used for fixed-point support.
(define_mode_attr qaddsub_suf [(V4UQQ "8") (V2UHQ "16") (UQQ "8") (UHQ "16")
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -5818,3 +5818,29 @@
(const_string "neon_fp_vadd_qqq_vabs_qq"))
(const_string "neon_int_5")))]
)
+
+;; Copy from core-to-neon regs, then extend, not vice-versa
+
+(define_split
+ [(set (match_operand:DI 0 "s_register_operand" "")
+ (sign_extend:DI (match_operand:SI 1 "s_register_operand" "")))]
+ "TARGET_NEON && reload_completed && IS_VFP_REGNUM (REGNO (operands[0]))"
+ [(set (match_dup 2) (vec_duplicate:V2SI (match_dup 1)))
+ (parallel [(set (match_dup 0) (ashiftrt:DI (match_dup 0) (const_int 32)))
+ (clobber (reg:CC CC_REGNUM))])]
+ "
+ {
+ operands[2] = gen_rtx_REG (V2SImode, REGNO (operands[0]));
+ }")
+
+(define_split
+ [(set (match_operand:DI 0 "s_register_operand" "")
+ (zero_extend:DI (match_operand:SI 1 "s_register_operand" "")))]
+ "TARGET_NEON && reload_completed && IS_VFP_REGNUM (REGNO (operands[0]))"
+ [(set (match_dup 2) (vec_duplicate:V2SI (match_dup 1)))
+ (parallel [(set (match_dup 0) (lshiftrt:DI (match_dup 0) (const_int 32)))
+ (clobber (reg:CC CC_REGNUM))])]
+ "
+ {
+ operands[2] = gen_rtx_REG (V2SImode, REGNO (operands[0]));
+ }")