[Bug target/97875] suboptimal loop vectorization

clyon at gcc dot gnu.org via Gcc-bugs Wed, 09 Dec 2020 08:59:28 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875


--- Comment #5 from Christophe Lyon <clyon at gcc dot gnu.org> ---
Interestingly, if I make arm_builtin_support_vector_misalignment() behave the
same for MVE and Neon, the generated code (with __restrict__) becomes:
test_vsub_i32:
        @ args = 0, pretend = 0, frame = 16
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        push    {r4, r5, r6, r7, r8, r9, r10, fp}       @ 61    [c=8 l=4]  *push_multi
        ldrd    r10, fp, [r1, #8]       @ 75    [c=8 l=4]  *thumb2_ldrd
        ldrd    r6, r7, [r2, #8]        @ 76    [c=8 l=4]  *thumb2_ldrd
        ldr     r4, [r2]        @ 14    [c=12 l=4]  *thumb2_movsi_vfp/5
        ldr     r8, [r1]        @ 9     [c=12 l=4]  *thumb2_movsi_vfp/6
        ldr     r9, [r1, #4]    @ 10    [c=12 l=4]  *thumb2_movsi_vfp/6
        ldr     r5, [r2, #4]    @ 15    [c=12 l=4]  *thumb2_movsi_vfp/5
        vmov    d6, r8, r9  @ v4si      @ 35    [c=4 l=8]  *mve_movv4si/1
        vmov    d7, r10, fp
        vmov    d4, r4, r5  @ v4si      @ 36    [c=4 l=8]  *mve_movv4si/1
        vmov    d5, r6, r7
        sub     sp, sp, #16     @ 62    [c=4 l=4]  *arm_addsi3/11
        mov     r3, sp  @ 37    [c=4 l=2]  *thumb2_movsi_vfp/0
        vsub.i32        q3, q3, q2      @ 18    [c=80 l=4]  mve_vsubqv4si
        vstrw.32        q3, [r3]        @ 34    [c=4 l=4]  *mve_movv4si/7
        ldrd    r4, r1, [sp]    @ 77    [c=8 l=4]  *thumb2_ldrd_base
        ldrd    r2, r3, [sp, #8]        @ 78    [c=8 l=4]  *thumb2_ldrd
        strd    r4, r1, [r0]    @ 79    [c=8 l=4]  *thumb2_strd_base
        strd    r2, r3, [r0, #8]        @ 80    [c=8 l=4]  *thumb2_strd
        add     sp, sp, #16     @ 66    [c=4 l=4]  *arm_addsi3/5
        @ sp needed     @ 67    [c=8 l=0]  force_register_use
        pop     {r4, r5, r6, r7, r8, r9, r10, fp}       @ 68    [c=8 l=4]  *load_multiple_with_writeback
        bx      lr      @ 69    [c=8 l=4]  *thumb2_return


The Neon version has:
        vld1.32 {q8}, [r1]      @ 8     [c=8 l=4]  *movmisalignv4si_neon_load
        vld1.32 {q9}, [r2]      @ 9     [c=8 l=4]  *movmisalignv4si_neon_load
        vsub.i32        q8, q8, q9      @ 10    [c=80 l=4]  *subv4si3_neon
        vst1.32 {q8}, [r0]      @ 11    [c=8 l=4]  *movmisalignv4si_neon_store
        bx      lr      @ 21    [c=8 l=4]  *thumb2_return

So it seems MVE needs a movmisalign pattern, like the
movmisalignv4si_neon_load/store patterns that Neon uses above.
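
For readers without the test case at hand, both dumps are consistent with a
four-element subtraction loop of roughly the following shape. This is only a
sketch reconstructed from the function name, the register usage (r0 as the
destination, r1 and r2 as the sources) and the v4si vsub.i32; the parameter
names and the loop bound are assumptions, not copied from the bug report.

```c
#include <stdint.h>

/* Hypothetical reconstruction: dest[i] = a[i] - b[i] for 4 x int32_t.
   The pointers only guarantee int32_t alignment, not 16-byte vector
   alignment, which is why the vectorizer asks the target whether
   misaligned vector accesses are supported. */
void test_vsub_i32(int32_t *__restrict__ dest,
                   int32_t *__restrict__ a,
                   int32_t *__restrict__ b)
{
    for (int i = 0; i < 4; i++) {
        dest[i] = a[i] - b[i];
    }
}
```

With such a loop, the Neon output maps each array directly onto one
misaligned vector load or store, whereas the MVE output assembles the vectors
from core registers and stores the result through a stack slot.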
