https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492
Hao Liu <hliu at amperecomputing dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hliu at amperecomputing dot com

--- Comment #4 from Hao Liu <hliu at amperecomputing dot com> ---
It seems Richard Biener's patch (r272843) can remove the redundant load/store.
The ChangeLog for r272843 reads as follows:

> 2019-07-01  Richard Biener  <rguent...@suse.de>
>
>         * tree-ssa-sccvn.c (class pass_fre): Add may_iterate
>         pass parameter.
>         (pass_fre::execute): Honor it.
>         * passes.def: Adjust pass_fre invocations to allow iterating,
>         add non-iterating pass_fre before late threading/dom.
>
>         * gcc.dg/tree-ssa/pr77445-2.c: Adjust.

Testing Jiangning's case with "gcc -O3" generates the following code:

test_slp:
.LFB0:
        .cfi_startproc
        adrp    x1, .LC0
        ldr     q0, [x0]
        ldr     q1, [x1, #:lo12:.LC0]
        tbl     v0.16b, {v0.16b}, v1.16b
        uxtl    v1.8h, v0.8b
        uxtl2   v0.8h, v0.16b
        uxtl    v4.4s, v1.4h
        uxtl    v2.4s, v0.4h
        uxtl2   v0.4s, v0.8h
        uxtl2   v1.4s, v1.8h
        dup     s21, v4.s[0]
        dup     s22, v2.s[1]
        dup     s3, v0.s[1]
        dup     s6, v1.s[0]
        dup     s23, v4.s[1]
        dup     s16, v2.s[0]
        add     v3.2s, v3.2s, v22.2s
        dup     s20, v0.s[0]
        dup     s17, v1.s[1]
        dup     s5, v0.s[2]
        fmov    w0, s3
        add     v3.2s, v6.2s, v21.2s
        dup     s19, v2.s[2]
        add     v17.2s, v17.2s, v23.2s
        dup     s7, v4.s[2]
        fmov    w1, s3
        add     v3.2s, v16.2s, v20.2s
        dup     s18, v1.s[2]
        fmov    w3, s17
        dup     s2, v2.s[3]
        fmov    w2, s3
        add     v3.2s, v5.2s, v19.2s
        dup     s0, v0.s[3]
        dup     s4, v4.s[3]
        add     w0, w0, w3
        dup     s1, v1.s[3]
        fmov    w3, s3
        add     v3.2s, v7.2s, v18.2s
        add     v0.2s, v2.2s, v0.2s
        add     w1, w1, w2
        add     w0, w0, w1
        fmov    w2, s3
        add     w3, w3, w2
        fmov    w2, s0
        add     v0.2s, v1.2s, v4.2s
        add     w0, w0, w3
        fmov    w1, s0
        add     w1, w2, w1
        add     w0, w0, w1
        ret

Although SLP still generates SIMD code, it looks much better than the previous
code with memory loads/stores. Performance is expected to be better, since the
redundant loads/stores are gone.
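
For readers without the testcase at hand, the kind of source pattern under discussion can be sketched roughly as below. This is an illustration under assumptions, not the testcase attached to the PR: the function name `sum16_via_temp` and the exact loop shape are hypothetical. It widens 16 bytes into a local 4x4 temporary and then sums the temporary column by column, so a compiler that keeps `tmp` in memory has to store it and load it straight back, which is the sort of redundant load/store an iterating FRE pass could plausibly forward.

```c
/* Hypothetical kernel for illustration; NOT the testcase from the PR. */
int sum16_via_temp(const unsigned char *b)
{
    unsigned int tmp[4][4];
    int sum = 0;

    /* Widen 16 input bytes into a 4x4 temporary.  Taken literally,
       each assignment is a store to the stack array.  */
    for (int i = 0; i < 4; i++, b += 4) {
        tmp[i][0] = b[0];
        tmp[i][1] = b[1];
        tmp[i][2] = b[2];
        tmp[i][3] = b[3];
    }

    /* Read the temporary back column by column and accumulate.
       Taken literally, each read is a load of a value just stored.  */
    for (int i = 0; i < 4; i++)
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];

    return sum;
}
```

Compiling such a function with "gcc -O3 -S" before and after a given revision is one way to compare the generated code; the output will depend on the target and compiler version.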