Hi Juzhe, > Case 1: > void > f (uint8_t *restrict a, uint8_t *restrict b) > { > for (int i = 0; i < 100; ++i) > { > a[i * 8] = b[i * 8 + 37] + 1; > a[i * 8 + 1] = b[i * 8 + 37] + 2; > a[i * 8 + 2] = b[i * 8 + 37] + 3; > a[i * 8 + 3] = b[i * 8 + 37] + 4; > a[i * 8 + 4] = b[i * 8 + 37] + 5; > a[i * 8 + 5] = b[i * 8 + 37] + 6; > a[i * 8 + 6] = b[i * 8 + 37] + 7; > a[i * 8 + 7] = b[i * 8 + 37] + 8; > } > } > > We need to generate the stepped vector: > NPATTERNS = 8. > { 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8 } > > Before this patch: > vid.v v4 ;; {0,1,2,3,4,5,6,7,...} > vsrl.vi v4,v4,3 ;; {0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,...} > li a3,8 ;; {8} > vmul.vx v4,v4,a3 ;; {0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...} > > After this patch: > vid.v v4 ;; {0,1,2,3,4,5,6,7,...} > vand.vi v4,v4,-8(-NPATTERNS) ;; {0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...}
This is a nice improvement. Even though we're in the SLP realm I would still add an assert that documents that we're indeed operating with pow2_p (NPATTERNS) and some comment as to why we can use AND. Sure we're doing exact_log2 et al later anyway, just to make things clearer. > Before this patch: > li a6,134221824 > slli a6,a6,5 > addi a6,a6,3 ;; 64-bit: 0x0003000200010000 > vmv.v.x v6,a6 ;; {3, 2, 1, 0, ... } > vid.v v4 ;; {0, 1, 2, 3, 4, 5, 6, 7, ... } > vsrl.vi v4,v4,2 ;; {0, 0, 0, 0, 1, 1, 1, 1, ... } > li a3,4 ;; {4} > vmul.vx v4,v4,a3 ;; {0, 0, 0, 0, 4, 4, 4, 4, ... } > vadd.vv v4,v4,v6 ;; {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, > 12, ... } > > After this patch: > li a3,-536875008 > slli a3,a3,4 > addi a3,a3,1 > slli a3,a3,16 > vmv.v.x v2,a3 ;; {3, 1, -1, -3, ... } > vid.v v4 ;; {0, 1, 2, 3, 4, 5, 6, 7, ... } > vadd.vv v4,v4,v2 ;; {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, > 14, 13, 12, ... } My immediate idea would have been to fall back to the first approach, i.e. create the "0x00030002..." constant > li a6,134221824 > slli a6,a6,5 > addi a6,a6,3 ;; 64-bit: 0x0003000200010000 > vmv.v.x v6,a6 ;; {3, 2, 1, 0, ... } and then vid.v v4 vand.vi v4, v4, -4 vadd.vv v4, v4, v6 It's one more vector instruction though so possibly worse from a latency standpoint. Rest looks good to me. Regards Robin