Hi Juzhe,

> Case 1:
> void
> f (uint8_t *restrict a, uint8_t *restrict b)
> {
>   for (int i = 0; i < 100; ++i)
>     {
>       a[i * 8] = b[i * 8 + 37] + 1;
>       a[i * 8 + 1] = b[i * 8 + 37] + 2;
>       a[i * 8 + 2] = b[i * 8 + 37] + 3;
>       a[i * 8 + 3] = b[i * 8 + 37] + 4;
>       a[i * 8 + 4] = b[i * 8 + 37] + 5;
>       a[i * 8 + 5] = b[i * 8 + 37] + 6;
>       a[i * 8 + 6] = b[i * 8 + 37] + 7;
>       a[i * 8 + 7] = b[i * 8 + 37] + 8;
>     }
> }
> 
> We need to generate the stepped vector:
> NPATTERNS = 8.
> { 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8 }
> 
> Before this patch:
> vid.v    v4         ;; {0,1,2,3,4,5,6,7,...}
> vsrl.vi  v4,v4,3    ;; {0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,...}
> li       a3,8       ;; {8}
> vmul.vx  v4,v4,a3   ;; {0,0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...}
> 
> After this patch:
> vid.v    v4                    ;; {0,1,2,3,4,5,6,7,...}
> vand.vi  v4,v4,-8(-NPATTERNS)  ;; {0,0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...}

This is a nice improvement.  Even though we're in the SLP realm, I would
still add an assert documenting that NPATTERNS is indeed a power of two
(pow2_p (NPATTERNS)), plus a comment explaining why we can use AND.
Sure, we're doing exact_log2 et al. later anyway, so this is just to make
things clearer.
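
To illustrate the kind of comment meant here, a standalone C sketch (not
GCC code; is_pow2, step_shift_mul and step_and are made-up names) of why
the AND is equivalent to the shift-and-multiply only when NPATTERNS is a
power of two: -NPATTERNS then has exactly the low log2 (NPATTERNS) bits
clear, so the AND rounds each index down to a multiple of NPATTERNS.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a GCC function: nonzero iff X is a power of 2.  */
static int
is_pow2 (uint32_t x)
{
  return x != 0 && (x & (x - 1)) == 0;
}

/* Old sequence per element: vsrl.vi by log2 (npatterns), then vmul.vx.  */
static uint32_t
step_shift_mul (uint32_t i, uint32_t npatterns)
{
  assert (is_pow2 (npatterns));
  uint32_t log2 = 0;
  while ((1u << log2) < npatterns)
    log2++;
  return (i >> log2) * npatterns;
}

/* New sequence per element: vand.vi with -npatterns.  For a power of two,
   -npatterns == ~(npatterns - 1), i.e. a mask clearing the low bits.  */
static uint32_t
step_and (uint32_t i, uint32_t npatterns)
{
  assert (is_pow2 (npatterns));
  return i & (0u - npatterns);
}
```

With NPATTERNS = 8 both helpers map indices 0..7 to 0 and 8..15 to 8,
matching the stepped vector quoted above.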

> Before this patch:
> li       a6,134221824
> slli     a6,a6,5
> addi     a6,a6,3        ;; 64-bit: 0x0003000200010000
> vmv.v.x  v6,a6          ;; {3, 2, 1, 0, ... }
> vid.v    v4             ;; {0, 1, 2, 3, 4, 5, 6, 7, ... }
> vsrl.vi  v4,v4,2        ;; {0, 0, 0, 0, 1, 1, 1, 1, ... }
> li       a3,4           ;; {4}
> vmul.vx  v4,v4,a3       ;; {0, 0, 0, 0, 4, 4, 4, 4, ... }
> vadd.vv  v4,v4,v6       ;; {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, ... }
> 
> After this patch:
> li       a3,-536875008
> slli     a3,a3,4
> addi     a3,a3,1
> slli     a3,a3,16
> vmv.v.x  v2,a3          ;; {3, 1, -1, -3, ... }
> vid.v    v4             ;; {0, 1, 2, 3, 4, 5, 6, 7, ... }
> vadd.vv  v4,v4,v2       ;; {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, ... }

My immediate idea would have been to fall back to the first
approach, i.e. create the "0x00030002..." constant

> li       a6,134221824
> slli     a6,a6,5
> addi     a6,a6,3        ;; 64-bit: 0x0003000200010000
> vmv.v.x  v6,a6          ;; {3, 2, 1, 0, ... }

and then
  vid.v v4
  vand.vi v4, v4, -4
  vadd.vv v4, v4, v6

It's one more vector instruction, though, so it is possibly worse from a
latency standpoint.
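
As a sanity check of the two variants, a small C model of the per-element
arithmetic (a sketch only; idx_patch and idx_fallback are made-up names,
and the repeating 4-element constants are taken from the assembly
comments above, not from the patch itself):

```c
#include <stdint.h>

/* Patch's sequence: vid.v plus the repeating constant {3, 1, -1, -3}.  */
static int16_t
idx_patch (int i)
{
  static const int16_t diff[4] = { 3, 1, -1, -3 };
  return (int16_t) (i + diff[i % 4]);
}

/* Fallback: (vid.v & -4) plus the repeating constant {3, 2, 1, 0}.
   Assumes I is non-negative.  */
static int16_t
idx_fallback (int i)
{
  static const int16_t rev[4] = { 3, 2, 1, 0 };
  return (int16_t) ((i & -4) + rev[i % 4]);
}
```

Both produce 3, 2, 1, 0, 7, 6, 5, 4, ... for i = 0, 1, 2, ...; the
fallback just needs the extra AND to get there.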

Rest looks good to me.

Regards
 Robin
