Hello,
I'd like to start a discussion on vectorization of loops whose bounds are not
aligned to the vectorization factor (VF). The main target for this
optimization right now is x86's AVX-512, which features per-element embedded
masking for all instructions.
The main goal of this mail is to agree on the overall design of the feature.

This approach was presented at GNU Cauldron 2015 by Ilya Enkovich [1].
 
Here's a sketch of the algorithm:
  1. Add a masking check for basic stmts: whether an index vector and the
     corresponding mask can be introduced
  2. When checking whether statements are vectorizable, we additionally check
     whether stmts need to be and can be masked, and compute the masking cost.
     The result is stored in `stmt_vinfo`.
     We are going to mask only memory accesses and reductions, and modify the
     mask for already masked stmts (mask load, mask store and vector condition)
  3. Make a decision about masking: take the computed costs and the estimated
     iteration count into consideration
  4. Modify prologue/epilogue generation according to the decision made at
     analysis. Three options are available:
    a. Use scalar remainder
    b. Use masked remainder. Won't be supported in the first version
    c. Mask main loop
  5. Support vectorized loop masking:
    - Create stmts for mask generation
    - Support generation of masked vector code (create generic vector code,
      then patch it with masks)
      - Mask loads/stores/vconds/reductions only
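
To make options 4.a and 4.c concrete, here is a rough scalar emulation in C of
what the generated code would compute. It is only a sketch of the idea, not
vectorizer output: `VF`, the function names and the 8-bit mask are assumptions
made up for this example, and the inner `lane` loops stand in for single
vector instructions.

```c
#include <stddef.h>
#include <stdint.h>

#define VF 8 /* assumed vectorization factor, for illustration only */

/* Option 4.a: unmasked vector body followed by a scalar remainder. */
static void add_scalar_remainder(int32_t *dst, const int32_t *a,
                                 const int32_t *b, size_t n)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF)              /* full vector iterations */
        for (size_t lane = 0; lane < VF; lane++)
            dst[i + lane] = a[i + lane] + b[i + lane];
    for (; i < n; i++)                        /* scalar remainder */
        dst[i] = a[i] + b[i];
}

/* Option 4.c: the main loop itself is masked, so no remainder is needed. */
static void add_masked_main_loop(int32_t *dst, const int32_t *a,
                                 const int32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += VF) {
        /* Mask generation: compare the index vector {i, i+1, ...} with n.
           One bit per lane, which fits in 8 bits because VF is 8 here. */
        uint8_t mask = 0;
        for (size_t lane = 0; lane < VF; lane++)
            if (i + lane < n)
                mask |= (uint8_t)(1u << lane);

        /* Masked loads, add, masked store: inactive lanes touch no memory. */
        for (size_t lane = 0; lane < VF; lane++)
            if (mask & (1u << lane))
                dst[i + lane] = a[i + lane] + b[i + lane];
    }
}

/* Masked reduction: inactive lanes leave their partial sum unchanged. */
static int32_t sum_masked_main_loop(const int32_t *a, size_t n)
{
    int32_t acc[VF] = {0};                    /* vector accumulator */
    for (size_t i = 0; i < n; i += VF)
        for (size_t lane = 0; lane < VF; lane++)
            if (i + lane < n)                 /* lane mask */
                acc[lane] += a[i + lane];

    int32_t sum = 0;                          /* final horizontal reduction */
    for (size_t lane = 0; lane < VF; lane++)
        sum += acc[lane];
    return sum;
}
```

Both `add_*` variants should produce the same result for any `n`. The
difference is that the masked variant pays the mask cost on every iteration,
which is why the decision in step 3 has to weigh the computed costs against
the estimated iteration count.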
 
In the first version (targeted at v6) we're not going to support 4.b and loop
mask pack/unpack.
No `pack/unpack` means that masking will be supported only for types with the
same size as the index variable.
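
A small sketch of why the size restriction follows, as I read it: the loop
mask has one element per lane of the index vector, so it can be applied
directly only to vectors with the same number of lanes. The helper names and
the byte sizes used below are hypothetical and only illustrate the check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: number of lanes in one vector of vector_bytes bytes
   holding elements of elem_bytes bytes each. */
static size_t lanes_per_vector(size_t vector_bytes, size_t elem_bytes)
{
    return vector_bytes / elem_bytes;
}

/* Hypothetical check for the first version: without mask pack/unpack, a
   stmt can use the loop mask only if its element type is as wide as the
   index variable, so both vectors have the same number of lanes. */
static bool maskable_without_pack_unpack(size_t elem_bytes, size_t index_bytes)
{
    return elem_bytes == index_bytes;
}
```

For example, with 64-byte vectors a 4-byte index gives 16 mask elements, while
an 8-byte element type has only 8 lanes per vector, so the mask would have to
be unpacked into two masks before it could be used.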
 
[1] - 
https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=Vectorization+for+Intel+AVX-512.pdf

What do you think?

--
Thanks, K
