Hello,

I'd like to start a discussion on vectorization of loops whose bounds are not aligned to VF, i.e. whose iteration count is not a multiple of the vectorization factor. The main target for this optimization right now is x86's AVX-512, which features per-element embedded masking for all instructions. The goal of this mail is to agree on the overall design of the feature.
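To make the problem concrete, here is a minimal sketch in plain C. It is an illustration only, not GCC internals and not the code the vectorizer would emit: the vector operations are emulated lane by lane, and the names `VF`, `add_scalar_remainder` and `add_masked` as well as the choice of 8 lanes are assumptions made for the example. The first function handles the unaligned tail with a scalar remainder loop; the second covers the whole iteration space with one loop by building an index vector and comparing it against the bound to get a per-lane mask.

```c
#include <stddef.h>

#define VF 8  /* assumed vectorization factor (lanes per vector) */

/* Full-width vector body plus a scalar remainder loop for the
   last n % VF iterations. */
void add_scalar_remainder(int *a, const int *b, size_t n)
{
    size_t i = 0;

    /* Main loop: only whole groups of VF elements. */
    for (; i + VF <= n; i += VF)
        for (size_t lane = 0; lane < VF; lane++)  /* stands in for one vector op */
            a[i + lane] += b[i + lane];

    /* Scalar remainder: whatever is left over. */
    for (; i < n; i++)
        a[i] += b[i];
}

/* Masked main loop: every iteration is a vector iteration; lanes whose
   index is past the bound are switched off by the mask. */
void add_masked(int *a, const int *b, size_t n)
{
    for (size_t base = 0; base < n; base += VF) {
        unsigned char mask[VF];

        /* Mask generation: index vector {base, base+1, ...} compared with n. */
        for (size_t lane = 0; lane < VF; lane++)
            mask[lane] = (base + lane) < n;

        /* Masked load, add and masked store: inactive lanes touch no memory. */
        for (size_t lane = 0; lane < VF; lane++)
            if (mask[lane])
                a[base + lane] += b[base + lane];
    }
}
```

Both functions produce the same result for any n. The masked form needs no separate remainder loop, at the price of computing and applying a mask in every iteration, which is why the cost of masking matters for the decision.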
This approach was presented at GNU Cauldron 2015 by Ilya Enkovich [1].

Here's a sketch of the algorithm:

1. Add a check on basic stmts for masking: whether it is possible to introduce an index vector and the corresponding mask.
2. When checking whether statements are vectorizable, additionally check whether stmts need to be masked and can be masked, and compute the masking cost. The result is stored in `stmt_vinfo`. We are going to mask only memory accesses and reductions, and modify the mask for already masked stmts (masked load, masked store and vector condition).
3. Make a decision about masking, taking the computed costs and the estimated iteration count into consideration.
4. Modify prologue/epilogue generation according to the decision made at analysis. Three options are available:
   a. Use a scalar remainder.
   b. Use a masked remainder. Won't be supported in the first version.
   c. Mask the main loop.
5. Support vectorized loop masking:
   - Create stmts for mask generation.
   - Support generation of masked vector code (create generic vector code, then patch it with masks).
   - Mask loads/stores/vconds/reductions only.

In the first version (targeted at v6) we're not going to support 4.b or loop mask pack/unpack. No `pack/unpack` means that masking will be supported only for types with the same size as the index variable.

[1] - https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=Vectorization+for+Intel+AVX-512.pdf

What do you think?

--
Thanks, K