On 14/06/2023 15:29, Richard Biener wrote:


Am 14.06.2023 um 16:27 schrieb Andrew Stubbs <a...@codesourcery.com>:

On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
This implemens fully masked vectorization or a masked epilog for
AVX512 style masks which single themselves out by representing
each lane with a single bit and by using integer modes for the mask
(both is much like GCN).
AVX512 is also special in that it doesn't have any instruction
to compute the mask from a scalar IV like SVE has with while_ult.
Instead the masks are produced by vector compares and the loop
control retains the scalar IV (mainly to avoid dependences on
mask generation, a suitable mask test instruction is available).

This is also sounds like GCN. We currently use WHILE_ULT in the middle end which expands 
to a vector compare against a vector of stepped values. This requires an additional 
instruction to prepare the comparison vector (compared to SVE), but the 
"while_ultv64sidi" pattern (for example) returns the DImode bitmask, so it 
works reasonably well.

Like RVV code generation prefers a decrementing IV though IVOPTs
messes things up in some cases removing that IV to eliminate
it with an incrementing one used for address generation.
One of the motivating testcases is from PR108410 which in turn
is extracted from x264 where large size vectorization shows
issues with small trip loops.  Execution time there improves
compared to classic AVX512 with AVX2 epilogues for the cases
of less than 32 iterations.
size   scalar     128     256     512    512e    512f
     1    9.42   11.32    9.35   11.17   15.13   16.89
     2    5.72    6.53    6.66    6.66    7.62    8.56
     3    4.49    5.10    5.10    5.74    5.08    5.73
     4    4.10    4.33    4.29    5.21    3.79    4.25
     6    3.78    3.85    3.86    4.76    2.54    2.85
     8    3.64    1.89    3.76    4.50    1.92    2.16
    12    3.56    2.21    3.75    4.26    1.26    1.42
    16    3.36    0.83    1.06    4.16    0.95    1.07
    20    3.39    1.42    1.33    4.07    0.75    0.85
    24    3.23    0.66    1.72    4.22    0.62    0.70
    28    3.18    1.09    2.04    4.20    0.54    0.61
    32    3.16    0.47    0.41    0.41    0.47    0.53
    34    3.16    0.67    0.61    0.56    0.44    0.50
    38    3.19    0.95    0.95    0.82    0.40    0.45
    42    3.09    0.58    1.21    1.13    0.36    0.40
'size' specifies the number of actual iterations, 512e is for
a masked epilog and 512f for the fully masked loop.  From
4 scalar iterations on the AVX512 masked epilog code is clearly
the winner, the fully masked variant is clearly worse and
it's size benefit is also tiny.

Let me check I understand correctly. In the fully masked case, there is a 
single loop in which a new mask is generated at the start of each iteration. In 
the masked epilogue case, the main loop uses no masking whatsoever, thus 
avoiding the need for generating a mask, carrying the mask, inserting vec_merge 
operations, etc, and then the epilogue looks much like the fully masked case, 
but unlike smaller mode epilogues there is no loop because the eplogue vector 
size is the same. Is that right?

Yes.

This scheme seems like it might also benefit GCN, in so much as it simplifies 
the hot code path.

GCN does not actually have smaller vector sizes, so there's no analogue to AVX2 
(we pretend we have some smaller sizes, but that's because the middle end can't 
do masking everywhere yet, and it helps make some vector constants smaller, 
perhaps).

This patch does not enable using fully masked loops or
masked epilogues by default.  More work on cost modeling
and vectorization kind selection on x86_64 is necessary
for this.
Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
which could be exploited further to unify some of the flags
we have right now but there didn't seem to be many easy things
to merge, so I'm leaving this for followups.
Mask requirements as registered by vect_record_loop_mask are kept in their
original form and recorded in a hash_set now instead of being
processed to a vector of rgroup_controls.  Instead that's now
left to the final analysis phase which tries forming the rgroup_controls
vector using while_ult and if that fails now tries AVX512 style
which needs a different organization and instead fills a hash_map
with the relevant info.  vect_get_loop_mask now has two implementations,
one for the two mask styles we then have.
I have decided against interweaving vect_set_loop_condition_partial_vectors
with conditions to do AVX512 style masking and instead opted to
"duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
I was split between making 'vec_loop_masks' a class with methods,
possibly merging in the _len stuff into a single registry.  It
seemed to be too many changes for the purpose of getting AVX512
working.  I'm going to play wait and see what happens with RISC-V
here since they are going to get both masks and lengths registered
I think.
The vect_prepare_for_masked_peels hunk might run into issues with
SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
looked odd.
Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
the testsuite with --param vect-partial-vector-usage=2 with and
without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
and one latent wrong-code (PR110237).
There's followup work to be done to try enabling masked epilogues
for x86-64 by default (when AVX512 is enabled, possibly only when
-mprefer-vector-width=512).  Getting cost modeling and decision
right is going to be challenging.
Any comments?
OK?
Btw, testing on GCN would be welcome - the _avx512 paths could
work for it so in case the while_ult path fails (not sure if
it ever does) it could get _avx512 style masking.  Likewise
testing on ARM just to see I didn't break anything here.
I don't have SVE hardware so testing is probably meaningless.

I can set some tests going. Is vect.exp enough?

Well, only you know (from experience), but sure that’s a nice start.

I tested vect.exp for both gcc and gfortran and there were no regressions. I have another run going with the other param settings.

(Side note: vect.exp used to be a nice quick test for use during development, but the tsvc tests are now really slow, at least when run on a single GPU thread.)

I tried some small examples with --param vect-partial-vector-usage=1 (IIUC this prevents masked loops, but not masked epilogues, right?) and the results look good. I plan to do some benchmarking shortly. One comment: building a vector constant {0, 1, 2, 3, ...., 63} results in a very large entry in the constant pool and an unnecessary memory load (it literally has to use this sequence to generate the addresses to load the constant!) Generating the sequence via VEC_SERIES would be a no-op, for GCN, because we have an ABI-mandated register that already holds that value. (Perhaps I have another piece missing here, IDK?)

Andrew

Reply via email to