> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Wednesday, June 14, 2023 10:30 PM
> To: Andrew Stubbs <a...@codesourcery.com>
> Cc: gcc-patches@gcc.gnu.org; richard.sandif...@arm.com; Jan Hubicka
> <hubi...@ucw.cz>; Liu, Hongtao <hongtao....@intel.com>;
> kirill.yuk...@gmail.com
> Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
>
> > On 14.06.2023 at 16:27, Andrew Stubbs <a...@codesourcery.com> wrote:
> >
> > On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
> >> This implements fully masked vectorization or a masked epilog for
> >> AVX512 style masks which single themselves out by representing each
> >> lane with a single bit and by using integer modes for the mask (both
> >> are much like GCN).
> >> AVX512 is also special in that it doesn't have any instruction to
> >> compute the mask from a scalar IV like SVE has with while_ult.
> >> Instead the masks are produced by vector compares and the loop
> >> control retains the scalar IV (mainly to avoid dependences on mask
> >> generation, a suitable mask test instruction is available).
> >
> > This also sounds like GCN. We currently use WHILE_ULT in the middle
> > end which expands to a vector compare against a vector of stepped
> > values. This requires an additional instruction to prepare the
> > comparison vector (compared to SVE), but the "while_ultv64sidi"
> > pattern (for example) returns the DImode bitmask, so it works
> > reasonably well.
> >
> >> Like RVV code generation prefers a decrementing IV though IVOPTs
> >> messes things up in some cases removing that IV to eliminate it with
> >> an incrementing one used for address generation.
> >> One of the motivating testcases is from PR108410 which in turn is
> >> extracted from x264 where large size vectorization shows issues with
> >> small trip loops. Execution time there improves compared to classic
> >> AVX512 with AVX2 epilogues for the cases of less than 32 iterations.
> >> size   scalar     128     256     512    512e    512f
> >>    1     9.42   11.32    9.35   11.17   15.13   16.89
> >>    2     5.72    6.53    6.66    6.66    7.62    8.56
> >>    3     4.49    5.10    5.10    5.74    5.08    5.73
> >>    4     4.10    4.33    4.29    5.21    3.79    4.25
> >>    6     3.78    3.85    3.86    4.76    2.54    2.85
> >>    8     3.64    1.89    3.76    4.50    1.92    2.16
> >>   12     3.56    2.21    3.75    4.26    1.26    1.42
> >>   16     3.36    0.83    1.06    4.16    0.95    1.07
> >>   20     3.39    1.42    1.33    4.07    0.75    0.85
> >>   24     3.23    0.66    1.72    4.22    0.62    0.70
> >>   28     3.18    1.09    2.04    4.20    0.54    0.61
> >>   32     3.16    0.47    0.41    0.41    0.47    0.53
> >>   34     3.16    0.67    0.61    0.56    0.44    0.50
> >>   38     3.19    0.95    0.95    0.82    0.40    0.45
> >>   42     3.09    0.58    1.21    1.13    0.36    0.40
> >> 'size' specifies the number of actual iterations, 512e is for a
> >> masked epilog and 512f for the fully masked loop. From
> >> 4 scalar iterations on the AVX512 masked epilog code is clearly the
> >> winner, the fully masked variant is clearly worse and its size
> >> benefit is also tiny.
> >
> > Let me check I understand correctly. In the fully masked case, there
> > is a single loop in which a new mask is generated at the start of each
> > iteration. In the masked epilogue case, the main loop uses no masking
> > whatsoever, thus avoiding the need for generating a mask, carrying the
> > mask, inserting vec_merge operations, etc, and then the epilogue looks
> > much like the fully masked case, but unlike smaller mode epilogues
> > there is no loop because the epilogue vector size is the same. Is that
> > right?
>
> Yes.

What about the vectorizer combined with unrolling? When the vector size is
the same and the unroll factor is N, there are at most N - 1 iterations left
for the epilogue loop; will there still be a loop?

> > This scheme seems like it might also benefit GCN, in so much as it
> > simplifies the hot code path.
> >
> > GCN does not actually have smaller vector sizes, so there's no
> > analogue to AVX2 (we pretend we have some smaller sizes, but that's
> > because the middle end can't do masking everywhere yet, and it helps
> > make some vector constants smaller, perhaps).
> >
> >> This patch does not enable using fully masked loops or masked
> >> epilogues by default. More work on cost modeling and vectorization
> >> kind selection on x86_64 is necessary for this.
> >> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
> >> which could be exploited further to unify some of the flags we have
> >> right now but there didn't seem to be many easy things to merge, so
> >> I'm leaving this for followups.
> >> Mask requirements as registered by vect_record_loop_mask are kept in
> >> their original form and recorded in a hash_set now instead of being
> >> processed to a vector of rgroup_controls. Instead that's now left to
> >> the final analysis phase which tries forming the rgroup_controls
> >> vector using while_ult and if that fails now tries AVX512 style which
> >> needs a different organization and instead fills a hash_map with the
> >> relevant info. vect_get_loop_mask now has two implementations, one
> >> for the two mask styles we then have.
> >> I have decided against interweaving
> >> vect_set_loop_condition_partial_vectors
> >> with conditions to do AVX512 style masking and instead opted to
> >> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> >> Likewise for vect_verify_full_masking vs
> >> vect_verify_full_masking_avx512.
> >> I was split between making 'vec_loop_masks' a class with methods,
> >> possibly merging in the _len stuff into a single registry. It seemed
> >> to be too many changes for the purpose of getting AVX512 working.
> >> I'm going to play wait and see what happens with RISC-V here since
> >> they are going to get both masks and lengths registered I think.
> >> The vect_prepare_for_masked_peels hunk might run into issues with
> >> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> >> looked odd.
> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. I've run the
> >> testsuite with --param vect-partial-vector-usage=2 with and without
> >> -fno-vect-cost-model and filed two bugs, one ICE (PR110221) and one
> >> latent wrong-code (PR110237).
> >> There's followup work to be done to try enabling masked epilogues for
> >> x86-64 by default (when AVX512 is enabled, possibly only when
> >> -mprefer-vector-width=512). Getting cost modeling and decision right
> >> is going to be challenging.
> >> Any comments?
> >> OK?
> >> Btw, testing on GCN would be welcome - the _avx512 paths could work
> >> for it so in case the while_ult path fails (not sure if it ever does)
> >> it could get _avx512 style masking. Likewise testing on ARM just to
> >> see I didn't break anything here.
> >> I don't have SVE hardware so testing is probably meaningless.
> >
> > I can set some tests going. Is vect.exp enough?
>
> Well, only you know (from experience), but sure that’s a nice start.
>
> Richard
>
> > Andrew
> >
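
For readers following the thread, here is a rough scalar model in plain C of
the two loop shapes discussed above. It is only a sketch, not what the patch
generates: the names lane_mask, masked_add, add_fully_masked and
add_masked_epilogue are made up for illustration, a 16-lane vector is assumed
(16 x 32-bit lanes in 512 bits), and the vector operations are emulated lane
by lane. The mask is an integer with one bit per lane, formed by comparing
lane indices against the remaining iteration count, and loop control stays on
a decrementing scalar IV.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 16 };  /* assumed: 16 x 32-bit lanes in a 512-bit vector */
static_assert (LANES <= 16, "mask must fit in uint16_t");

/* AVX512-style mask: one bit per lane, held in an integer.  Models a
   vector compare of lane indices {0, 1, ..., LANES-1} against the
   broadcast number of remaining scalar iterations.  */
uint16_t
lane_mask (size_t remaining)
{
  uint16_t mask = 0;
  for (unsigned lane = 0; lane < LANES; lane++)
    if (lane < remaining)
      mask |= (uint16_t) (1u << lane);
  return mask;
}

/* One masked vector iteration: only lanes whose mask bit is set
   are read and written.  */
void
masked_add (int *dst, const int *a, const int *b, uint16_t mask)
{
  for (unsigned lane = 0; lane < LANES; lane++)
    if (mask & (1u << lane))
      dst[lane] = a[lane] + b[lane];
}

/* Fully masked loop: a new mask is computed on every iteration, while
   the exit test uses the decrementing scalar IV 'remaining'.  */
void
add_fully_masked (int *dst, const int *a, const int *b, size_t n)
{
  size_t remaining = n;
  while (remaining > 0)
    {
      masked_add (dst, a, b, lane_mask (remaining));
      size_t step = remaining < LANES ? remaining : LANES;
      dst += step;
      a += step;
      b += step;
      remaining -= step;
    }
}

/* Masked epilogue: the main loop is unmasked, and the tail is handled
   by a single masked iteration of the same vector size, so the
   epilogue itself is not a loop.  */
void
add_masked_epilogue (int *dst, const int *a, const int *b, size_t n)
{
  size_t remaining = n;
  for (; remaining >= LANES; remaining -= LANES)
    {
      for (unsigned lane = 0; lane < LANES; lane++)
        dst[lane] = a[lane] + b[lane];
      dst += LANES;
      a += LANES;
      b += LANES;
    }
  if (remaining > 0)
    masked_add (dst, a, b, lane_mask (remaining));
}
```

Both functions produce the same result for any n; the difference is that the
second keeps mask generation out of the hot loop. This model has no unrolling,
so it does not address the question of whether an unrolled main loop would
leave more than one vector iteration for the epilogue.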
RE: [PATCH 3/3] AVX512 fully masked vectorization
Liu, Hongtao via Gcc-patches Wed, 14 Jun 2023 22:51:16 -0700