On Sun, Jan 1, 2023 at 3:54 PM Nikita Zlobin via Gcc <gcc@gcc.gnu.org> wrote:
>
> The vector extension is great, because it allows controllable
> vectorization without dealing with each SIMD ISA separately. When
> properly used, it can give better performance than
> auto-vectorization. However, there's just one issue.
>
> While for the specific SIMD ISAs used as backends for vec-ext it's
> possible to check whether they are supported, there's no similar
> feature for the vector extension itself. The only way to make it
> configurable without manually checking each ISA is to e.g. add a
> configure parameter --vector-size=<bytes>, with good enough commentary
> for the user to understand what should go there (to be specified in
> __attribute__((vector_size(bytes)))).
>
> My first approach was to check whether an autodetected config is
> possible, e.g. with autoconf, in this way (not ideal, just for a start):
>
> gcc -march=native -E -v - < /dev/null 2>&1 | awk 'BEGIN{ arr[0]=0;
> delete arr[0]; } /cc1/{ for (i=1; i<=NF; i++){ if ($i ~ /-mno-/)
> continue; switch ($i){ case /-m(mmx|3dnow|vis)/: arr[8]=1; break; case
> /-m(sse|altivec)/: arr[16]=1; break; case /^-mavx[2]?$/: arr[32]=1;
> break; case /-mavx512/: arr[64]=1; break; } }; for (j in arr) print
> j; }'

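For readers following along, the configure-parameter approach described in the quoted text could look roughly like the sketch below. VECTOR_SIZE, vfloat and add_floats are hypothetical names chosen for illustration (they do not appear in the original message); the only GCC feature relied on is the vector_size attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* VECTOR_SIZE is a hypothetical macro, assumed to be set by the build
   system (e.g. -DVECTOR_SIZE=32 coming from a --vector-size=32 configure
   option). It must be a power of two and a multiple of sizeof(float). */
#ifndef VECTOR_SIZE
#define VECTOR_SIZE 16
#endif

/* Generic GCC vector type holding VECTOR_SIZE bytes of floats. */
typedef float vfloat __attribute__((vector_size(VECTOR_SIZE)));

enum { VFLOAT_LANES = VECTOR_SIZE / sizeof(float) };

/* dst[i] = a[i] + b[i] for i in [0, n). */
static void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    assert(dst && a && b);

    for (; i + VFLOAT_LANES <= n; i += VFLOAT_LANES) {
        vfloat va, vb, vr;
        memcpy(&va, a + i, sizeof va);   /* load without alignment assumptions */
        memcpy(&vb, b + i, sizeof vb);
        vr = va + vb;                    /* element-wise add on the generic vector */
        memcpy(dst + i, &vr, sizeof vr);
    }
    for (; i < n; i++)                   /* scalar tail for leftover elements */
        dst[i] = a[i] + b[i];
}
```

If VECTOR_SIZE is larger than what the target ISA provides, GCC still compiles this, but it has to split each operation into narrower pieces, which is the situation the rest of this thread is about.
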
There's -Wvector-operation-performance which will diagnose cases where
GCC decomposes larger into smaller vectors or even to scalar operations.
That might be of some help here as well.

> However, I discovered that I have no idea how to detect the NEON vector
> size in this way (or even its presence). There was an answer suggesting
> to check feature test macros. After trying this command:
>
> gcc -march=native -dM -E - </dev/null | less
>
> I discovered that other ISAs, like MMX, SSE and AVX, have similar
> feature test macros, e.g. __MMX__, __SSE2__, __AVX__. This means
> that a simple C header with _GNU_SOURCE would be enough to check for
> each ISA without calling functions from the Target Builtins extension.
>
> However, that's not the end. Some ISAs have a limited set of element
> types that can be used in vectors. E.g., MMX and 3DNow! don't support
> integer. This may be an issue if an integer implementation of some code
> performs better than a floating point one (even with the same data
> width). This necessitates real feature test macros representing the
> data types supported by the available SIMD ISAs.
>
> E.g., for simple vector sizes - it could be done with an array (example):
>
> #define __EXT_VECTOR_SIZEV (int[]){64, 128, 256, 512}
>
> with the array length determined as sizeof(vec) / sizeof(vec[0])
>
> But for an exact check of supported data types - there could be variants:
>
> 1. Using per-type feature test macros: __V8SI16__, __V8UI16__,
> __V8F16__, __V4SI32__, __V4UI32__, __V4F32__, __V2SI64__,
> __V2F64__....
> (I discovered on Wikipedia that some ISAs restrict the underlying int
> size to 32 bits, without 64-bit support).
>
> 2. Extend the array of supported lengths into a 2D matrix of supported
> vector size + underlying element type combinations. This could use a
> NULL-terminated array to mark the end of the real value sequence. The
> first subarray represents vector sizes, while each following subarray
> corresponds to a value from the first.
> Their elements are int fields, combining a bit-width
> value with bit flags representing whether it's float/int and (for int)
> signed/unsigned.
>
> Though who knows whether complex numbers could eventually get a chance
> to appear in this list :D . Well, even without that, this could be a
> tricky way.
>
> 3. There could be a variation of the 2nd way, representing per-type
> vector size lists rather than per-vector-size data types. This could be
> more practical, since algorithms would rather need the available vector
> sizes for the specific data types used inside.
>
> As for relying on vector size subdivision when there is no
> corresponding ISA support - I only got worse performance this way.
> Although I'm not sure that it's not a gcc bug: if there are 2
> subvectors existing at the same time, then it could be just too many
> SIMD registers used. While if they are processed in sequence, this
> probably should not worsen performance (I never tried manual code
> intrinsics).

In general you'll figure that writing generic vector code is as hard
as autovectorizing scalar code...

Richard.
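
As an illustration of the per-type idea (variant 3 in the quoted text), the existing ISA feature test macros can already be folded into per-element-type vector sizes in a small header. This is only a sketch: VEC_BYTES_F32, VEC_BYTES_I32, v_f32, v_i32, splat_i32 and hsum_i32 are hypothetical names, not anything GCC defines, and the mapping assumes that AVX alone offers 256-bit floating point but not 256-bit integer arithmetic (which arrives with AVX2).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-type vector sizes in bytes, derived from the ISA
   feature test macros that GCC predefines for the selected -march. */
#if defined(__AVX512F__)
#  define VEC_BYTES_F32 64
#  define VEC_BYTES_I32 64
#elif defined(__AVX2__)
#  define VEC_BYTES_F32 32
#  define VEC_BYTES_I32 32
#elif defined(__AVX__)
#  define VEC_BYTES_F32 32   /* assumption: 256-bit float ops only */
#  define VEC_BYTES_I32 16
#else
   /* SSE2, NEON, AltiVec, or unknown: 16 bytes. On targets without any
      SIMD support GCC lowers these vectors to scalar operations. */
#  define VEC_BYTES_F32 16
#  define VEC_BYTES_I32 16
#endif

typedef float   v_f32 __attribute__((vector_size(VEC_BYTES_F32)));
typedef int32_t v_i32 __attribute__((vector_size(VEC_BYTES_I32)));

static_assert(sizeof(v_f32) == VEC_BYTES_F32, "unexpected float vector size");
static_assert(sizeof(v_i32) == VEC_BYTES_I32, "unexpected int vector size");

enum { V_I32_LANES = VEC_BYTES_I32 / sizeof(int32_t) };

/* Broadcast one scalar into every lane. */
static inline v_i32 splat_i32(int32_t x)
{
    v_i32 v = {0};
    for (int i = 0; i < V_I32_LANES; i++)
        v[i] = x;
    return v;
}

/* Horizontal sum of all lanes. */
static inline int32_t hsum_i32(v_i32 v)
{
    int32_t s = 0;
    for (int i = 0; i < V_I32_LANES; i++)
        s += v[i];
    return s;
}
```

Building code that uses these types with -Wvector-operation-performance is a cheap way to check whether the chosen sizes really map onto native registers for a given -march.
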