On Sun, Jan 1, 2023 at 3:54 PM Nikita Zlobin via Gcc <gcc@gcc.gnu.org> wrote:
>
> The vector extension is great, because it allows controllable
> vectorization without dealing with each SIMD ISA separately. When
> used properly, it can give better performance than
> auto-vectorization. However, there is just one issue.
>
> While it is possible to check whether the specific SIMD ISAs used as
> backends for the vector extension are supported, there is no similar
> feature for the vector extension itself. The only way to make it
> configurable without manually checking each ISA is, e.g., to add a
> configure parameter --vector-size=<bytes>, with a good enough comment
> for the user to understand what should go there (the value to be
> specified in __attribute__((vector_size(bytes)))).
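>
> For illustration, a minimal sketch of what I mean (MY_VEC_BYTES is
> just a placeholder name for whatever such a configure switch would
> define):
>
>   /* MY_VEC_BYTES would come from a --vector-size=<bytes> configure
>      switch; 16 here is only a placeholder default.  */
>   #ifndef MY_VEC_BYTES
>   #define MY_VEC_BYTES 16
>   #endif
>
>   typedef float vf __attribute__((vector_size(MY_VEC_BYTES)));
>
>   void vadd(vf *a, const vf *b, int n)
>   {
>     for (int i = 0; i < n; i++)
>       a[i] += b[i];   /* mapped onto whatever SIMD is enabled */
>   }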
>
> My first approach was to check whether an autodetected config is
> possible, e.g. with autoconf, in a way like this (not ideal, just a
> start):
>
> gcc -march=native -E -v - < /dev/null 2>&1 | awk '/cc1/ {
>   for (i = 1; i <= NF; i++) {
>     if ($i ~ /-mno-/) continue;
>     switch ($i) {                      # switch/case needs gawk
>       case /-m(mmx|3dnow|vis)/: arr[8]  = 1; break;
>       case /-m(sse|altivec)/:   arr[16] = 1; break;
>       case /^-mavx[2]?$/:       arr[32] = 1; break;
>       case /-mavx512/:          arr[64] = 1; break;
>     }
>   }
>   for (j in arr) print j;
> }'

There's -Wvector-operation-performance which will diagnose cases
where GCC decomposes larger vectors into smaller ones or even into
scalar operations.  That might be of some help here as well.
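
For example, something along these lines, compiled with only 128-bit
SIMD enabled, should get such a diagnostic (the exact wording depends
on the GCC version):

  typedef int v16si __attribute__((vector_size(64)));   /* 512 bits */

  v16si add(v16si a, v16si b)
  {
    /* With e.g. only SSE2 available, GCC has to split this into
       several 128-bit (or scalar) operations, which
       -Wvector-operation-performance points out.  */
    return a + b;
  }

  gcc -O2 -msse2 -Wvector-operation-performance -S t.c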

> However, I discovered that I have no idea how to detect the NEON
> vector size this way (or even its presence). There was an answer
> suggesting to check feature test macros. After trying this command:
>
> gcc -march=native -dM -E - </dev/null | less
>
> I discovered that other ISAs, like MMX, SSE and AVX, have similar
> feature test macros, e.g. __MMX__, __SSE2__, __AVX__. This means that
> a simple C header, with _GNU_SOURCE, would be enough to check each
> ISA without calling functions from the target builtins extension.
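>
> A rough sketch of such a header (the ISA macros are the ones GCC
> predefines; MY_VEC_BYTES is again just a made-up name):
>
>   /* Pick the widest vector size the enabled SIMD ISA provides.  */
>   #if defined(__AVX512F__)
>   #  define MY_VEC_BYTES 64
>   #elif defined(__AVX__)
>   #  define MY_VEC_BYTES 32
>   #elif defined(__SSE__) || defined(__ALTIVEC__) || defined(__ARM_NEON)
>   #  define MY_VEC_BYTES 16
>   #elif defined(__MMX__)
>   #  define MY_VEC_BYTES 8
>   #else
>   #  define MY_VEC_BYTES 8   /* no SIMD detected, plain fallback */
>   #endif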
>
> However, that's not the end. Some ISAs support only a limited set of
> element types in their vectors. E.g., plain MMX only handles
> integers, while 3DNow! only adds single-precision floats. This may be
> an issue if an integer implementation of some code performs better
> than a floating-point one (even with the same data width). This
> necessitates real feature test macros representing the data types
> supported by the available SIMD ISAs.
>
> E.g., for plain vector sizes it could be done with an array, for
> example:
>
> #define __EXT_VECTOR_SIZEV (int[]){64, 128, 256, 512}
>
> with the array length determined as sizeof(vec) / sizeof(vec[0])
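>
> Usage would then look roughly like this (assuming such a macro
> existed; it does not today):
>
>   #include <stdio.h>
>
>   /* Hypothetical macro from the proposal above, not defined by GCC.  */
>   #define __EXT_VECTOR_SIZEV (int[]){64, 128, 256, 512}
>
>   int main(void)
>   {
>     int n = sizeof(__EXT_VECTOR_SIZEV) / sizeof(__EXT_VECTOR_SIZEV[0]);
>     for (int i = 0; i < n; i++)
>       printf("supported vector size: %d\n", __EXT_VECTOR_SIZEV[i]);
>     return 0;
>   }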
>
> But for an exact check of the supported data types there could be
> several variants:
>
> 1. Using per-type feature test macros: __V8SI16__, __V8UI16__,
> __V8F16__, __V4SI32__, __V4UI32__, __V4F32__, __V2SI64__,
> __V2F64__...
> (I discovered on Wikipedia that some ISAs restrict the underlying
> integer size to 32 bits, without 64-bit support.)
>
> 2. Extend the array of supported lengths into a 2D matrix of
> supported vector size + underlying element type combinations. This
> could use zero-terminated arrays to mark the end of the sequence of
> real values. The first subarray represents the vector sizes, while
> each following subarray corresponds to a value from the first one.
> Their elements are int fields combining the bit width with bit flags
> telling whether the type is float/int and (for int) signed/unsigned.
>
> Though who knows, maybe complex numbers could eventually have a
> chance to appear in this list :D . Well, even without that, this
> would be a tricky way.
>
> 3. There could be a variation of the 2nd way, representing per-type
> lists of vector sizes rather than per-vector-size lists of data
> types. This could be more practical, since algorithms would rather
> need the available vector sizes for the specific data types they use
> (a rough sketch follows below).
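>
> Just to show the shape of the 3rd variant (all of these names are
> made up, nothing like them exists in GCC today):
>
>   /* Hypothetical per-type macros listing supported vector sizes in
>      bits, terminated by 0.  */
>   #define __EXT_VECTOR_SIZES_INT32 (int[]){128, 256, 0}
>   #define __EXT_VECTOR_SIZES_FLT32 (int[]){128, 256, 512, 0}
>
>   static int widest_int32_vector(void)
>   {
>     int best = 0;
>     for (const int *p = __EXT_VECTOR_SIZES_INT32; *p != 0; p++)
>       if (*p > best)
>         best = *p;
>     return best;   /* in bits; 0 means no SIMD support for int32 */
>   }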
>
> As for relying on vector size subdivision when there is no
> corresponding ISA support, I only got worse performance this way.
> Although I'm not sure it isn't a gcc bug: if two subvectors exist at
> the same time, it could simply be that too many SIMD registers are
> used, while if they are processed in sequence, this probably should
> not hurt performance (I never tried writing intrinsics by hand).

In general you'll figure that writing generic vector code is as hard
as autovectorizing scalar code...

Richard.
