On Fri, May 31, 2024 at 10:58 AM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi,
> I've recently been trying to hand-write code to trigger automatic
> vectorization optimizations in GCC on Intel x86 machines (without
> using the interfaces in immintrin.h), but I'm running into a problem
> where I can't seem to get the concise `vpmovzxbd` or similar
> instructions.
>
> My requirement is to convert 8 `uint8_t` elements to `int32_t` type
> and print the output. If I use the interface (_mm256_cvtepu8_epi32) in
> immintrin.h, the code is as follows:
>
> int immintrin () {
>     int size = 10000, offset = 3;
>     uint8_t* a = malloc(sizeof(*a) * size);
>
>     __v8si b = (__v8si)_mm256_cvtepu8_epi32(*(__m128i *)(a + offset));
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
>
>     free(a);
>     return 0;
> }
>
> After compiling with -mavx2 -O3, you can get concise and efficient
> instructions. (You can see it here: https://godbolt.org/z/8ojzdav47)
>
> But if I do not use this interface and instead use a for-loop or the
> `__builtin_convertvector` interface provided by GCC, I cannot achieve
> the above effect. The code is as follows:
>
> typedef uint8_t v8qiu __attribute__ ((__vector_size__ (8)));
> int forloop () {
>     int size = 10000, offset = 3;
>     uint8_t* a = malloc(sizeof(*a) * size);
>
>     __v8si b = {};
>     /* Element-wise zero-extension; left to the auto-vectorizer. */
>     for (int i = 0; i < 8; i++) {
>         b[i] = (a + offset)[i];
>     }
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
>
>     free(a);
>     return 0;
> }
>
> int builtin_cvt () {
>     int size = 10000, offset = 3;
>     uint8_t* a = malloc(sizeof(*a) * size);
>
>     v8qiu av = *(v8qiu *)(a + offset);
>     __v8si b = __builtin_convertvector(av, __v8si);
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
>
>     free(a);
>     return 0;
> }
>
> The instructions generated by both functions are redundant and
> complex, and are quite difficult to read compared to calling
> `_mm256_cvtepu8_epi32` directly. (You can see it here as well:
> https://godbolt.org/z/8ojzdav47)
>
> What I want to ask is: How should I write the source code to get
> assembly instructions similar to directly calling
> _mm256_cvtepu8_epi32?
>
> Or would it be easier if I modified the GIMPLE directly? But it seems
> that there is no relevant expression or interface directly
> corresponding to `vpmovzxbd` in GIMPLE.
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652484.html
We're working on a patch to optimize __builtin_convertvector; once it
is in, the generated code should be as good as with the Intel intrinsic.
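
For reference, here is a minimal sketch of the source form such an
optimization would apply to. It is not taken from the patch. The type
names u8x8/i32x8 and the helper widen_u8x8 are made up for illustration,
and memcpy is used for the load and store so that an unaligned pointer
such as a + offset is handled safely. Whether it compiles down to a
single vpmovzxbd depends on the GCC version and flags (-O3 -mavx2
assumed here).

```c
#include <stdint.h>
#include <string.h>

/* GCC/Clang vector extensions; these type names are hypothetical. */
typedef uint8_t u8x8  __attribute__((vector_size(8)));
typedef int32_t i32x8 __attribute__((vector_size(32)));

/* Zero-extend the 8 bytes at src (any alignment) into 8 int32_t at dst. */
void widen_u8x8(const uint8_t *src, int32_t *dst)
{
    u8x8 in;
    memcpy(&in, src, sizeof in);                     /* unaligned-safe load  */
    i32x8 out = __builtin_convertvector(in, i32x8);  /* u8 -> i32, per lane  */
    memcpy(dst, &out, sizeof out);                   /* unaligned-safe store */
}
```

Calling widen_u8x8(a + offset, result) with an int32_t result[8] would
stand in for the load and conversion in builtin_cvt above.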
>
> Thanks
> Hanke Zhang



-- 
BR,
Hongtao
