Issue 152430
Summary Missed optimization and unexpected compilation difference when using inlined function to cast int32_t to uint8_t
Labels new issue
Assignees
Reporter kasper93
    Hi,

I wasn't sure what a good title for this issue would be. Basically, the narrowing in `uint8_t min = mpeg_min;`, where `mpeg_min` is an `int32_t`, is ignored for vectorization unless it is done via a `static inline` function call, which is itself fully inlined.

The version vectorized on `uint8_t` is significantly faster than the one using `int32_t`, and I had to add a dummy `static inline` function to make LLVM produce the better code. That is unexpected because, for what it's worth, both versions should compile to the same code. I tried `__builtin_assume`, but it doesn't change anything except the initial cast. I assume that LLVM decides to ignore the variable's range, even when it is explicitly narrowed, and uses the `int32_t` value directly, which hurts performance.

If you look at the IR, you can see that the fast version vectorizes on `i8`:
``` llvm
 %broadcast.splatinsert = insertelement <16 x i8> poison, i8 %conv, i64 0
```

The slow version, by contrast, operates on `i32`:
``` llvm
 %broadcast.splatinsert = insertelement <16 x i32> poison, i32 %mpeg_min, i64 0
```

This is just one line; see the full output for the rest.

The code is at https://godbolt.org/z/aeG8zbE9x, along with a diff, and is also attached below.

Compiled with `clang -O3`.

## Fast version
``` c
#include <stddef.h>
#include <stdint.h>

static inline int fast_impl(const uint8_t *data, ptrdiff_t stride,
                            ptrdiff_t width, ptrdiff_t height,
                            uint8_t mpeg_min, uint8_t mpeg_max)
{
    while (height--) {
        uint8_t cond = 0;
        for (int x = 0; x < width; x++) {
            const uint8_t val = data[x];
            cond |= val < mpeg_min || val > mpeg_max;
        }
        if (cond)
            return 1;
        data += stride;
    }
    return 0;
}

int fast(const uint8_t *data, ptrdiff_t stride,
         ptrdiff_t width, ptrdiff_t height,
         int mpeg_min, int mpeg_max)
{
    __builtin_assume(mpeg_min >= 0 && mpeg_min <= UINT8_MAX);
    __builtin_assume(mpeg_max >= 0 && mpeg_max <= UINT8_MAX);
    /* The int -> uint8_t narrowing happens at this call boundary. */
    return fast_impl(data, stride, width, height, mpeg_min, mpeg_max);
}
```

## Slow version
``` c
#include <stddef.h>
#include <stdint.h>

int slow(const uint8_t *data, ptrdiff_t stride, 
         ptrdiff_t width, ptrdiff_t height,
         int32_t mpeg_min, int32_t mpeg_max)
{
    __builtin_assume(mpeg_min >= 0 && mpeg_min <= UINT8_MAX);
    __builtin_assume(mpeg_max >= 0 && mpeg_max <= UINT8_MAX);
    uint8_t min = mpeg_min;
    uint8_t max = mpeg_max;
    while (height--) {
        uint8_t cond = 0;
        for (int x = 0; x < width; x++) {
            const uint8_t val = data[x];
            cond |= val < min || val > max;
        }
        if (cond)
            return 1;
        data += stride;
    }
    return 0;
}
```
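
## Equivalence check (sketch)

To back up the claim that both formulations should behave the same for in-range bounds, so that the difference is purely a codegen matter, a sanity check could look roughly like the sketch below. The names `check_narrow`, `check_local` and `variants_agree` are hypothetical and not part of the original reproducer, and `assert` stands in for `__builtin_assume` only to keep the sketch portable; it is not meant to reproduce the vectorization difference itself.

``` c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bounds arrive already narrowed as uint8_t
 * parameters, mirroring the shape of the fast version. */
static inline int check_narrow(const uint8_t *data, ptrdiff_t stride,
                               ptrdiff_t width, ptrdiff_t height,
                               uint8_t lo, uint8_t hi)
{
    while (height--) {
        uint8_t cond = 0;
        for (int x = 0; x < width; x++) {
            const uint8_t val = data[x];
            cond |= val < lo || val > hi;
        }
        if (cond)
            return 1;
        data += stride;
    }
    return 0;
}

/* Hypothetical helper: bounds arrive as int32_t and are narrowed into
 * locals, mirroring the shape of the slow version. */
static int check_local(const uint8_t *data, ptrdiff_t stride,
                       ptrdiff_t width, ptrdiff_t height,
                       int32_t mpeg_min, int32_t mpeg_max)
{
    /* Portable stand-in for the __builtin_assume range hints. */
    assert(mpeg_min >= 0 && mpeg_min <= UINT8_MAX);
    assert(mpeg_max >= 0 && mpeg_max <= UINT8_MAX);
    const uint8_t lo = (uint8_t)mpeg_min;
    const uint8_t hi = (uint8_t)mpeg_max;
    while (height--) {
        uint8_t cond = 0;
        for (int x = 0; x < width; x++) {
            const uint8_t val = data[x];
            cond |= val < lo || val > hi;
        }
        if (cond)
            return 1;
        data += stride;
    }
    return 0;
}

/* Returns 1 if both helpers give the same answer on the given plane
 * for every pair of bounds in [0, UINT8_MAX], and 0 otherwise. */
static int variants_agree(const uint8_t *data, ptrdiff_t stride,
                          ptrdiff_t width, ptrdiff_t height)
{
    for (int lo = 0; lo <= UINT8_MAX; lo++) {
        for (int hi = 0; hi <= UINT8_MAX; hi++) {
            int a = check_narrow(data, stride, width, height,
                                 (uint8_t)lo, (uint8_t)hi);
            int b = check_local(data, stride, width, height, lo, hi);
            if (a != b)
                return 0;
        }
    }
    return 1;
}
```

Calling `variants_agree` on a small test plane exercises all 65536 bound pairs; any mismatch would point at a semantic difference rather than a missed optimization.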

Thanks,
Kacper