https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115886

            Bug ID: 115886
           Summary: 4 different ways of implementing concat with stride
                    produce different results
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
            Target: aarch64

Take:
```
#define vec4 __attribute__((vector_size(4)))
#define vec8 __attribute__((vector_size(8)))
#define vec16 __attribute__((vector_size(16)))

#if defined(__clang__) || __GNUC__ >= 12
vec8 unsigned char
concat (unsigned char * pix1)
{
  vec4 int t = *(vec4 int*)&pix1[0];
  vec4 int t_ = *(vec4 int*)&pix1[16];
  vec8 int t_t = __builtin_shufflevector (t, t_, 0, 1);//, 2, 3, 4, 5, 6, 7);
  vec8 unsigned char t_t_ = (vec8 unsigned char)t_t;
  return t_t_;
}


vec8 unsigned char
concat2 (unsigned char * pix1)
{
  vec4 char t = *(vec4 char*)&pix1[0];
  vec4 char t_ = *(vec4 char*)&pix1[16];
  vec8 char t_t = __builtin_shufflevector (t, t_, 0, 1, 2, 3, 4, 5, 6, 7);
  vec8 unsigned char t_t_ = (vec8 unsigned char)t_t;
  return t_t_;
}

#endif

vec8 unsigned char
concat3 (unsigned char * pix1)
{
  int t = *(int*)&pix1[0];
  int t_ = *(int*)&pix1[16];
  vec8 int t_t = {t, t_};
  vec8 unsigned char t_t_ = (vec8 unsigned char)t_t;
  return t_t_;
}

vec8 unsigned char
concat4 (unsigned char * pix1)
{
  int t = *(int*)&pix1[0];
  int t_ = *(int*)&pix1[16];
  vec8 int t1 = {t, 0};
  t1[1] = t_;
  return (vec8 unsigned char)t1;
}
```

All 4 functions should produce the same assembly.
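
For reference, the value all four functions are expected to return can be written without the pointer casts. The following is only a sketch: `concat_ref` and the `v8qi` typedef are hypothetical names that are not part of the testcase, and it assumes `pix1` points to at least 20 readable bytes.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical typedef, equivalent to "vec8 unsigned char" in the testcase. */
typedef unsigned char v8qi __attribute__((vector_size(8)));

/* Hypothetical reference helper (not from the report): builds the 8-byte
   result with memcpy instead of the int/vector pointer casts.  */
v8qi
concat_ref (const unsigned char *pix1)
{
  v8qi r;
  assert (pix1 != NULL);
  memcpy (&r, pix1, 4);                    /* low half:  pix1[0..3]   */
  memcpy ((char *) &r + 4, pix1 + 16, 4);  /* high half: pix1[16..19] */
  return r;
}
```

This is meant only as a way to check the results of concat, concat3 and concat4 against each other; it says nothing about which instruction sequence GCC should pick.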

LLVM produces:
```
        ldr     s0, [x0]
        add     x8, x0, #16
        ld1     { v0.s }[1], [x8]
        ret
```
for all 4, while GCC produces that only for concat4.
For the other 2 (concat and concat3), GCC produces:
```
concat:
        ldr     s31, [x0]
        ldr     s0, [x0, 16]
        zip1    v0.2s, v31.2s, v0.2s
        ret
...
concat3:
        ldr     s0, [x0]
        ldr     s31, [x0, 16]
        uzp1    v0.2s, v0.2s, v31.2s
        ret
```

Note: ignore concat2 for now, since it requires V4QI support, which is not
implemented yet.

I don't know if using ld1 or zip1 is better here though.

Note: this shows up in SPEC (x264); see PR 115252.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
