I've been investigating performance inconsistencies in some vector code when compiled with clang.

Trying to boil it down to a minimal example, I'm puzzled by the following:

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* clang/GCC vector extension: four uint32_t lanes in a 16-byte vector: */
  typedef uint32_t uint32x4_t __attribute__((vector_size(16)));
  static uint8_t buffer[1 << 28];

  int main(void) {
    CLASS uint32x4_t state = { 1, 2, 3, 4 }; /* CLASS: empty or 'static' */

    for (size_t i = 0; i < sizeof(buffer); i++)
      buffer[i] = i;

    double start = (double) clock() / CLOCKS_PER_SEC;
    for (size_t j = 0; j < sizeof(buffer) - 15; j += 16) {
      /* XOR in a chunk of buffer ignoring endianness: */
      for (uint8_t i = 0; i < 16; i++)
        ((uint8_t *) &state)[i] ^= buffer[j + i];
      /* Do some random vector work on top of each chunk: */
      state = state * state | (uint32x4_t) { 5, 5, 5, 5 };
    }

    double finish = (double) clock() / CLOCKS_PER_SEC;
    printf("%0.1f MB/s: ", sizeof(buffer) / (finish - start) / (1 << 20));
    printf("%08x,%08x,%08x,%08x\n", state[0], state[1], state[2], state[3]);
    return EXIT_SUCCESS;
  }


This program runs at a quarter of the speed when compiled with -DCLASS= (so the state vector has automatic storage) compared to when it is compiled with -DCLASS=static (so the state vector is static):

  $ clang -Wall -DCLASS= -O3 test.c -o test && ./test
  819.7 MB/s: 71453c4d,b15df5e5,64535a1d,68709dd5

  $ clang -Wall -DCLASS=static -O3 test.c -o test && ./test
  3519.1 MB/s: 71453c4d,b15df5e5,64535a1d,68709dd5

It is also fast if the state vector is moved out to global/file scope. The behaviour is the same with two different x86-64 clangs on two different OSes:

  $ clang --version
  Alpine clang version 10.0.1
  Target: x86_64-alpine-linux-musl
  Thread model: posix
  InstalledDir: /usr/bin

  $ clang --version
  Apple LLVM version 10.0.1 (clang-1001.0.46.4)
  Target: x86_64-apple-darwin18.7.0
  Thread model: posix
  InstalledDir: /Library/Developer/CommandLineTools/usr/bin


I know the inner 16-byte loop is silly and could be written as a single vector XOR against a chunk of buffer cast to the vector type, but this is heavily boiled down. (The real code isn't really amenable to being transformed like that. It also does a lot more vector work, so the performance effect there is more subtle.)
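
For concreteness, the transformation I mean is something like the sketch below (using __builtin_memcpy for the unaligned 16-byte load; since XOR works byte-wise, it gives the same result as the byte loop regardless of endianness):

  for (size_t j = 0; j < sizeof(buffer) - 15; j += 16) {
    uint32x4_t chunk;
    __builtin_memcpy(&chunk, &buffer[j], 16); /* one 16-byte load */
    state ^= chunk;                           /* one vector XOR */
    state = state * state | (uint32x4_t) { 5, 5, 5, 5 };
  }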

Despite the byte-wise inner loop, the compiler does a superb job when the state vector is declared static or global: it keeps state in a vector register without writing to memory at all, and optimises the XOR down to a single vector operation.

Is there any way I can coax it into compiling the auto state as efficiently as the static one? Is there something I've underspecified here, so I 'get lucky' in the one case?

Many thanks in advance for any help or pointers anyone can offer.

Best wishes,

Chris.


PS One thing I wonder is whether there's a cleaner way to access the bytes of the vector as lvalues that will optimise more consistently, but I can't see an alternative that doesn't perform worse. For example, one experiment I tried was to replace uint32x4_t state with a union:

  typedef uint8_t uint8x16_t __attribute__((vector_size(16)));

  static union {
    uint32x4_t u32;
    uint8x16_t u8;
  } state;

and use state.u8[i] ^= ... to update the bytes instead of ((uint8_t *) &state)[i] ^= ...
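
Spelled out (my reconstruction of that experiment, keeping the structure of the minimal example above), the hot loop becomes:

  for (size_t j = 0; j < sizeof(buffer) - 15; j += 16) {
    for (uint8_t i = 0; i < 16; i++)
      state.u8[i] ^= buffer[j + i];  /* vector element as an lvalue */
    state.u32 = state.u32 * state.u32 | (uint32x4_t) { 5, 5, 5, 5 };
  }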

But this makes it always slow (like the auto case) instead of always fast (like the static case).
