https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116684

            Bug ID: 116684
           Summary: [vectorization][x86-64] dot_16x1x16_uint8_int8_int32
                    could be better optimized
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: burnus at gcc dot gnu.org
  Target Milestone: ---
            Target: x86-64

Found at https://www.darpa.mil/attachments/MOCHA%20Proposers%20Day%20Slides.pdf
page 12:

"VeGen: ML generated code is more compact and efficient"
"A matrix-vector multiplication kernel from TVM"
("VeGen: Vectorizer Generator", "TVM: Tensor Virtual Machine")

The VeGen code they show is just:
------------------
vmovdqu64 zmm0, [rdx]
vpbroadcastd zmm1, [rdi]
vpdpbusd zmm0, zmm0, [rsi]
vmovdqu64 [rdx], zmm0
------------------

The slides also show snippets for GCC, LLVM and Intel; those are similar to, but not identical with, the output I get when trying gcc, clang/llvm and icx (not icc). In any case, here is what I got:
https://godbolt.org/z/KGaMvbx68
On Godbolt, the LLVM code is about half as long as GCC's, and Intel's is about the same size as GCC's - and all of them are far longer than the four-instruction VeGen snippet shown above.


Disclaimer: For -Os, the VeGen code is surely the best. However, I have not checked which of the generated versions is actually the fastest.

I don't know whether this PR will help or not, but I decided to file it before it falls through the cracks.


#include <stdint.h>

void
dot_16x1x16_uint8_int8_int32(
   uint8_t data[restrict 4],
   int8_t kernel[restrict 16][4],
   int32_t output[restrict 16])
{
  for (int i = 0; i < 16; i++)
    for (int k = 0; k < 4; k++)
      output[i] += data[k] * kernel[i][k];
}
