https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116684
Bug ID:           116684
Summary:          [vectorization][x86-64] dot_16x1x16_uint8_int8_int32 could be better optimized
Product:          gcc
Version:          15.0
Status:           UNCONFIRMED
Keywords:         missed-optimization
Severity:         normal
Priority:         P3
Component:        middle-end
Assignee:         unassigned at gcc dot gnu.org
Reporter:         burnus at gcc dot gnu.org
Target Milestone: ---
Target:           x86-64

Found at https://www.darpa.mil/attachments/MOCHA%20Proposers%20Day%20Slides.pdf
page 12: "VeGen: ML generated code is more compact and efficient", "A
matrix-vector multiplication kernel from TVM" ("VeGen: Vectorizer Generator";
"TVM: Tensor Virtual Machine").

The VeGen code they show is just:
------------------
vmovdqu64     zmm0, [rdx]
vpbroadcastd  zmm1, [rdi]
vpdpbusd      zmm0, zmm0, [rsi]
vmovdqu64     [rdx], zmm0
------------------

They also show snippets for GCC, LLVM, and Intel, but those look similar to,
yet different from, the ones I get when trying with gcc, clang/llvm, and icx
(not icc). In any case, here is what I got: https://godbolt.org/z/KGaMvbx68

On Godbolt, the LLVM code is half as long as GCC's, and Intel's is about the
same size as GCC's; all are far longer than the VeGen snippet shown above.

Disclaimer: For -Os, the VeGen code is surely the best, but I have not checked
which of the generated codes is actually fastest. I also do not know whether
this PR helps or not, but before it falls through the cracks, I decided to
submit it.

#include <stdint.h>

void dot_16x1x16_uint8_int8_int32(
    uint8_t data[restrict 4],
    int8_t kernel[restrict 16][4],
    int32_t output[restrict 16])
{
  for (int i = 0; i < 16; i++)
    for (int k = 0; k < 4; k++)
      output[i] += data[k] * kernel[i][k];
}