https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87767
--- Comment #5 from Matthias Kretz <kretz at kde dot org> ---
> So for #c3 you are essentially asking for a .rodata size optimization.

Comment #1 also asks for that, no? But yes, this is a .rodata size
optimization and thus a potentially visible reduction in cache pressure.
Consider an AVX-512 math function that requires 20 float constants: with the
optimization, 80 bytes (1 1/4 cache lines) suffice; without it, 1280 bytes
(20 cache lines) are required.

> The problem [...] many define_insns we have non-EVEX variants mixed with EVEX
> variants [...]

I see. So the implementation is non-trivial for 16- and 32-byte vectors but
should be doable for 64-byte (zmm) vectors? The set of instructions where
broadcast works seems guessable. Quoting the Intel docs:

"
2.6.7 Embedded Broadcast Support in EVEX

EVEX encodes an embedded broadcast functionality that is supported on many
vector instructions with 32-bit (double word or single-precision
floating-point) and 64-bit data elements, and when the source operand is from
memory. EVEX.b (P[20]) bit is used to enable broadcast on load-op
instructions. When enabled, only one element is loaded from memory and
broadcasted to all other elements instead of loading the full memory size.
The following instruction classes do not support embedded broadcasting:
• Instructions with only one scalar result is written to the vector
  destination.
• Instructions with explicit broadcast functionality provided by its opcode.
• Instruction semantic is a pure load or a pure store operation.
"

Starting with AVX, the vbroadcast* instructions could also be used for the
.rodata size optimization (the performance implications are not obvious to
me, but maybe it's the right optimization for -Os in any case?).

The other relevant (missing?) .rodata optimization is to combine vector
constants of different sizes (and scalars):

auto f(double a) { return a + 1.2; }
auto f(double a [[gnu::vector_size(16)]]) { return a * 1.2; }
auto f(double a [[gnu::vector_size(32)]]) { return a * 1.2; }
auto f(double a [[gnu::vector_size(64)]]) { return a * 1.2; }

should produce (again, possibly only at -Os):

f(double):
        vaddsd  .LC0(%rip), %xmm0, %xmm0
        ret
f(double __vector(2)):
        vmulpd  .LC0(%rip), %xmm0, %xmm0
        ret
f(double __vector(4)):
        vmulpd  .LC0(%rip), %ymm0, %ymm0
        ret
f(double __vector(8)):
        vmulpd  .LC0(%rip), %zmm0, %zmm0
        ret
.LC0:
        .long   858993459
        .long   1072902963
        .long   858993459
        .long   1072902963
        .long   858993459
        .long   1072902963
        .long   858993459
        .long   1072902963
        .long   858993459
        .long   1072902963
        .long   858993459
        .long   1072902963
        .long   858993459
        .long   1072902963
        .long   858993459
        .long   1072902963

but GCC instead emits one constant per overload of f (cf.
https://godbolt.org/z/SDr7jG).
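Going back to the embedded-broadcast case, here is a minimal sketch of the
.rodata saving for a single 64-byte operand. The function g, the label .LC1
and the exact assembly are illustrative only (not GCC output); the broadcast
operand is written in gas {1to16} syntax:

auto g(float a [[gnu::vector_size(64)]]) { return a * 1.2f; }

today needs a full 64-byte .rodata constant, roughly:

g(float __vector(16)):
        vmulps  .LC1(%rip), %zmm0, %zmm0
        ret
.LC1:
        .long   1067030938      # 0x3f99999a == 1.2f
        # ... 15 more identical .long directives, 64 bytes total

whereas with an EVEX embedded-broadcast operand a single 4-byte element would
suffice:

g(float __vector(16)):
        vmulps  .LC1(%rip){1to16}, %zmm0, %zmm0
        ret
.LC1:
        .long   1067030938      # 4 bytes of .rodata instead of 64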
Finally, a quote from the Intel ORM (Version 040, 15.9.2):

  In Skylake Server microarchitecture, a broadcast instruction with a memory
  operand of 32 bits or above is executed on the load ports; it is not
  executed on port 5 as other shuffles are. Alternative 2 in the following
  example shows how executing the broadcast on the load ports reduces the
  workload on port 5 and increases performance. Alternative 3 shows how
  embedded broadcast benefits from both executing the broadcast on the load
  ports and micro fusion.

Example 15-13. Broadcast Executed on Load Ports Alternatives

Alternative 1: 32-bit Load and Register Broadcast
loop:
        vmovd         xmm0, [rax]
        vpbroadcastd  zmm0, xmm0
        vpaddd        zmm2, zmm1, zmm0
        vpermd        zmm2, zmm3, zmm2
        inc           rax
        sub           rdx, 0x1
        jnz           loop
-> Baseline: 1x

Alternative 2: Broadcast with a 32-bit Memory Operand
loop:
        vpbroadcastd  zmm0, [rax]
        vpaddd        zmm2, zmm1, zmm0
        vpermd        zmm2, zmm3, zmm2
        inc           rax
        sub           rdx, 0x1
        jnz           loop
-> Speedup: 1.57x

Alternative 3: 32-bit Embedded Broadcast
loop:
        vpaddd        zmm2, zmm1, [rax]{1to16}
        vpermd        zmm2, zmm3, zmm2
        inc           rax
        sub           rdx, 0x1
        jnz           loop
-> Speedup: 1.9x
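For reference, a rough C counterpart of the Example 15-13 kernel (my own
sketch, not from the ORM; the function and parameter names are made up, and
the ORM loop walks the input byte by byte while this one walks it int by
int). Which of the three alternatives a compiler picks depends on how well it
folds the broadcast into the load-op:

#include <immintrin.h>

// Broadcast one 32-bit element, add a constant vector, then permute.
void add_permute(const int* src, __m512i bias, __m512i perm,
                 __m512i* dst, long n)
{
    for (long i = 0; i < n; ++i) {
        __m512i b = _mm512_set1_epi32(src[i]);        // broadcast one dword
        __m512i s = _mm512_add_epi32(bias, b);        // candidate for a {1to16} load-op
        dst[i] = _mm512_permutexvar_epi32(perm, s);   // vpermd, a port-5 shuffle
    }
}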