https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87767

--- Comment #5 from Matthias Kretz <kretz at kde dot org> ---
> So for #c3 you are essentially asking for a .rodata size optimization.

Comment #1 also does so, no? But yes, this is a .rodata optimization and thus
potentially a visible reduction in cache pressure. Consider a math function for
AVX512 that requires 20 float constants: with the optimization, 80 bytes (1 1/4
cache lines) suffice; without it, 1280 bytes (20 cache lines) are required.
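
To make the saving concrete, here is a minimal sketch (my own example; the
constants are made up and not from any real math routine): ideally each
_mm512_set1_ps constant below costs 4 bytes of .rodata and is folded into the
FMA as an embedded broadcast, instead of being expanded to a full 64-byte
vector constant.

#include <immintrin.h>

// Evaluates c0 + x*(c1 + x*c2) on 16 floats. With embedded broadcasts the
// three constants need 12 bytes of .rodata; without, they need 192 bytes.
__m512 poly(__m512 x) {
    const __m512 c0 = _mm512_set1_ps(0.25f);
    const __m512 c1 = _mm512_set1_ps(0.5f);
    const __m512 c2 = _mm512_set1_ps(2.0f);
    return _mm512_fmadd_ps(_mm512_fmadd_ps(c2, x, c1), x, c0);
}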

> The problem [...] many define_insns we have non-EVEX variants mixed with EVEX 
> variants [...]

I see. So the implementation is non-trivial for 16- and 32-byte vectors but
should be doable for 64-byte (zmm) vectors?
The set of instructions where embedded broadcast works seems guessable. Quoting
the Intel docs:
"
2.6.7 Embedded Broadcast Support in EVEX
EVEX encodes an embedded broadcast functionality that is supported on many
vector instructions with 32-bit (double word or single-precision
floating-point)  and 64-bit data elements, and when the source operand is from
memory. EVEX.b (P[20]) bit is used to enable broadcast on load-op instructions.
When enabled, only one element is loaded from memory and broadcasted to all
other elements instead of loading the full memory size.

The following instruction classes do not support embedded broadcasting:
• Instructions with only one scalar result is written to the vector
destination.
• Instructions with explicit broadcast functionality provided by its opcode.
• Instruction semantic is a pure load or a pure store operation.
"

Starting with AVX, the vbroadcast* instructions could also be used for the
.rodata size optimization (the performance implications are not obvious to me,
but maybe it's the right optimization for -Os in any case?).
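
A minimal sketch of that idea (my own example): the repeated-1.2 constant
could be a single 8-byte .rodata entry loaded with vbroadcastsd instead of a
full 32-byte vector constant.

#include <immintrin.h>

// Candidate for  vbroadcastsd ymm1, QWORD PTR .LC0[rip]  from an 8-byte
// constant, rather than a 32-byte full-width load.
__m256d scale(__m256d x) {
    return _mm256_mul_pd(x, _mm256_set1_pd(1.2));
}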

The other relevant (missing?) .rodata optimization is to combine vector
constants of different size (and scalars):
auto f(double a) {
    return a + 1.2;
}
auto f(double a [[gnu::vector_size(16)]]) {
    return a * 1.2;
}
auto f(double a [[gnu::vector_size(32)]]) {
    return a * 1.2;
}
auto f(double a [[gnu::vector_size(64)]]) {
    return a * 1.2;
}

should produce (again, possibly only on -Os):
f(double):
  vaddsd .LC0(%rip), %xmm0, %xmm0
  ret
f(double __vector(2)):
  vmulpd .LC0(%rip), %xmm0, %xmm0
  ret
f(double __vector(4)):
  vmulpd .LC0(%rip), %ymm0, %ymm0
  ret
f(double __vector(8)):
  vmulpd .LC0(%rip), %zmm0, %zmm0
  ret
.LC0:
  .long 858993459
  .long 1072902963
  .long 858993459
  .long 1072902963
  .long 858993459
  .long 1072902963
  .long 858993459
  .long 1072902963
  .long 858993459
  .long 1072902963
  .long 858993459
  .long 1072902963
  .long 858993459
  .long 1072902963
  .long 858993459
  .long 1072902963

but GCC instead emits a separate constant per overload of f (cf.
https://godbolt.org/z/SDr7jG).

Finally, a quote from the Intel Optimization Reference Manual (version 040,
section 15.9.2):

In Skylake Server microarchitecture, a broadcast instruction with a memory
operand of 32 bits or above is executed on the load ports; it is not executed
on port 5 as other shuffles are. Alternative 2 in the following example shows
how executing the broadcast on the load ports reduces the workload on port 5
and increases performance. Alternative 3 shows how embedded broadcast benefits
from both executing the broadcast on the load ports and micro fusion.

Example 15-13. Broadcast Executed on Load Ports Alternatives

Alternative 1: 32-bit Load and Register Broadcast
loop:
vmovd xmm0, [rax]
vpbroadcastd zmm0, xmm0
vpaddd zmm2, zmm1, zmm0
vpermd zmm2, zmm3, zmm2
inc rax
sub rdx, 0x1
jnz loop

-> Baseline 1x

Alternative 2: Broadcast with a 32-bit Memory Operand
loop:
vpbroadcastd zmm0, [rax]
vpaddd zmm2, zmm1, zmm0
vpermd zmm2, zmm3, zmm2
inc rax
sub rdx, 0x1
jnz loop

-> Speedup: 1.57x

Alternative 3: 32-bit Embedded Broadcast
loop:
vpaddd zmm2, zmm1, [rax]{1to16}
vpermd zmm2, zmm3, zmm2
inc rax
sub rdx, 0x1
jnz loop

-> Speedup: 1.9x
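
For reference, a C++ rendering of the pattern behind those alternatives (my
own sketch, not from the manual; the manual's loop advances the pointer by one
byte per iteration, this one indexes whole dwords):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Whether this becomes a register broadcast, a vpbroadcastd from memory, or
// an embedded {1to16} broadcast fused into vpaddd is purely a code-generation
// decision.
void kernel(const std::int32_t *src, __m512i bias, __m512i perm,
            __m512i *dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        __m512i b = _mm512_set1_epi32(src[i]);        // broadcast one dword
        __m512i s = _mm512_add_epi32(bias, b);        // vpaddd
        dst[i] = _mm512_permutexvar_epi32(perm, s);   // vpermd
    }
}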
