[Bug rtl-optimization/82153] New: missed optimization: double rounding
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82153

            Bug ID: 82153
           Summary: missed optimization: double rounding
           Product: gcc
           Version: 7.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: arjan at linux dot intel.com
  Target Milestone: ---

#include <math.h>

int roundme(double A)
{
    return floor(A * 4.3);
}

leads to:

   0: f2 0f 59 05 00 00 00  mulsd   0x0(%rip),%xmm0   # 8
   7: 00
   8: 66 0f 3a 0b c0 09     roundsd $0x9,%xmm0,%xmm0
   e: f2 0f 2c c0           cvttsd2si %xmm0,%eax
  12: c3                    retq

both roundsd $0x9 and cvttsd2si truncate (floor) their argument, so gcc is
doing redundant work here; roundsd is +/- 8 cycles, which is not cheap.
[Bug rtl-optimization/82153] missed optimization: double rounding
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82153

--- Comment #2 from Arjan van de Ven ---

From the SDM description of cvttsd2si:

  When a conversion is inexact, a truncated result is returned. If a
  converted result is larger than the maximum signed doubleword integer,
  the floating-point invalid exception is raised, and if this exception
  is masked, the indefinite integer value (80000000H) is returned.

so, no exception on truncation...

   8: 66 0f 3a 0b c0 0b     roundsd $0xb,%xmm0,%xmm0
   e: f2 0f 2c c0           cvttsd2si %xmm0,%eax

is what is generated for trunc() (otherwise the same code as above)
[Bug rtl-optimization/82153] missed optimization: double rounding
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82153

--- Comment #4 from Arjan van de Ven ---

btw, gcc has no issue with just generating cvttsd2si:

int roundme2(double A)
{
    return A * 4.3;
}

generates

  20: f2 0f 59 05 00 00 00  mulsd   0x0(%rip),%xmm0   # 28
  27: 00
  28: f2 0f 2c c0           cvttsd2si %xmm0,%eax
  2c: c3                    retq

so maybe the real missed optimization is a trunc() followed by
assign-to-integer, as a general thing
[Bug target/71921] missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

Arjan van de Ven changed:

           What    |Removed |Added
           ------------------------------------------
           Version |6.1.1   |6.3.0

--- Comment #6 from Arjan van de Ven ---

Having poked at this a bunch, I now have a proposed patch (I know it still
needs work, but it's there to show the idea), a set of "evidence", and an
improved test case. I'll attach the raw files to this bug and then make a
summary post as well.
[Bug target/71921] missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921 --- Comment #7 from Arjan van de Ven --- Created attachment 40416 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40416&action=edit prototype patch
[Bug target/71921] missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921 --- Comment #8 from Arjan van de Ven --- Created attachment 40417 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40417&action=edit refined test case
[Bug target/71921] missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921 --- Comment #9 from Arjan van de Ven --- Created attachment 40418 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40418&action=edit Makefile
[Bug target/71921] missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921 --- Comment #10 from Arjan van de Ven --- Created attachment 40419 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40419&action=edit generated ASM with vectorization (no patch / no fast-math)
[Bug target/71921] missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921 --- Comment #11 from Arjan van de Ven --- Created attachment 40420 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40420&action=edit generated ASM with vectorization and fast-math (no patch)
[Bug target/71921] missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921 --- Comment #12 from Arjan van de Ven --- Created attachment 40421 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40421&action=edit generated ASM without vectorization (no patch / no fast-math)
[Bug target/71921] missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921 --- Comment #13 from Arjan van de Ven --- Created attachment 40422 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40422&action=edit generated ASM with vectorization (with patch / no fast-math)
[Bug target/71921] missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #16 from Arjan van de Ven ---

A comparable (but optimized to generate smaller asm) testcase is this:

#include <algorithm>

void RELU(float *buffer, int size)
{
    float *ptr = (float *) __builtin_assume_aligned(buffer, 64);
    int i;
    for (i = 0; i < (size * 8); i++) {
        float f = ptr[i];
        ptr[i] = std::max(float(0), f);
    }
}

Without vectorization this generates the following core asm on x86 (-mavx2):

        vmovss  (%rdi), %xmm0
        addq    $4, %rdi
        vmaxss  %xmm1, %xmm0, %xmm0
        vmovss  %xmm0, -4(%rdi)
        cmpq    %rax, %rdi
        jne     .L6

but with vectorization enabled one gets

        vmovaps (%rdi), %ymm1
        addl    $1, %eax
        addq    $32, %rdi
        vcmpltps %ymm1, %ymm2, %ymm0
        vandps  %ymm1, %ymm0, %ymm0
        vmovaps %ymm0, -32(%rdi)
        cmpl    %eax, %esi
        ja      .L4

In other words, the compiler trusts vmax[sp]s in the non-vector case, but
does not trust it in the vector case without -ffast-math. When adding
-ffast-math to the vectorized case one gets

        vmaxps  (%rdi), %ymm1, %ymm0
        addl    $1, %eax
        addq    $32, %rdi
        vmovaps %ymm0, -32(%rdi)
        cmpl    %eax, %esi
        ja      .L4

as the core loop, which is the expected outcome for this case.

I will make the argument that gcc is wrong not to trust vmaxps in the
vectorization case on x86, because it clearly trusts it in the non-vector
case. The attached patch makes this so, but does it for all architectures,
not just x86; I will seek help to turn this into a proper patch, but wanted
to put it here for now to keep track of it.
[Bug target/71921] missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921 --- Comment #17 from Arjan van de Ven --- (In reply to Andrew Pinski from comment #15) > Read https://gcc.gnu.org/ml/gcc-patches/2015-08/msg00693.html also. There > is much more to that thread than just in August IIRC. Some in September and > in October too. I understand the argument, but if that is true, wouldn't it be unsafe to use vmaxss as well, which gcc DOES generate?
[Bug target/71921] missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #19 from Arjan van de Ven ---

> GCC is not just about x86.

I know that, which is why I know my patch is not correct, but more of a
precise bug report... clearly this needs to be done in a way that does not
hurt other architectures.
[Bug tree-optimization/71921] New: missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

            Bug ID: 71921
           Summary: missed vectorization optimization
           Product: gcc
           Version: 6.1.1
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: arjan at linux dot intel.com
  Target Milestone: ---

The program below does not auto-vectorize to use the x86 "maxps" instruction,
even though gcc is smart enough to know there is a "maxss" instruction...

gcc -O3 -ftree-vectorize -fopt-info-vec -fopt-info-vec-missed -march=westmere test.cpp -S -o test.S

does not show any use of maxps in test.S

#include <algorithm>

void relu(float * __restrict__ output, const float * __restrict__ input, int size)
{
    int i;
    int s2;

    s2 = size / 4;
    for (i = 0; i < s2 * 4; i++) {
        float t;
        t = input[i];
        output[i] = std::max(t, float(0));
    }
}
[Bug libstdc++/71921] missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921 --- Comment #2 from Arjan van de Ven --- I tried with <= and it doesn't seem all too eager to be vectorized that way either; fast-math works either way
[Bug target/71921] missed vectorization optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921 --- Comment #5 from Arjan van de Ven --- I don't think that's completely true; it does use maxss (the non-vector one) for this code, so at least something thinks it's safe to use max, just likely that something runs after the vector phase?
[Bug libstdc++/101583] [12 Regression] error: use of deleted function when building gold
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101583

Arjan van de Ven changed:

           What    |Removed |Added
           ------------------------------------------
                CC |        |arjan at linux dot intel.com

--- Comment #6 from Arjan van de Ven ---

the original bug was backported to the stable 11 branch... should the fix be
as well?
[Bug target/101891] Adjust -fzero-call-used-regs to always use XOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101891

Arjan van de Ven changed:

           What    |Removed |Added
           ------------------------------------------
                CC |        |arjan at linux dot intel.com

--- Comment #7 from Arjan van de Ven ---

From a performance angle, the xor-only sequence is not so great at all;
modern CPUs are really good at eliminating movs, which is why the code was
originally added to do a combo of xor and mov. I can understand the
security-versus-performance tradeoff.

(the original tuning was done to basically make it entirely free, so that it
could just be always enabled)
[Bug target/101891] Adjust -fzero-call-used-regs to always use XOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101891

--- Comment #9 from Arjan van de Ven ---

I don't have recent measurements since we did this work quite some time ago.

Basically, at the CPU level (speaking for Intel-style CPUs at least), a CPU
can eliminate (meaning: no execution resources used) 1 to 3 (depending on
generation) register-to-register moves per clock cycle. There's ALSO a path
in the hardware for optimizing XOR zeroing sequences to avoid execution
resources... when we did both, we maximized the total number of these
eliminations, while with only XOR you can get bottlenecked on execution if
you have too many.

(all the movs should have no other instructions depending on them, so even
though they depend on the XOR, they're still fully 'orphan' for the
out-of-order engine)
[Bug target/101456] Unnecessary vzeroupper when upper bits of YMM registers already zero
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101456

Arjan van de Ven changed:

           What    |Removed |Added
           ------------------------------------------
                CC |        |arjan at linux dot intel.com

--- Comment #1 from Arjan van de Ven ---

Actually it's not that the upper bits are zero (they are), but that they're
in "init" state, since the vpxor wrote to xmm, not ymm.