[Bug rtl-optimization/82153] New: missed optimization: double rounding

2017-09-08 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82153

Bug ID: 82153
   Summary: missed optimization: double rounding
   Product: gcc
   Version: 7.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: arjan at linux dot intel.com
  Target Milestone: ---

#include <math.h>

int roundme(double A)
{
return floor(A * 4.3);
}

leads to

 <roundme>:
   0:   f2 0f 59 05 00 00 00    mulsd  0x0(%rip),%xmm0        # 8
   7:   00
   8:   66 0f 3a 0b c0 09       roundsd $0x9,%xmm0,%xmm0
   e:   f2 0f 2c c0             cvttsd2si %xmm0,%eax
  12:   c3                      retq

both roundsd $0x9 and cvttsd2si truncate (floor) their argument, so gcc is
doing redundant work here; roundsd costs roughly 8 cycles, which is not cheap.

[Bug rtl-optimization/82153] missed optimization: double rounding

2017-09-08 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82153

--- Comment #2 from Arjan van de Ven  ---
When a conversion is inexact, a truncated result is returned. If a converted
result is larger than the maximum signed doubleword integer, the floating-point
invalid exception is raised, and if this exception is masked, the indefinite
integer value (80000000H) is returned.

no exception on truncation...

   8:   66 0f 3a 0b c0 0b   roundsd $0xb,%xmm0,%xmm0
   e:   f2 0f 2c c0 cvttsd2si %xmm0,%eax

is what is generated for trunc() (same code otherwise as above)

[Bug rtl-optimization/82153] missed optimization: double rounding

2017-09-09 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82153

--- Comment #4 from Arjan van de Ven  ---
btw gcc has no issue with just generating cvttsd2si


int roundme2(double A)
{
return A * 4.3;
}

generates

  20:   f2 0f 59 05 00 00 00    mulsd  0x0(%rip),%xmm0        # 28
  27:   00
  28:   f2 0f 2c c0             cvttsd2si %xmm0,%eax
  2c:   c3                      retq


so maybe the real missed optimization is a trunc() followed by
assign-to-integer as a general thing
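
A minimal sketch of the three source-level variants discussed above, to make the difference concrete. The helper names are hypothetical, and the multiplication by 4.3 is left out so that the inputs are exactly representable. For values that fit in an int, a double-to-int conversion truncates toward zero, so trunc() followed by the conversion gives the same result as the conversion alone. floor() gives a different result for negative non-integers, which is why the trunc() case looks like the cleaner candidate for this optimization.

```cpp
#include <cmath>

// Hypothetical helper names, for illustration only.

// Mirrors roundme() from the report: floor, then convert to int.
int via_floor(double a) { return static_cast<int>(std::floor(a)); }

// The trunc() variant from comment #2: truncate, then convert to int.
int via_trunc(double a) { return static_cast<int>(std::trunc(a)); }

// Mirrors roundme2(): plain conversion, which truncates toward zero.
int via_cast(double a) { return static_cast<int>(a); }
```

For an input of -2.5, via_trunc() and via_cast() both return -2 while via_floor() returns -3; for 2.5 all three return 2.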

[Bug target/71921] missed vectorization optimization

2016-12-27 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

Arjan van de Ven  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|6.1.1                       |6.3.0

--- Comment #6 from Arjan van de Ven  ---
Having poked at this a bunch, I now have a proposed patch (well I know it needs
work, but it's there to show the idea) and a set of "evidence" and an improved
test case. I'll be attaching the raw files to this bug and then make a summary
post as well.

[Bug target/71921] missed vectorization optimization

2016-12-27 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #7 from Arjan van de Ven  ---
Created attachment 40416
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40416&action=edit
prototype patch

[Bug target/71921] missed vectorization optimization

2016-12-27 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #8 from Arjan van de Ven  ---
Created attachment 40417
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40417&action=edit
refined test case

[Bug target/71921] missed vectorization optimization

2016-12-27 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #9 from Arjan van de Ven  ---
Created attachment 40418
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40418&action=edit
Makefile

[Bug target/71921] missed vectorization optimization

2016-12-27 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #10 from Arjan van de Ven  ---
Created attachment 40419
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40419&action=edit
generated ASM with vectorization (no patch / no fast-math)

[Bug target/71921] missed vectorization optimization

2016-12-27 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #11 from Arjan van de Ven  ---
Created attachment 40420
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40420&action=edit
generated ASM with vectorization and fast-math  (no patch)

[Bug target/71921] missed vectorization optimization

2016-12-27 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #12 from Arjan van de Ven  ---
Created attachment 40421
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40421&action=edit
generated ASM without vectorization (no patch / no fast-math)

[Bug target/71921] missed vectorization optimization

2016-12-27 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #13 from Arjan van de Ven  ---
Created attachment 40422
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40422&action=edit
generated ASM with vectorization (with patch / no fast-math)

[Bug target/71921] missed vectorization optimization

2016-12-27 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #16 from Arjan van de Ven  ---
A comparable (but optimized to generate smaller asm) testcase is this:

#include <algorithm>
void RELU(float *buffer, int size)
{
    float *ptr = (float *) __builtin_assume_aligned(buffer, 64);
    int i;
    for (i = 0; i < (size * 8); i++) {
        float f = ptr[i];
        ptr[i] = std::max(float(0), f);
    }
}


this will generate, without vectorization on x86 the following core asm
(-mavx2):

vmovss  (%rdi), %xmm0
addq    $4, %rdi
vmaxss  %xmm1, %xmm0, %xmm0
vmovss  %xmm0, -4(%rdi)
cmpq    %rax, %rdi
jne     .L6

but with vectorization enabled one gets

vmovaps  (%rdi), %ymm1
addl     $1, %eax
addq     $32, %rdi
vcmpltps %ymm1, %ymm2, %ymm0
vandps   %ymm1, %ymm0, %ymm0
vmovaps  %ymm0, -32(%rdi)
cmpl     %eax, %esi
ja       .L4

or in other words, the compiler trusts vmax[sp]s for the non-vector case, but
does not trust it with the vector case without -ffast-math.

when adding -ffast-math to the vectorized case one gets

vmaxps  (%rdi), %ymm1, %ymm0
addl    $1, %eax
addq    $32, %rdi
vmovaps %ymm0, -32(%rdi)
cmpl    %eax, %esi
ja      .L4

as the core loop, which is the expected outcome for this case.

I will make the argument that gcc is wrong to not trust vmaxps in the
vectorization case on x86, because it clearly trusts it in the non-vector case.
The attached patch will make this so, but does it for all architectures not
just x86; I will seek help to turn this into a proper patch, but wanted to put
it here for now to keep track of it.

[Bug target/71921] missed vectorization optimization

2016-12-27 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #17 from Arjan van de Ven  ---
(In reply to Andrew Pinski from comment #15)
> Read https://gcc.gnu.org/ml/gcc-patches/2015-08/msg00693.html also.  There
> is much more to that thread than just in August IIRC.  Some in September and
> in October too.

I understand the argument, but if that is true, wouldn't it be unsafe to use
vmaxss as well, which gcc DOES generate?

[Bug target/71921] missed vectorization optimization

2016-12-27 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #19 from Arjan van de Ven  ---
>  GCC is not just about x86.

I know that, which is why I know my patch is not correct, but more of a precise
bug report... clearly this needs to be done in a way that does not hurt other
architectures.

[Bug tree-optimization/71921] New: missed vectorization optimization

2016-07-18 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

Bug ID: 71921
   Summary: missed vectorization optimization
   Product: gcc
   Version: 6.1.1
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: arjan at linux dot intel.com
  Target Milestone: ---

program below does not auto-vectorize to use the x86 "maxps" instruction even
though gcc is smart enough to know there is a "maxss" instruction...


gcc -O3 -ftree-vectorize -fopt-info-vec -fopt-info-vec-missed -march=westmere
test.cpp -S -o test.S

does not show any use of maxps in test.S



#include <algorithm>

void relu(float * __restrict__ output, const float * __restrict__ input, int size)
{
    int i;
    int s2;

    s2 = size / 4;
    for (i = 0; i < s2 * 4; i++) {
        float t;
        t = input[i];
        output[i] = std::max(t, float(0));
    }
}

[Bug libstdc++/71921] missed vectorization optimization

2016-07-18 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #2 from Arjan van de Ven  ---
I tried with <= and it doesn't seem all too eager to be vectorized that way
either; fast-math works either way

[Bug target/71921] missed vectorization optimization

2016-07-19 Thread arjan at linux dot intel.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #5 from Arjan van de Ven  ---
I don't think that's completely true; it does use maxss (the non-vector one)
for this code, so at least something thinks it's safe to use max, just likely
that something is after the vector phase?

[Bug libstdc++/101583] [12 Regression] error: use of deleted function when building gold

2021-10-13 Thread arjan at linux dot intel.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101583

Arjan van de Ven  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |arjan at linux dot intel.com

--- Comment #6 from Arjan van de Ven  ---
the original bug was backported to the stable 11 branch...
.. should the fix be as well ?

[Bug target/101891] Adjust -fzero-call-used-regs to always use XOR

2022-05-24 Thread arjan at linux dot intel.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101891

Arjan van de Ven  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |arjan at linux dot intel.com

--- Comment #7 from Arjan van de Ven  ---
from a performance angle, the xor-only sequence is not so great at all; modern
CPUs are really good at eliminating mov's so that's why the code originally was
added to do a combo of xor and mov..

I can understand the security versus performance tradeoff.
(the original tuning was done to basically make it entirely free, so that it
could just be always enabled)

[Bug target/101891] Adjust -fzero-call-used-regs to always use XOR

2022-05-24 Thread arjan at linux dot intel.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101891

--- Comment #9 from Arjan van de Ven  ---
I don't have recent measurements since we did this work quite some time ago.

basically on the CPU level (speaking for Intel style cpus at least), a CPU can
eliminate (meaning: no execution resources used) 1 to 3 (depending on
generation) register-to-register moves per clock cycle. There's ALSO a path in the
hardware for optimizing XOR  sequences to avoid execution
resources... when we did both we maximized the total number of these
eliminations...
while with only XOR you can get bottlenecked on execution if you have too many.
(all the mov's should have no other instructions depending on them, so even
though they depend on the XOR, they're still fully 'orphan' for the out of
order engine)

[Bug target/101456] Unnecessary vzeroupper when upper bits of YMM registers already zero

2021-07-14 Thread arjan at linux dot intel.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101456

Arjan van de Ven  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |arjan at linux dot intel.com

--- Comment #1 from Arjan van de Ven  ---
Actually it's not that they're zero (they are) but they're in "init" state
since the vpxor wrote to xmm not ymm