[Bug inline-asm/87733] local register variable not honored with earlyclobber

2020-03-14 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87733

--- Comment #14 from Alexander Monakov  ---
Just to clarify: the two testcases added in the quoted commit don't try to
catch the issue discussed here, namely that the operand is passed in the wrong
register.

[Bug inline-asm/87733] local register variable not honored with earlyclobber

2020-03-14 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87733

--- Comment #21 from Alexander Monakov  ---
> I could guess the compiler might ignore your inputs/outputs that you specify 
> if you don't have any % usages for them.

Are you seriously suggesting that the examples in the GCC manual are invalid,
and that every such usage out there should go and mention the referenced
registers in a comment inside the inline asm template?

https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html
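
For context, the idiom from that page looks roughly like this (a minimal
sketch; the register name, function and template text are illustrative, x86-64
assumed):

void use_fixed_reg(long a)
{
    /* The constraint is just "r"; the local register variable alone is what
       obliges the compiler to pass the operand in %r10.  Nothing in the
       template text mentions the register by name. */
    register long r10 asm("r10") = a;
    asm volatile("# %0 is expected in r10" : : "r"(r10));
}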

[Bug rtl-optimization/94728] [haifa-sched][restore_pattern] recalculate INSN_TICK for the dependence type of REG_DEP_CONTROL

2020-04-23 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94728

Alexander Monakov  changed:

   What|Removed |Added

 CC||abel at gcc dot gnu.org
 Resolution|--- |INVALID
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from Alexander Monakov  ---
At a high level the analysis makes sense to me, but as this concerns
predication in the Haifa scheduler it is not really my domain :)  The bug
report is also missing a testcase and information about the target.

I see the reporter has just sent an email to the gcc@ mailing list, so I'm
closing the report: https://gcc.gnu.org/pipermail/gcc/2020-April/232192.html

[Bug bootstrap/91972] Bootstrap should use -Wmissing-declarations

2020-05-05 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91972

--- Comment #1 from Alexander Monakov  ---
Another reason to have -Wmissing-declarations is that otherwise mismatches in
unused functions are not caught until it's too late (a mismatching definition
is assumed to be an overload of the function declared in the header file).

For a recent example, see
https://gcc.gnu.org/pipermail/gcc-patches/2020-May/545129.html which was
necessary after a mismatch introduced in
https://gcc.gnu.org/pipermail/gcc-patches/2020-May/545114.html

[Bug bootstrap/91972] Bootstrap should use -Wmissing-declarations

2020-05-05 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91972

--- Comment #4 from Alexander Monakov  ---
> Why is it missing the static keyword then? (Or alternatively, why isn't it in 
> an anonymous namespace?)

Huh? Without the warning developers may simply forget to put the 'static'
keyword. With the warning they would be reminded when bootstrapping the patch.


> Ah, I like the namespace thing for target hooks (possibly langhooks as well).

Sure, it's nice to have sensible namespace rules for future additions, but
hopefully that's not a reason/excuse to never re-enable the warning.

[Bug c++/95103] Unexpected -Wclobbered in bits/vector.tcc with -O2

2020-05-14 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95103

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
Richard's explanation in comment #1 is correct. The compiler assumes any
external call in the destructor can transfer control back to setjmp.

In principle in this case the warning is avoidable by observing that jmp_buf is
local and does not escape, but for any other returns_twice function the problem
would remain, as there's no jmp_buf-like key to track (think vfork).

(iow: solving this would need special-casing warning code for setjmp, which
currently works the same for all functions with the returns_twice attribute)
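
To illustrate the shape of the problem, a minimal hypothetical sketch (not the
PR's testcase; names are made up):

#include <setjmp.h>

void ext(void);  /* opaque external call, like the destructor's callees */

int f(void)
{
    jmp_buf env;  /* local and never escapes, so no longjmp is possible */
    int x = 0;
    if (setjmp(env) == 0) {
        x = 1;
        ext();    /* ...yet the compiler assumes this may return to setjmp, */
    }
    return x;     /* so it may warn that 'x' can be clobbered */
}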

Let's close this?

[Bug rtl-optimization/95123] [10/11 Regression] Wrong code w/ -O2 -fselective-scheduling2 -funroll-loops --param early-inlining-insns=5 --param loop-invariant-max-bbs-in-loop=3 --param max-jump-thread

2020-05-14 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95123

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
This is probably due to sel-sched, and very sensitive to compiler revision: I
tried checking with a 20200511 build (one day difference) on Compiler
Explorer, and could not reproduce the miscompilation.

If you still have the compiler binary, you can help out by testing with
sel-sched debug counters: if you append -fdbg-cnt=sel_sched_insn_cnt:0 to the
"bad" command line, it should work again (as sel-sched will not move anything),
with -fdbg-cnt=sel_sched_insn_cnt:9 it should fail. We use this for
isolating a problematic transformation (by bisecting on the counter value).

(the other sel-sched debug counters are sel_sched_cnt and
sel_sched_region_cnt, but they are more coarse-grained, working per pass and
per region respectively, rather than per insn)

[Bug c++/95103] Unexpected -Wclobbered in bits/vector.tcc with -O2

2020-05-14 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95103

--- Comment #5 from Alexander Monakov  ---
No, this analogy does not work. setjmp both sets up a buffer and receives
control, so it corresponds to both try and catch together. A matching "C++"
code would look like:

> void f3() {
>     std::vector<int> v;
>     for (int i = 0; i != 2; ++i) {
>         if (!f2("xx")) f1();
>         v.push_back(0);
>     }
>     try {
>     } catch (...) {
>     }
> }

where it's evident that v does not leave scope and its destructor cannot be
reached.

(comment #1 and #3 still stand)

[Bug rtl-optimization/95123] [10/11 Regression] Wrong code w/ -O2 -fselective-scheduling2 -funroll-loops --param early-inlining-insns=5 --param loop-invariant-max-bbs-in-loop=3 --param max-jump-thread

2020-05-14 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95123

--- Comment #6 from Alexander Monakov  ---
Oh, you're probably configuring your compiler with --enable-default-pie.
Please paste the entire output of 'gcc -v'. I can reproduce the miscompilation
if I pass -fpie -pie.

[Bug c/95379] Don't warn about the universal zero initializer for a structure with the 'designated_init' attribute.

2020-05-28 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95379

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
> does anyone know if it's part of C too?

{ } is valid C++, invalid C; GCC accepts it in C as an extension, and warns
with -pedantic.
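
For illustration (plain C, with a hypothetical struct):

struct point { int x, y; };

struct point a = { };   /* valid C++; in C a GNU extension, -pedantic warns */
struct point b = { 0 }; /* the portable universal zero initializer in C */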

I think this enhancement request is reasonable.

[Bug target/95435] bad builtin memcpy performance with znver1/znver2 and 32bit

2020-05-30 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95435

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
Ugh. Stringop tuning for Ryzens is terribly anachronistic: all AMD processors
since K8 (!!) use the exact same tables, and 32-bit memset/memcpy don't use a
libcall for large sizes:

static stringop_algs znver2_memcpy[2] = {
  {libcall, {{6, loop, false}, {14, unrolled_loop, false},
             {-1, rep_prefix_4_byte, false}}},
  {libcall, {{16, loop, false}, {64, rep_prefix_4_byte, false},
             {-1, libcall, false}}}};

(first subarray is 32-bit tuning, the second is for 64-bit)

Using the test_stringop microbenchmark from PR43052 it's easy to see that the
library memset/memcpy are fastest at sizes 256 and above. Below that the
microbenchmark results may be debatable; I think we should prefer the libcall
almost always, except for the tiniest sizes, for I-cache locality reasons.

In any case, the current tuning is completely inappropriate.

[Bug target/95435] bad builtin memcpy performance with znver1/znver2 and 32bit

2020-06-01 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95435

--- Comment #8 from Alexander Monakov  ---
There's no tuning tables for memcmp at all, existing structs cover only memset
and memcpy. So as far as I see retuning memset/memcpy doesn't need to wait for
[1], because there's no infrastructure in place for memcmp tuning, and adding
that can be done independently. Updating Ryzen tables would not touch any code
updated by H.J.Lu's patchset at all.

[Bug ipa/95558] Invalid IPA optimizations based on weak definition

2020-06-06 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95558

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org,
   ||marxin at gcc dot gnu.org
  Component|middle-end  |ipa
   Keywords||wrong-code

--- Comment #1 from Alexander Monakov  ---
All functions are incorrectly discovered to be pure, and then the loop that
only makes calls to non-weak pure functions is eliminated.

Minimal testcase for the root issue, wrong warning with -O2
-Wsuggest-attribute=pure:

static void dummy(){}

void weak() __attribute__((weak,alias("dummy")));

int foo()
{
    weak();
    return 0;
}

[Bug other/92396] -ftime-trace support

2020-07-28 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92396

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov  ---
Raw data from timevars is not suitable for making a useful-for-users
-ftime-trace report. The point of -ftime-trace is to present the person using
the compiler with a breakdown at the level of their source files, functions,
template instantiations, i.e. something they understand and can change. There
is no need to show users any sort of breakdown by individual GIMPLE/RTL
passes: as far as they are concerned it's one complex "code generation" phase
they cannot substantially change.

The original blog post by Aras Pranckevičius explains this well, contrasting
against GCC's and LLVM's -ftime-report:
https://aras-p.info/blog/2019/01/12/Investigating-compile-times-and-Clang-ftime-report/
(and part 2:
https://aras-p.info/blog/2019/01/16/time-trace-timeline-flame-chart-profiler-for-Clang/
).

GCC simply doesn't measure time on the relevant "axes": we don't split
preprocessing time by included files, nor do we split template instantiation
time in the C++ frontend by template.

[Bug c/96420] -Wsign-extensions warnings are generated from system header macros

2020-08-02 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96420

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Minimized standalone testcase:

# 1 "foo.c" 1
# 1 "foo.h" 1
# 1 "foo.h" 3
#define C(x) (0u+(x))
# 2 "foo.c" 2

unsigned f(int x)
{
    return C(x);
}

[Bug tree-optimization/96633] missed optimization?

2020-08-17 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96633

--- Comment #2 from Alexander Monakov  ---
Martin added me to CC so I assume he wants me to chime in.

First of all, I find Nathan's behavior in that gcc@ thread distasteful at best
(but if you ask me, such responses are simply more harm than good; link:
https://lwn.net/ml/gcc/1a363f89-6f98-f583-e22a-a7fc02efb...@acm.org/ ).

Next, statements like "I've determined the following is about 12% faster"
don't carry weight without details such as the CPU family and the structure of
the benchmark and the workload. Obviously, on input that lacks whitespace
GCC's original code is faster, as the initial branch is 100% predictable.
Likewise, if the input was taken from /dev/random, the 12% figure is
irrelevant to real-world uses of such code. What the benchmark does with the
return value of the function also matters a lot.

With that out of the way: striving to get efficient branchless code on this
code is not very valuable in practice, because the caller is likely to perform
a conditional branch on the result anyway. So making isWhitespace branchless
simply moves the misprediction cost to the caller, making the overall code
slower.

(but of course such considerations are too complex for the compiler's limited
brain)

In general such "bitmask tests" will benefit from the BT instruction on x86
(not an extension, it has been in the ISA since before I was born), plus CMOV
to get the right mask if the value doesn't fit in a register.

For 100% branchless code we want to generate code similar to:

char is_ws(char c)
{
    unsigned long long mask = 1ll<<' ' | 1<<'\t' | 1<<'\r' | 1<<'\n';
    unsigned long long v = c;
    if (v > 32)
#if 1
        mask = 0;
#else
        return 0;
#endif
    char r;
    asm("bt %1, %2; setc %0" : "=r"(r) : "r"(v), "r"(mask));
    return r;
}

        movsbq  %dil, %rax
        movl    $0, %edx
        movabsq $4294977024, %rdi
        cmpq    $33, %rax
        cmovnb  %rdx, %rdi
        bt      %rax, %rdi; setc %al
        ret

(note the %edx zeroing is suboptimal; xor %edx, %edx should have been used)

This is generalizable to any input type, not just char.

We even already get the "test against a mask" part of the idea right ;)

Branchy testing is even cheaper with BT:

void is_ws_cb(unsigned char c, void f(void))
{
    unsigned long long mask = 1ll<<' ' | 1<<'\t' | 1<<'\r' | 1<<'\n';
    if (c <= 32 && (mask & (1ll << c)))
        f();
}

[Bug tree-optimization/96672] Missing -Wclobbered diagnostic, or: __attribute__((returns_twice)) does not inhibit constant folding across call site

2020-08-18 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96672

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Looking at dumps, after expanding to RTL we do not have the abnormal edge from
the longjmp BB. So while on GIMPLE we preserve modifications of 'x', on RTL we
see the 'x = 6' write as dead, and the 'x = 5' write is propagated to the use.

(the -Wclobbered warning happens after all the propagation is done)

I am surprised the abnormal dispatcher block is not preserved on RTL.

[Bug middle-end/95189] [9/10 Regression] memcmp being wrongly stripped like strcmp

2020-09-03 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95189

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #15 from Alexander Monakov  ---
Is the patch eligible for backporting?

Users are hitting this as shown by dups and questions elsewhere like
https://stackoverflow.com/questions/63724679/wrong-gcc-9-and-higher-optimization-of-memcmp-with-fno-inline

[Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses

2020-09-07 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
You raise valid points (i.e. it would be good to understand why preallocation
is not beneficial, or what's causing the performance gap w.r.t. malloc), but
looking at the cache-misses counter does not make sense here (perf is not
explicit about this, but it counts misses in L3, and as you can see the count
is three orders of magnitude lower than the cycle and instruction counts, so
it's not the main factor in the overall performance picture).

As for the comparison against Rust, it spreads more work over the available
cores: you can see that its "user time" is higher, though "wall-clock time" is
the same or lower. In other words, the C++ variant does not achieve good
multicore scaling.

The main gotcha here is that m_b_r does not allocate on construction, but
rather allocates 2x the preallocation size on the first call to 'allocate',
and then deallocates when 'release' is called. So it repeatedly calls
malloc/free in the inner benchmark loop, whereas your custom allocator
allocates on construction and deallocates on destruction, avoiding repeated
malloc/free calls in the loop and the associated lock contention when
multithreaded.

(also obviously it simply does more work in 'allocate', which costs extra
cycles)

[Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses

2020-09-08 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #9 from Alexander Monakov  ---
The most pronounced difference for depth=18 seems to be caused by m_b_r
over-allocating by 2x: internally it mallocs 2x of the size given to the
constructor, and then Linux pre-faults those extra pages, penalizing the
benchmark.

Dividing estimated size by 2 to counter the over-allocation effect:

MemoryPool store (poolSize(stretch_depth) / 2);

substantially improves the benchmark for me.

I think the rest of the slowdown can be attributed to m_b_r simply doing more
work internally compared to your bare-bones malloc allocator (I'm seeing less
pronounced differences though, I'm testing on a Sandybridge CPU with -O2).

[Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses

2020-09-08 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #14 from Alexander Monakov  ---
> It adds 11 bytes to the size given to the constructor (for its internal
> bookkeeping) and then rounds up to a power of two.

What is the purpose of this rounding up?

[Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses

2020-09-08 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #18 from Alexander Monakov  ---
Huh? malloc is capable of splitting the tail of the last page for reuse in
subsequent small allocations, why not let it do it? It will not be "wasted".

[Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses

2020-09-08 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #20 from Alexander Monakov  ---
Round up to 64 bytes (typical cache line size).

[Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses

2020-09-08 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

--- Comment #23 from Alexander Monakov  ---
Are you benchmarking with bt_pmr_0thrd (attached in comment #3) with depth=18?
On earlier tests there are other effects in play too.

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-21 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
Richard, though register moves are resolved by renaming, they still occupy a
uop in all stages except execution, and since renaming is one of the narrowest
points in the pipeline (only up to 4 uops/cycle on Intel), reducing number of
uops generally helps.

In Michael's example the actual memory address has two operands:

<       vmovapd %ymm1, %ymm10
<       vmovapd %ymm1, %ymm11
<       vfnmadd213pd    (%rdx,%rax), %ymm9, %ymm10
<       vfnmadd213pd    (%rcx,%rax), %ymm7, %ymm11
---
>       vmovupd (%rdx,%rax), %ymm10
>       vmovupd (%rcx,%rax), %ymm11
>       vfnmadd231pd    %ymm1, %ymm9, %ymm10
>       vfnmadd231pd    %ymm1, %ymm7, %ymm11

The "uop" that carries operands of vfnmadd213pd gets "unlaminated" before
renaming (because otherwise there would be too many operands to handle). Hence
the original code has 4 uops after decoding, 6 uops before renaming, and the
transformed code has 4 uops before renaming. Execution handles 4 uops in both
cases.

FMA unlamination is mentioned in
https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes

Michael, you can probably measure it for yourself with

   perf stat -e cycles,instructions,uops_retired.all,uops_retired.retire_slots

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-21 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #4 from Alexander Monakov  ---
> More so, gcc variant occupies 2 reservation station entries (2 fused uOps) vs
> 4 entries by de-transformed sequence.

I don't think this is true for the test at hand. With a base+offset memory
operand the renaming stage already sees two separate uops for each fma, so the
reservation station etc. should also see two for each fma, 4 uops in total.
And they will not be fused.

It would be true if memory operands required just one register (and then
pressure on renaming stage would be the same for both variants).


> For me it's enough to know that it *is* slower.

Understood, but I hope GCC developers want to understand the nature of the
slowdown before attempting to fix it.

[Bug inline-asm/92151] Spurious register copying

2019-10-21 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92151

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
>> (or write actual assembly rather than using inline-asm).
> In this case, yes -- I now declare the function "naked" and avoid the issue.

I think this solution (hand-writing asm for the entire function) is generally
undesirable because you become responsible for the ABI/calling convention, the
compiler won't help you with things like properly restoring callee-saved
registers, and violations may stay unnoticed as long as callers don't happen
to use a particular callee-saved reg.

Here's a manually reduced variant that exhibits a similar issue at -O1:

void foo(int num, int c) {
    asm("# %0" : "+r"(num));
    while (--c)
        asm goto("# %0" :: "r"(num) :: l2);
l2:
    asm("# %0" :: "r"(num));
}

The main issue seems to be our 'asmcons' pass transforming RTL in such a way
that REG_DEAD notes are "behind" the actual death, so if the RA takes them
literally it operates on wrong (too conservative) lifetime information; e.g.,
for the first asm, just before IRA we have:

(insn 29 4 8 2 (set (reg:SI 84 [ num ])
        (reg:SI 85)) "./example.c":3:5 -1
     (nil))
(insn 8 29 7 2 (parallel [
            (set (reg:SI 84 [ num ])
                (asm_operands:SI ("# %0") ("=r") 0 [
                        (reg:SI 84 [ num ])
                    ]
                     [
                        (asm_input:SI ("0") ./example.c:3)
                    ]
                     [] ./example.c:3))
            (clobber (reg:CC 17 flags))
        ]) "./example.c":3:5 -1
     (expr_list:REG_DEAD (reg:SI 85)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))

but register 85 actually dies in insn 29, not in insn 8.

[Bug middle-end/92250] valgrind: ira_traverse_loop_tree – Conditional jump or move depends on uninitialised value

2019-10-28 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92250

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Be sure to enable Valgrind annotations (configure with
--enable-valgrind-annotations), otherwise false positives on sparseset
functions are expected: sparse set algorithm accesses uninitialized memory by
design (an explanation is available at e.g. https://research.swtch.com/sparse
).
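
For the curious, the essence of the technique, following the linked write-up
(a sketch; names are illustrative):

/* dense[0..n) lists the members; sparse[v] holds v's position in dense.
   sparse[] is deliberately left uninitialized: a garbage value is harmless
   because the membership test validates it against dense[]. */
struct sparse_set { unsigned *sparse, *dense, n; };

static int sset_contains(const struct sparse_set *s, unsigned v)
{
    unsigned i = s->sparse[v];            /* possibly uninitialized read */
    return i < s->n && s->dense[i] == v;  /* garbage fails this check    */
}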

[Bug rtl-optimization/87047] [7/8/9 Regression] performance regression because of if-conversion

2019-11-05 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87047

--- Comment #16 from Alexander Monakov  ---
I'd like to backport this to gcc-9 branch and then close this bug (Richi
already indicated that further backports are not desirable). Thoughts?

[Bug rtl-optimization/87047] [7/8/9 Regression] performance regression because of if-conversion

2019-11-06 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87047

Alexander Monakov  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #19 from Alexander Monakov  ---
Nothing left to do then, closing.

[Bug tree-optimization/92283] [10 Regression] 454.calculix miscomparison since r276645 with -O2 -march=znver2

2019-11-08 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92283

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #17 from Alexander Monakov  ---
(In reply to Richard Biener from comment #16)
> interestingly 66:66 and 67:67 generate exactly the same code and
> 66:67 add a single loop.  That's totally odd but probably an
> artifact of a bug in dbg_cnt_is_enabled which does
> 
> bool
> dbg_cnt_is_enabled (enum debug_counter index)
> {
>   unsigned v = count[index];
>   return v > limit_low[index] && v <= limit_high[index];
> }
> 
> where it should be v >= limit_low[index].

This is intentional: the idea is that a:b makes a half-open interval with the
right bound (b) not included. So 66:66 and 67:67 are both simply empty
intervals.

dbg_cnt_is_enabled tests the left bound with '>' and the right bound with '<='
because its caller (dbg_cnt) incremented the counter before the call.
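
In other words, the check behaves like this sketch (not the actual sources):

static unsigned count;  /* the per-counter value */

static int dbg_cnt(unsigned lo, unsigned hi)
{
    unsigned v = ++count;      /* incremented before the test...         */
    return v > lo && v <= hi;  /* ...so this is [lo, hi) in terms of the
                                  pre-increment counter value            */
}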

[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out

2019-11-12 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #8 from Alexander Monakov  ---
The full preprocessed source is provided and it clearly says

typedef unsigned char uint8_t;

in line 10, so it is in fact a character type.

It's suspicious that cmpxchg_using_helper does not return a value (an
incorrectly reduced testcase?) and there's still an aliasing violation when
atomic_cmpxchg_func tries to cast 'dest' from uint8_t* to int*. I think the
report was closed prematurely.

Aleksei - always provide the output of 'gcc -v' when reporting such bugs;
otherwise people may be unable to reproduce them when there really is a
problem (there's no way to tell how your compiler was configured or even its
exact version).

[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out

2019-11-12 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462

--- Comment #10 from Alexander Monakov  ---
> atomic_cmpxchg_func tries to cast 'dest' from uint8_t* to int*

I made a typo here, I meant uint32_t rather than uint8_t, and there's no
aliasing violation here as signedness difference is explicitly OK.

It doesn't matter if the function in user code is named cmpxchg or dsjfhg;
whether gcc can emit a more efficient bytewise CAS is irrelevant when the user
complains that PRE is miscompiling their code.

uint8_t is obviously a character type in this particular testcase (as well as,
fwiw, on all Glibc targets).

OTOH, that cmpxchg_using_helper does not return a value is a serious problem:
that is undefined behavior in C++. You'll need to submit a valid testcase
without that issue.

[Bug rtl-optimization/91161] [9/10 Regression] ICE in begin_move_insn, at sched-ebb.c:175

2019-11-20 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91161

--- Comment #3 from Alexander Monakov  ---
With -fno-dce, a NOTE_INSN_DELETED_LABEL appears between the last "real" insn
in the basic block (a sibcall) and a barrier rtx:

(call_insn/u/c 20 19 12 3 (call (mem:QI (symbol_ref:DI ("ni") [flags 0x3]) [0 ni S1 A8])
        (const_int 0 [0])) "pr91161.c":23:7 679 {*call}
     (expr_list:REG_DEAD (reg:DI 5 di)
        (expr_list:REG_DEAD (reg:QI 0 ax)
            (expr_list:REG_CALL_DECL (symbol_ref:DI ("ni") [flags 0x3])
                (expr_list:REG_ARGS_SIZE (const_int 0 [0])
                    (expr_list:REG_NORETURN (const_int 0 [0])
                        (expr_list:REG_EH_REGION (const_int 0 [0])
                            (nil))))))
    (expr_list (use (reg:QI 0 ax))
        (expr_list:DI (use (reg:DI 5 di))
            (nil))))
(note 12 20 21 ("x6") NOTE_INSN_DELETED_LABEL 5)
(barrier 21 12 22)


Is this valid? I assume NOTE_INSN_DELETED can appear in that position as well?
If so, shouldn't begin_move_insn use next_nonnote_insn rather than plain
NEXT_INSN to find either the barrier or the label of the next bb?

[Bug c++/92597] std::fma gives nan using -march=sandybridge+ with asm volatile

2019-11-20 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92597

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Testcase isolating what appears to be the biggest issue in the original:

double f()
{
    double d = -1;
    asm("" : "+m,r"(d));
    return d;
}

long double g()
{
    long double d = -1;
    asm("" : "+m,r"(d));
    return d;
}

[Bug c++/92572] Vague linkage does not work reliably when a matching segment is in a dynamically linked libarary on Linux

2019-11-22 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92572

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
Please show output of 'cc -v' and attach assembly for main.cc.

Providing vague-linkage semantics with dynamic linking is a tricky area,
especially when dlopen is in play, and more so with RTLD_LOCAL as in this
example. For example, if you wanted vague-linkage objects to be unified across
multiple dlopen'ed libraries (each with RTLD_LOCAL), you'd need special
support from the toolchain and the dynamic linker. At some point the GNU
toolchain introduced a new special ELF symbol binding type, STB_GNU_UNIQUE,
but it turned out to cause other issues. It can be disabled when the compiler
is configured with --disable-gnu-unique-object, in which case the outcome you
show here is expected.

I think on non-GNU systems you'll likely get "1" rather than "2".

[Bug tree-optimization/92645] Hand written vector code is 450 times slower when compiled with GCC compared to Clang

2019-11-25 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92645

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #8 from Alexander Monakov  ---
(In reply to Richard Biener from comment #5)
> 
> "extracting" the actual loops (inlined and all) in intrinsic form as a C
> testcase would be really really nice.

Something like the following?  Enjoy!

typedef unsigned int u32v4 __attribute__((vector_size(16)));
typedef unsigned short u16v16 __attribute__((vector_size(32)));
typedef unsigned char u8v16 __attribute__((vector_size(16)));

union vec128 {
  u8v16 u8;
  u32v4 u32;
};

#define memcpy __builtin_memcpy

u16v16 zxt(u8v16 x)
{
  return (u16v16) {
x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7],
x[8], x[9], x[10], x[11], x[12], x[13], x[14], x[15]
  };
}

u8v16 narrow(u16v16 x)
{
  return (u8v16) {
x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7],
x[8], x[9], x[10], x[11], x[12], x[13], x[14], x[15]
  };
}

void f(char *dst, char *src, unsigned long n, unsigned c)
{
  unsigned ia = 255 - (c >> 24);
  ia += ia >> 7;

  union vec128 c4 = {0}, ia16 = {0};
  c4.u32 += c;
  ia16.u8 += (unsigned char)ia;

  u16v16 c16 = (zxt(c4.u8) << 8) + 128;

  for (; n; src += 16, dst += 16, n -= 4) {
union vec128 s;
memcpy(&s, src, sizeof s);
s.u8 = narrow((zxt(s.u8)*zxt(ia16.u8) + c16) >> 8);
memcpy(dst, &s, sizeof s);
  }
}

[Bug tree-optimization/92768] [8/9/10 Regression] Maybe a wrong code for vector constants

2019-12-03 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92768

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #8 from Alexander Monakov  ---
Previously, in PR 86999, I pointed out to the reporter that it was okay for
gcc to turn a vector constructor with negative zeros into a trivial
all-positive-zeros constructor under -fno-signed-zeros, and nobody
contradicted me at the time.

I think the documentation needs to be clarified if that's not the intent;
right now I cannot deduce for sure from the manual what exactly the
optimizations may or may not do when constant propagation or the like produces
a "negative zero" value.
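
The PR 86999 situation was roughly of this shape (a sketch):

typedef double v2df __attribute__((vector_size(16)));

v2df make(void)
{
    /* Under -fno-signed-zeros, may this be folded to an all-(+0.0),
       i.e. trivial, constructor? */
    return (v2df){ -0.0, 0.0 };
}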

[Bug c++/92855] -fvisibility-inlines-hidden failing to hide out-of-line copies of certain inline member functions

2019-12-08 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92855

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
(FWIW, making 'f' a template in your example makes it non-hidden)

Can you explain why you expect the command-line option to override the
attribute on the namespace? GCC usually implements the opposite, i.e.
attributes prevail over the defaults specified on the command line.

In your sample on Godbolt, Clang also appears to honour the attribute rather
than the option.

[Bug c++/92855] -fvisibility-inlines-hidden failing to hide out-of-line copies of certain inline member functions

2019-12-09 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92855

Alexander Monakov  changed:

   What|Removed |Added

 Resolution|INVALID |DUPLICATE

--- Comment #6 from Alexander Monakov  ---
Thanks. PR 47877 is definitely related but not an exact duplicate. Here we have
the visibility attribute on the enclosing namespace, and even though the
documentation does not spell out what should happen, it appears the intent is
that the option should prevail (so inline functions in the namespace would need
to be decorated with the visibility attribute individually to make them
non-hidden).

I'll close this as duplicate and add an example with a namespace to the older
PR.

*** This bug has been marked as a duplicate of bug 47877 ***

[Bug c++/47877] -fvisibility-inlines-hidden does not hide member template functions

2019-12-09 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47877

Alexander Monakov  changed:

   What|Removed |Added

 CC||thiago at kde dot org

--- Comment #4 from Alexander Monakov  ---
*** Bug 92855 has been marked as a duplicate of this bug. ***

[Bug c++/47877] -fvisibility-inlines-hidden does not hide member template functions

2019-12-09 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47877

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
In PR 92855 we have a similar situation where an inline template function
inherits visibility from the enclosing namespace, while a non-template function
becomes hidden as requested by -fvisibility-inlines-hidden:

namespace N
__attribute__((visibility("default")))
{
  inline void foo() {};
  template<typename T>
  inline void bar() {};
}

int main()
{
  N::foo();
  N::bar<int>();
}

[Bug rtl-optimization/92905] New: [10 Regression] Spills float-int union to memory

2019-12-11 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92905

Bug ID: 92905
   Summary: [10 Regression] Spills float-int union to memory
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization, ra
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

gcc-10 branch regressed for code that needs bitwise operations on floats:

float f(float x)
{
    union {float f; unsigned i;} u = {x};
    u.i |= 0x80000000;
    return u.f;
}

float my_copysign(float x, float y)
{
    union {float f; unsigned i;} ux = {x}, uy = {y};
    ux.i &= 0x7fffffff;
    ux.i |= 0x80000000 & uy.i;
    return ux.f;
}


For function 'f' gcc-10 -O2 -mtune=intel generates
f:
        movd    %xmm0, -4(%rsp)
        movl    $-2147483648, %eax
        orl     -4(%rsp), %eax
        movd    %eax, %xmm0
        ret

while gcc-9 and earlier generate code without stack use, even without
-mtune=intel:
f:
        movd    %xmm0, %eax
        orl     $-2147483648, %eax
        movd    %eax, %xmm0
        ret

Likewise for the more realistic my_copysign, where ux is spilled, but uy is
not.

Eventually it would be nicer to use SSE bitwise operations for this, for
example LLVM already generates
f:
        orps    .LCPI0_0(%rip), %xmm0

[Bug target/92905] [10 Regression] Spills float-int union to memory

2019-12-11 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92905

--- Comment #4 from Alexander Monakov  ---
Perhaps only xmm0 is problematic, as making xmm0 unused by adding a dummy
argument brings back the old spill-free result:

float my_copysign(float dummy, float x, float y)
{
    union {float f; unsigned i;} ux = {x}, uy = {y};
    ux.i &= 0x7fffffff;
    ux.i |= 0x80000000 & uy.i;
    return ux.f;
}

float f(float dummy, float x)
{
    union {float f; unsigned i;} u = {x};
    u.i |= 0x80000000;
    return u.f;
}

[Bug rtl-optimization/92953] New: Undesired if-conversion with overflow builtins

2019-12-16 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92953

Bug ID: 92953
   Summary: Undesired if-conversion with overflow builtins
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

Consider:

/* Return 0 if a==b, any positive value if a>b, any negative value otherwise. */
int foo(int a, int b)
{
    int c;
    if (__builtin_sub_overflow(a, b, &c))
        c = 1 | ~c;
    return c;
}

(suggestions for implementations that would be more efficient on x86 welcome)

on x86 with -Os gives the expected

foo:
        subl    %esi, %edi
        movl    %edi, %eax
        jno     .L1
        notl    %eax
        orl     $1, %eax
.L1:
        ret

but with -O2 there's if-conversion despite internal-fn.c marking the branch as
"very_unlikely":

foo:
        xorl    %edx, %edx
        subl    %esi, %edi
        movl    %edi, %eax
        seto    %dl
        notl    %eax
        orl     $1, %eax
        testl   %edx, %edx
        cmove   %edi, %eax
        ret

Adding __builtin_expect to the source doesn't help. Adding
__builtin_expect_with_probability helps when the specified probability is very
low (<3%), but I feel that shouldn't be required here.

Looking at the expand dump, on RTL we start with two branches: the first comes
from expanding the internal fn to calculate a 0/1 predicate value, and the
second corresponds to the "if" in the source, branching on a test of that
predicate against 0. At -Os, we rely on the first if-conversion pass to
eliminate the first branch, and then on combine to optimize the second branch.

Is it possible to expand straight to one branch by noticing that the predicate
is only used in the gimple conditional that follows immediately?

[Bug target/92953] Undesired if-conversion with overflow builtins

2019-12-16 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92953

--- Comment #2 from Alexander Monakov  ---
Well, the aarch64 backend does not implement the subv<mode>4 pattern in the
first place, which would be required for efficient branchy code:

foo:
        subs    w0, w0, w1
        b.vc    .LBB0_2
        mvn     w0, w0
        orr     w0, w0, #0x1
.LBB0_2:
        ret

This is preferable when the branch is predictable, thanks to the shorter
dependency chain.

[Bug target/66120] __builtin_add/sub_overflow for int32_t emit poor code on ARM

2019-12-16 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66120

Alexander Monakov  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 CC||amonakov at gcc dot gnu.org
 Resolution|--- |FIXED

--- Comment #5 from Alexander Monakov  ---
Looks like the documentation was added in r230651, overflow patterns for arm in
r239739, and for arm64 in r262890.

[Bug target/92953] Undesired if-conversion with overflow builtins

2019-12-16 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92953

--- Comment #4 from Alexander Monakov  ---
At least then GCC should try to use cmovno instead of seto-test-cmove for
if-conversion:

foo:
        movl    %edi, %eax
        subl    %esi, %eax
        notl    %eax
        orl     $1, %eax
        subl    %esi, %edi
        cmovno  %edi, %eax
        ret

[Bug c/93031] Wish: When the underlying ISA does not force pointer alignment, option to make GCC not assume it

2019-12-21 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93031

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
That must be the most well-written report I've seen so far sacrificed to the
God of Unfairly Closed Bugreports.

Note that GCC aims to allow partial overlap for situations when alignment

[Bug target/93039] New: Fails to use SSE bitwise ops for float-as-int manipulations

2019-12-21 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

Bug ID: 93039
   Summary: Fails to use SSE bitwise ops for float-as-int
manipulations
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

(the non-regression part of PR 92905)

libm functions need to manipulate individual bits of float/double
representations with good efficiency, but on x86 gcc typically does them on
gprs even when it results in sse-gpreg-sse move chain:

float foo(float x)
{
    union {float f; unsigned i;} u = {x};
    u.i &= ~0x80000000;
    return u.f;
}

foo:
        movd    eax, xmm0
        and     eax, 2147483647
        movd    xmm0, eax
        ret

It's good to use bitwise ops on general registers if the source or the
destination needs to be in a general register, but for cases like the above,
creating a roundtrip is not desirable.

(GCC gets this example right on aarch64; LLVM on x86 compiles this to SSE/AVX
bitwise 'and', taking the immediate from memory)

[Bug target/92905] [10 Regression] Spills float-int union to memory

2019-12-21 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92905

--- Comment #8 from Alexander Monakov  ---
(In reply to Alexander Monakov from comment #0)
> Eventually it would be nicer to use SSE bitwise operations for this, for
> example LLVM already generates
> f:
>         orps    .LCPI0_0(%rip), %xmm0

This is now reported separately as PR 93039.

[Bug tree-optimization/93055] accumulation loops in stepanov_vector benchmark use more instruction level parpallelism

2019-12-24 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93055

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
GCC is missing a smarter unrolling that would factor dependency chains in such
tiny loops, as sketched below.
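
For illustration, a hand-unrolled reduction of the kind meant here (a sketch;
the reassociation is only valid under relaxed FP rules such as -ffast-math):

double accumulate(const double *a, long n)
{
    double s0 = 0, s1 = 0;
    long i;
    /* Two independent accumulators: each addition depends on the value from
       two iterations back, roughly halving the loop-carried latency. */
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];
    return s0 + s1;
}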

Also, how on Earth do we get this invariant computation inside the loop?

> lea     0x2100(%rsp),%rdi

That's probably a regression that could be investigated and fixed separately.

[Bug tree-optimization/93055] accumulation loops in stepanov_vector benchmark use more instruction level parpallelism

2019-12-24 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93055

--- Comment #2 from Alexander Monakov  ---
Can you attach the preprocessed source and double-check the command-line
flags? I can't reproduce the problem with the lea, and the code does not have
the explicit prefetch instructions that I get with -O3 -march=bdver1.

[Bug tree-optimization/93056] Poor codegen for heapsort in stephanov_vector benchmark

2019-12-24 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93056

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
The benchmark sorts a 2000-entry random array, so GCC's version runs with high
branch misprediction rate. Clang's version is if-converted, it issues one extra
load compared to gcc.

PRE makes it very difficult to if-convert this on RTL, with -fno-tree-pre we
even get nicer code but still not if-converted, so slower than Clang.

[Bug tree-optimization/93055] accumulation loops in stepanov_vector benchmark use more instruction level parpallelism

2019-12-24 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93055

--- Comment #4 from Alexander Monakov  ---
The attachment is edited to test insertion_sort, and doesn't call
accumulate_vector at all - looks like you attached a wrong file?

[Bug c/93072] [8/9/10 Regression] ICE: gimplifier segfault with undefined nested function

2019-12-25 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93072

Alexander Monakov  changed:

   What|Removed |Added

   Keywords||ice-on-invalid-code
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2019-12-25
 CC||amonakov at gcc dot gnu.org
Summary|ICE: Segmentation fault |[8/9/10 Regression] ICE:
   ||gimplifier segfault with
   ||undefined nested function
 Ever confirmed|0   |1

--- Comment #1 from Alexander Monakov  ---
ICEs since gcc-7; gcc-6 just diagnosed a nested function with no body as
invalid.

[Bug target/93078] Missing fma and round functions auto-vectorization with x86-64 (sse2)

2019-12-27 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93078

Alexander Monakov  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2019-12-27
 CC||amonakov at gcc dot gnu.org
  Component|tree-optimization   |target
 Ever confirmed|0   |1

--- Comment #1 from Alexander Monakov  ---
> [...] not sure why dont auto-vectorize the function round directly to
> "roundps xmm0, XMMWORD PTR a[rip], 0"

This is because the C round function is specified to use non-standard
rounding, with halfway cases rounded away from zero rather than to nearest
even. Hence the few extra instructions to fix up halfway arguments.

For nearbyint, looks like CASE_CFN_NEARBYINT is not handled in
ix86_builtin_vectorized_function. Confirming for this case.
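
To illustrate the semantic difference in halfway cases (assuming the default
round-to-nearest-even FP environment):

#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%g %g\n", round(0.5), round(-0.5));        /* 1 -1: away from zero */
    printf("%g %g\n", nearbyint(0.5), nearbyint(1.5)); /* 0 2: ties to even    */
    return 0;
}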

[Bug rtl-optimization/49330] Integer arithmetic on addresses optimised with pointer arithmetic rules

2019-12-30 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49330

--- Comment #29 from Alexander Monakov  ---
(In reply to Alexander Cherepanov from comment #28)
> I see the same even with pure pointers. I guess RTL doesn't care about such
> differences but it means the problem could bite a relatively innocent code.

Can you please open a separate bugreport for this and reference the new bug #
here? It's a separate issue, and it's also a regression, gcc-4.7 did not
miscompile this. The responsible pass seems to be RTL DSE.

[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended

2019-12-31 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #23 from Alexander Monakov  ---
In libcpp, search_line_fast implementations suffer from this, and what is
worse, attempts to work around it by explicitly requesting zero extension
don't work:

char *foo(char *p, int x)
{
  return p + (unsigned)__builtin_ctz(x);
}

The above code deliberately asks for zero extension, and yet various
optimizations in GCC transform it back to the costlier form with sign
extension.

(FWIW, LLVM gets this right)

[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite

2020-01-06 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
The compiler has no way of knowing ahead of time that you will be evaluating
the result on random data; for mostly-sorted arrays branching is arguably
preferable.

__builtin_expect_with_probability is a poor proxy for unpredictability: a
condition that is true every other time leads to a branch that is both very
predictable and has probability 0.5.

I think what you really need is a way to express branchless selection in the
source code when you know you need it but the compiler cannot see that on its
own. Other algorithms like constant-time checks for security-sensitive
applications probably also need such computational primitive.

So perhaps an unpopular opinion, but I'd say a __builtin_branchless_select(c,
a, b) (guaranteed to live throughout optimization pipeline as a non-branchy
COND_EXPR) is badly missing.
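
To illustrate the gap: today the selection can only be written in forms the
optimizer is free to turn back into a branch, e.g. (a sketch):

/* Neither form guarantees branchless code: the compiler may if-convert
   or un-if-convert either one. */
static inline int select_ternary(int c, int a, int b)
{
    return c ? a : b;
}

static inline int select_arith(int c, int a, int b)
{
    int mask = -(c != 0);            /* all-ones when c is nonzero */
    return (a & mask) | (b & ~mask);
}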

[Bug target/93039] Fails to use SSE bitwise ops for float-as-int manipulations

2020-01-08 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

--- Comment #3 from Alexander Monakov  ---
> The question is for which CPUs is it actually faster to use SSE?

In the context of chains where the source and the destination need to be SSE
registers, pretty much all CPUs. Inter-unit moves typically have some latency:
e.g. recent AMD (since Zen) and Intel (Skylake) have latency 3 for sse<->gpr
moves (surprisingly, the four generations prior to Skylake had latency 1).
Older AMDs with a shared FPU had even worse latencies. At the same time, SSE
integer ops have latencies and throughput comparable to GPR ones, so generally
moving a chain to SSE ops doesn't make it slower. Plus it helps with register
pressure.

When either the source or the destination of a chain is bound to a general
register or memory, it's ok to continue doing it on general regs.

[Bug target/93039] Fails to use SSE bitwise ops for float-as-int manipulations

2020-01-09 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

--- Comment #5 from Alexander Monakov  ---
Ah, in that sense. The extra load is problematic in cold code, where it's
likely a TLB miss. For hot code: the load does not depend on any previous
computations and so does not lengthen dependency chains, so it's fine from a
latency point of view. From a throughput point of view there's a tradeoff: one
extra load per chain may be fine, but if every other instruction in a chain
needs a different load, that's probably excessive. So it needs to be costed
somehow.

That said, sufficiently simple constants can be synthesized with SSE in-place
without loading them from memory, for example the constant in the opening
example:

  pcmpeqd %xmm1, %xmm1  // xmm1 = ~0
  pslld   $31, %xmm1    // xmm1 <<= 31

(again, if we need to synthesize just one constant per chain that's preferable,
if we need many, the extra work would need to be costed against the latency
improvement of keeping the chain on SSE)

[Bug target/93274] target_clones produces symbols with random digits with -fPIC

2020-01-15 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93274

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
IFUNCs don't have to be somehow "globally visible". The comment is there from
day 1, but it's not clear why - possibly a misunderstanding? GCC happily
accepts a static ifunc, and the rest of the toolchain should have no problem
either:

__attribute__((used,ifunc("r_f")))
static void f();

static void *r_f()
{
 return 0;
}
        .type   r_f, @function
r_f:
        xorl    %eax, %eax
        ret
        .size   r_f, .-r_f
        .type   f, @gnu_indirect_function
        .set    f,r_f

[Bug testsuite/90565] [10 regression] test cases gcc.dg/uninit-18.c and uninit-pr90394-1-gimple.c broken as of r271460

2020-01-17 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90565

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
I think this should have been fully resolved by PR90587 fix as indicated by
Vlad in his gcc-patches message:
https://gcc.gnu.org/ml/gcc-patches/2019-05/msg01600.html

[Bug c/93278] huge almost empty array takes huge time to compile and produces huge object file

2020-01-18 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93278

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #8 from Alexander Monakov  ---
(Jakub - the assembler could emit a file with holes by lseek()'ing over zeroed
areas instead of write()'ing literal zeroes to the file)
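
A sketch of that idea, assuming POSIX I/O (the file name and sizes are
illustrative):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("out.o", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    write(fd, "data", 4);          /* some real bytes                     */
    lseek(fd, 1 << 20, SEEK_CUR);  /* skip a zeroed megabyte: the kernel
                                      leaves a hole instead of blocks     */
    write(fd, "data", 4);          /* file size > 1MB, disk usage tiny    */
    close(fd);
    return 0;
}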

I see the bug is closed, but for the sake of adding some clarity:

If bin/gcc by default produces a file named "a.exe", that suggests you're on
Windows. There's a good reason why you're asked to show the output of 'gcc -v'
(not 'gcc --version'!): it includes configuration info such as the compiler's
host system.

If you're really on Windows, the slowness is probably explained by
Windows-specific I/O overheads, e.g. an antivirus intercepting and blocking
writes. Your timing info amounts to 1 millisecond per 4KB chunk of output. On
Linux the assembler needs 0.007 seconds on my machine, amounting to
microseconds per 4KB chunk.

[Bug target/91838] [8/9 Regression] incorrect use of shr and shrx to shift by 64, missed optimization of vector shift

2020-01-23 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91838

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
For unsigned int this would be undefined behavior via attempt to shift by 32
(but we fail to emit the corresponding warning).

For narrower types (char, short) this seems well-defined.

[Bug target/91824] unnecessary sign-extension after _mm_movemask_epi8 or __builtin_popcount

2020-01-23 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91824

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
A related problem was previously discussed in PR 29776 but this is not a
duplicate. In that PR part of the problem was with UB of clz/ctz for zero
inputs, none of that poses a problem for popcount and pmovmskb.

[Bug rtl-optimization/93402] [8/9/10 Regression] Wrong code when returning padded struct

2020-01-23 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93402

Alexander Monakov  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2020-01-23
 CC||amonakov at gcc dot gnu.org
Summary|Wrong code when returning   |[8/9/10 Regression] Wrong
   |padded struct   |code when returning padded
   ||struct
 Ever confirmed|0   |1

--- Comment #1 from Alexander Monakov  ---
Broken by postreload-cse, -fdbg-cnt=postreload_cse:0 gives good code.

[Bug middle-end/90348] [8/9/10 Regression] Partition of char arrays is incorrect in some cases

2020-01-23 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90348

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #17 from Alexander Monakov  ---
I think part of the problem is trying to make "deaths" explicit via CLOBBERs
without making "births" also explicit in the IR. Doing both would have allowed
a lifetime verifier that checks that births dominate all references to a
variable and likewise deaths/clobbers postdominate all references, which would
presumably catch this early and make the IR more rigorous overall.

[Bug middle-end/90348] [8/9/10 Regression] Partition of char arrays is incorrect in some cases

2020-01-23 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90348

--- Comment #19 from Alexander Monakov  ---
(In reply to Michael Matz from comment #18)
> represent all accesses indirectly via pointers

Would that be necessary in presence of a verifier that ensures that all
references are dominated by births?

[Bug target/91838] [8/9 Regression] incorrect use of shr and shrx to shift by 64, missed optimization of vector shift

2020-01-23 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91838

--- Comment #5 from Alexander Monakov  ---
Ah, indeed, it should be explicitly UB, and the documentation should mention
that, as well as the fact that implicit integer promotion does not happen for
vector shifts and other operations.

[Bug tree-optimization/93301] Wrong optimization: instability of uninitialized variables leads to nonsense

2020-01-26 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93301

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov  ---
(you mean unreachable code, not dead code)

Nice find, this is definitely a bug. As you say, loop unswitching introduces an
unconditional use of an uninitialized variable which otherwise is conditional
and (might be) never executed. The testcase hits a problematic early-out in
is_maybe_undefined:

  /* Uses in stmts always executed when the region header executes
 are fine.  */
  if (dominated_by_p (CDI_DOMINATORS, loop->header, gimple_bb (def)))
continue;

the code does not match the comment, checking postdominators might be correct,
but not dominators.

This was introduced by r245057 for PR71691, so technically a 8/9/10 regression.
Probably worth splitting into a separate PR as this is more serious and might
be more straightforward to fix than the earlier testcases.

[Bug tree-optimization/93444] New: [8/9/10 Regression] unswitching introduces unconditional use of uninitialized variable

2020-01-26 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93444

Bug ID: 93444
   Summary: [8/9/10 Regression] unswitching introduces
unconditional use of uninitialized variable
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
CC: ch3root at openwall dot com
  Target Milestone: ---

Splitting out bug 93301 comments 6 and 7.

__attribute__((noipa))
static int opaque(int i) { return i; }

int main()
{
    short x = opaque(1);
    short y;

    opaque(x - 1);

    while (opaque(1)) {
        __builtin_printf("x = %d;  x - 1 = %d\n", x, opaque(1) ? x - 1 : 5);

        if (opaque(1))
            break;

        if (x - 1 == y)
            opaque(y);
    }
}

Prints "x = 1;  x - 1 = 5" at -O3.

Loop unswitching introduces an unconditional use of an uninitialized variable
which otherwise is conditional and never executed. The testcase hits a
problematic early-out in is_maybe_undefined:

  /* Uses in stmts always executed when the region header executes
     are fine.  */
  if (dominated_by_p (CDI_DOMINATORS, loop->header, gimple_bb (def)))
    continue;

the code does not match the comment; checking postdominators might be correct,
but not dominators.

This was introduced by r245057 for PR71691.

[Bug tree-optimization/93301] Wrong optimization: instability of uninitialized variables leads to nonsense

2020-01-26 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93301

--- Comment #8 from Alexander Monakov  ---
Pasted that to new PR 93444 (should have done that right away, sorry).

[Bug tree-optimization/93444] [8/9/10 Regression] ssa-loop-im introduces unconditional use of uninitialized variable

2020-01-26 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93444

Alexander Monakov  changed:

   What|Removed |Added

Summary|[8/9/10 Regression] |[8/9/10 Regression]
   |unswitching introduces  |ssa-loop-im introduces
   |unconditional use of|unconditional use of
   |uninitialized variable  |uninitialized variable

--- Comment #1 from Alexander Monakov  ---
Actually, scratch that. I should have double-checked the order of arguments to
dominated_by_p (and the tree dump). This code is not to blame. The problem
starts earlier when tree-ssa-loop-im moves

  _7 = (int) y_21(D);

out of the loop, making the access unconditional when it was conditional in the
loop and actually unreachable at runtime.

(editing the subject to reflect this, but keeping the regression marker)
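
For concreteness, a hand-written C-level picture of the effect (a sketch, not
actual pass output; opaque as in the testcase above):

extern int opaque(int);

void before_lim(void)
{
    short x = opaque(1), y;
    while (opaque(1)) {
        if (opaque(1))
            break;
        if (x - 1 == y)  /* y is read (and widened) only here, and this
                            point is never reached at runtime */
            opaque(y);
    }
}

void after_lim(void)
{
    short x = opaque(1), y;
    int t = (int) y;     /* hoisted: unconditional read of uninitialized y */
    while (opaque(1)) {
        if (opaque(1))
            break;
        if (x - 1 == t)
            opaque(t);
    }
}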

[Bug tree-optimization/93444] [8/9/10 Regression] ssa-loop-im introduces unconditional use of uninitialized variable

2020-01-27 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93444

--- Comment #5 from Alexander Monakov  ---
The problem is lifting a conditional access. We don't have an example where
lifting an invariant from an always-executed block in a loop to its preheader
poses a problem.

LLVM adopted an approach where hoisting must "freeze" the "poisoned" values
resulting from uninitialized access so they acquire a concrete unpredictable
value: https://www.cs.utah.edu/~regehr/papers/undef-pldi17.pdf
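
Roughly, with a hypothetical __freeze primitive (GCC has no such builtin;
this is only a sketch of the semantics):

extern short __freeze(short);  /* hypothetical, not a real GCC builtin */

void sketch(void)
{
    short y;                 /* uninitialized */
    int a = y, b = y;        /* a == b is not guaranteed: y is unstable */

    short yf = __freeze(y);  /* freeze pins y to one concrete value */
    int c = yf, d = yf;      /* c == d is guaranteed, so hoisting
                                yf-based computations is safe */
}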

[Bug tree-optimization/93491] Wrong optimization: const-function moved over control flow leading to crashes

2020-01-30 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93491

Alexander Monakov  changed:

   What|Removed |Added

 Status|WAITING |NEW
 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
(In reply to Alexander Cherepanov from comment #2)
> > Do you have a testcase were gcc does this optimize without the user adding
> > const and still traps?
> 
> No. I'll file a separate bug if I stumble upon one, so please disregard this
> possibility for now.

GCC will deduce that g is const just fine; it even tells you so with
-Wsuggest-attribute=const:

__attribute__((noipa))
void f(int i)
{
    __builtin_exit(i);
}

__attribute__((noinline))
int g(int i)
{
    return 1 / i;
}

int main()
{
    while (1) {
        f(0);

        f(g(0));
    }
}

Thus removing WAITING and confirming.

[Bug tree-optimization/93521] 40% slower in O2 than O1 (tree-pre)

2020-01-31 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93521

Alexander Monakov  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 CC||amonakov at gcc dot gnu.org
 Resolution|--- |DUPLICATE

--- Comment #1 from Alexander Monakov  ---
Dup.

*** This bug has been marked as a duplicate of bug 93056 ***

[Bug tree-optimization/93056] Poor codegen for heapsort in stephanov_vector benchmark

2020-01-31 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93056

Alexander Monakov  changed:

   What|Removed |Added

 CC||hehaochen at hotmail dot com

--- Comment #2 from Alexander Monakov  ---
*** Bug 93521 has been marked as a duplicate of this bug. ***

[Bug c++/92572] Vague linkage does not work reliably when a matching segment is in a dynamically linked libarary on Linux

2020-02-10 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92572

--- Comment #5 from Alexander Monakov  ---
GCC is emitting static_local as @gnu_unique_object, so it should be unified by
the Glibc dynamic linker. You can use 'nm -CD' to check its type after linking
for the main executable and the library to make sure ld keeps it unique, and
LD_DEBUG=all (see 'man ld.so') to see how it gets resolved at runtime.

[Bug rtl-optimization/88879] [9 Regression] ICE in sel_target_adjust_priority, at sel-sched.c:3332

2020-02-11 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88879

--- Comment #15 from Alexander Monakov  ---
This should not be reproducible with current HEAD because the assert was simply
eliminated. If GCC master definitely fails, can you please provide the exact
diagnostic?

As for 9.2, this is sadly expected because the patch was not backported; I will
backport it soon for the next release from the gcc-9 branch (but if you're
building GCC yourself, you can easily do it on your end, as the patch simply
removes the offending assert). Sorry about the trouble.

[Bug tree-optimization/93734] [8/9/10 Regression] Invalid code generated with -O2 -march=haswell -ftree-vectorize

2020-02-13 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93734

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
I tried to make an equivalent C testcase, but complex ops don't map 1:1 from
Fortran, so it's a bit difficult. Nevertheless, here's a somewhat similar
testcase that aborts on 8/9, works on trunk, but IR and resulting assembly look
quite different:

(needs -O2 -ftree-vectorize -mfma -fcx-limited-range)

__attribute__((noipa))
static
_Complex double
test(_Complex double * __restrict a,
     _Complex double * __restrict x,
     _Complex double t, long jx)
{
    long i, j;

    for (j = 6, i = 3; i >= 0; i--, j -= jx)
        x[j] -= t*a[i];

    return x[4];
}

int main()
{
    _Complex double a[5] = {1, 1, 1, 1, 10};
    _Complex double x[9] = {1,1,1,1,1,1,1,1,1};
    if (test(a, x, 1, 2))
        __builtin_abort();
}

[Bug rtl-optimization/88879] [9 Regression] ICE in sel_target_adjust_priority, at sel-sched.c:3332

2020-02-14 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88879

Alexander Monakov  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #18 from Alexander Monakov  ---
Fixed.

[Bug rtl-optimization/93743] [9/10 Regression] swapped arguments in atan2l

2020-02-14 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93743

Alexander Monakov  changed:

   What|Removed |Added

 Target||i?86-*-*, x86_64-*-*
 Status|UNCONFIRMED |NEW
   Keywords||wrong-code
   Last reconfirmed||2020-02-14
  Component|c   |rtl-optimization
 CC||amonakov at gcc dot gnu.org
 Ever confirmed|0   |1
Summary|swapped arguments in atan2l |[9/10 Regression] swapped
   ||arguments in atan2l

--- Comment #1 from Alexander Monakov  ---
Minimal testcase; needs -ffast-math or -Ofast:

long double f(long double y, long double x)
{
    return __builtin_atan2l(y, x);
}

reg-stack seems to get the order of x87 stack pushes wrong

Correct output in gcc-8:

    fldt    8(%rsp)
    fldt    24(%rsp)
    fpatan
    ret

Wrong output in gcc-9/trunk:

    fldt    24(%rsp)
    fldt    8(%rsp)
    fpatan
    ret

[Bug rtl-optimization/93743] [9/10 Regression] swapped arguments in atan2l

2020-02-14 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93743

Alexander Monakov  changed:

   What|Removed |Added

 CC||uros at gcc dot gnu.org
  Component|target  |rtl-optimization
   Target Milestone|9.3 |---

--- Comment #2 from Alexander Monakov  ---
Looks related to Uros' svn r264648.

[Bug tree-optimization/93745] Redundant store not eliminated with intermediate instruction

2020-02-14 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93745

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
The store cannot be eliminated under GCC's memory model, where stores through a
pointer can change the dynamic type.

The extra load you mention can be optimized because, if *p and d overlap, an
attempt to load the stored 'double' value via an 'int *' would have invoked
undefined behavior.

[Bug middle-end/93744] [8/9/10 Regression] Different results between gcc-9 and gcc-7

2020-02-14 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93744

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
Note that the pattern will get the sign of floating-point zero wrong if it ever
triggers for an fp type.

Also, it's too specific: it at least misses eq/ne comparisons, but generally
speaking it doesn't need to be tied to comparisons in the first place, since
any multiplication by a boolean value can be converted to a select.
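
A sketch of both points (my examples, not the PR's testcase):

int imul(int x, int a, int b)
{
    return x * (a == b);   /* eq compare: convertible to (a == b) ? x : 0 */
}

double fmul(double x, int a, int b)
{
    /* the select form is wrong here: for x = -1.0 and a >= b,
       x * 0 yields -0.0, while the select would yield +0.0 */
    return x * (a < b);
}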

[Bug tree-optimization/93745] Redundant store not eliminated with intermediate instruction

2020-02-14 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93745

--- Comment #4 from Alexander Monakov  ---
Placement new is translated to a plain pointer assignment on GIMPLE, so
optimizers cannot distinguish programs that had placement new from programs
that did not.

(in C we need memory from malloc to be reusable, so imagine that instead of
'double d' the example had a store via a 'double *')
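
A minimal C-level sketch of why the store must stay (my example, not the
PR's code):

#include <stdlib.h>

int f(void)
{
    void *m = malloc(sizeof(double));
    if (!m)
        return 0;
    *(double *)m = 1.0;   /* the storage now effectively holds a double */
    *(int *)m = 2;        /* a store legally changes the dynamic type... */
    return *(int *)m;     /* ...so the int store cannot be dropped as
                             "incompatible" with the earlier double */
}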

[Bug gcov-profile/93623] No need to dump gcdas when forking

2020-02-21 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93623

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
(In reply to calixte from comment #2)
> I think the reset is useless in the case of exec** functions since the
> counters are lost when an exec** is called. So it can probably be removed
> too.

exec can fail; resetting only after an (unsuccessful) exec may be ok, but
eliding the reset entirely does not seem so.
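
A sketch of the failing-exec scenario (my example; assumes a build
instrumented with -fprofile-arcs):

#include <unistd.h>
#include <stdio.h>

int main(void)
{
    /* the instrumented exec wrapper dumps counters before the exec;
       if the exec fails, the process keeps running instrumented code */
    char *argv[] = { "nonexistent", (char *) 0 };
    execv("/nonexistent", argv);
    /* without a reset here, everything counted before the exec would be
       dumped a second time when the process eventually exits */
    puts("exec failed, still running and still profiling");
    return 0;
}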

[Bug c/93848] missing -Warray-bounds warning for array subscript 1 is outside array bounds

2020-02-21 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93848

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
Note that 6.5.6 takes care to allow unevaluated * operator:

If the result points one past the last element of the array object,
it shall not be used as the operand of a unary * operator that is
evaluated.

So for example there's no UB in

void bar_aux (int *);
void foo (void)
{
  int i;
  int *p = &i;
  bar_aux (&p[1]);
}

In your example with 'bar', the formal evaluation of the expression 'p[1]' does
not create a copy of the array; it simply strips off one array dimension in the
pointed-to type. So I am pretty sure it was not the intention of the standard
to make that undefined. Perhaps the standard could be edited to make that
clearer, but there's no need to issue a warning here.
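
And a hedged reconstruction of the multi-dimensional case discussed above
(hypothetical; the reporter's exact 'bar' is not quoted here):

void bar(int (*)[4]);
void baz(void)
{
    int a[1][4];
    bar(&a[1]);  /* the "evaluation" of a[1] does not load or copy the
                    array; &a[1] just strips one array dimension, yielding
                    a valid one-past-the-end pointer of type int (*)[4] */
}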

[Bug target/93934] Unnecessary fld of uninitialized float stack variable results in ub of valid C++ code

2020-02-26 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93934

Alexander Monakov  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 CC||amonakov at gcc dot gnu.org
 Resolution|--- |INVALID

--- Comment #2 from Alexander Monakov  ---
fcmov can only raise an x87 fpu exception on x87 stack underflow, which cannot
happen here.

Even if it did raise FE_INVALID for SNaNs, note that GCC does not support SNaNs
by default; -fsignaling-nans can be specified to request that, but the
documentation says the support is incomplete.

No bug here afaict.

[Bug target/93934] Unnecessary fld of uninitialized float stack variable results in ub of valid C++ code

2020-02-26 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93934

--- Comment #5 from Alexander Monakov  ---
Ah, indeed. fld won't raise FE_INVALID for 80-bit long double, but here
'result' is stored on the stack in 64-bit format.

So: fcmov and 80-bit fldt don't trap; 32-bit flds and 64-bit fldl do.

Somehow RTL if-conversion would have to check "-fsignaling-nans is requested
and the target may raise FE_INVALID on loads" among other reasons to reject a
speculative load.

I am afraid though that several other optimizations do not anticipate that x87
fp loads can raise exceptions on SNaNs either, making -fsignaling-nans
difficult to implement in full.
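
For reference, a sketch that should exhibit the trapping load (my example,
untested; assumes x86-64 where the conversion to long double goes through
x87, and glibc):

#include <fenv.h>
#include <stdio.h>

int main(void)
{
    volatile double snan = __builtin_nans("");  /* 64-bit signaling NaN */
    feclearexcept(FE_ALL_EXCEPT);
    volatile long double l = snan;  /* double -> long double is done with
                                       an x87 fldl, which raises FE_INVALID
                                       for a signaling NaN */
    (void) l;
    printf("FE_INVALID raised: %d\n", !!fetestexcept(FE_INVALID));
    return 0;
}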

[Bug target/93934] Unnecessary fld of uninitialized float stack variable results in ub of valid C++ code

2020-02-26 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93934

--- Comment #8 from Alexander Monakov  ---
I think regstack is fine as x87 only supports computations in its native 80-bit
format and conversions to/from ieee float/double happen only on memory
loads/stores.

> I suppose a fldt followed by "truncation" to 32/64 bit would then trap at
> the truncation step?

Such "truncation" can only be implemented via a spill/reload on x87, so, yes.

> We'd have to mark all loads from not must-initialized memory as possibly
> trapping and thus not eligible for if-conversion.

(except long double)

> And this applies to possibly uninitialized registers
> as well which might be spilled or allocated to the stack.

Ideally registers should always be spilled in their native 80-bit format, for
which the problem does not arise. For C with -fexcess-precision=standard this
should already be the case.

[Bug middle-end/56077] [4.6/4.7/4.8 Regression] volatile ignored when function inlined

2013-02-04 Thread amonakov at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56077

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #8 from Alexander Monakov  2013-02-04 17:25:05 UTC ---
The difference in behaviour is due to this change in sched_analyze_insn,
inside "if (reg_pending_barrier)":

+  /* Flush pending lists on jumps, but not on speculative checks.  */
+  if (JUMP_P (insn) && !(sel_sched_p ()
+                         && sel_insn_is_speculation_check (insn)))
     flush_pending_lists (deps, insn, true, true);

The "JUMP_P (insn) && " part in the condition seems to be an unintended
change.

[Bug target/56200] queens benchmark is faster with -O0 than with any other optimization level

2013-02-04 Thread amonakov at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56200

--- Comment #2 from Alexander Monakov  2013-02-04 21:36:38 UTC ---
(In reply to comment #1)
> What happens if you also use -fno-ivopts ?

For me, -fno-ivopts gives a small improvement, but it is still slower than
-O0.  I think the slowdown is related to code layout in the Icache and branch
predictors. There is a hot region which is composed of three consecutive
conditional branches (cmp-jg-cmp-jg-cmp-jg in optimized code and
mov-cmp-jl-mov-cmp-jl-mov-cmp-jl at -O0). If I align the first _and_ the
second to a 16-byte boundary, I get better performance than -O0, but aligning
only one of those is still slower than -O0:

--- o1.s        2013-02-05 00:04:44.405072150 +0400
+++ o1h.s       2013-02-05 01:17:43.648014420 +0400
@@ -119,9 +119,11 @@ find:
        movq    %rdx, %rbp
        leal    1(%r14), %eax
        movl    %eax, 12(%rsp)
+       .p2align 4,,7
 .L18:
        cmpl    file(%r12), %r14d
        jg      .L17
+       .p2align 4,,7
        cmpl    (%r15,%r12), %r14d
        jg      .L17
        cmpl    (%rbx), %r14d

[Bug target/56200] queens benchmark is faster with -O0 than with any other optimization level

2013-02-05 Thread amonakov at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56200

Alexander Monakov  changed:

   What|Removed |Added

 CC||hjl.tools at gmail dot com,
   ||ubizjak at gmail dot com

--- Comment #4 from Alexander Monakov  2013-02-05 09:46:13 UTC ---
The need for the first alignment is clear: it aligns the loop to a 16-byte
boundary, and gcc does set that alignment at -O2.  Uros, H.J., any idea why
separating the first conditional jump from the rest by additional alignment
is crucial for performance in this case?  Is there anything that can be
improved in GCC here?

[Bug sanitizer/56393] SIGSEGV when -fsanitize=address and dynamic lib with global objects

2013-02-21 Thread amonakov at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56393

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #14 from Alexander Monakov  2013-02-21 10:54:13 UTC ---
(In reply to comment #13)
> We've got this problem on Android, where an instrumented JNI library is
> loaded into Dalvik VM, which is outside of user control. We "solve" it by
> requiring that the runtime library is LD_PRELOAD-ed into the DVM (Android
> has a mechanism to do this on an individual app basis on rooted devices).

OT, but what is this mechanism you speak of?  Currently this bug is the top
google hit for "Dalvik sanitizer LD_PRELOAD", and I don't see how it might
work if the VM only forks, not execs.

[Bug c/56507] GCC -march=native for Core2Duo

2013-03-04 Thread amonakov at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56507

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org
 Resolution|INVALID |DUPLICATE

--- Comment #5 from Alexander Monakov  2013-03-04 09:29:32 UTC ---
Looks like a duplicate of PR 39851 then.

*** This bug has been marked as a duplicate of bug 39851 ***

[Bug other/39851] gcc -Q --help=target does not list extensions selected by -march=

2013-03-04 Thread amonakov at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39851

Alexander Monakov  changed:

   What|Removed |Added

 CC||bratsinot at gmail dot com

--- Comment #4 from Alexander Monakov  2013-03-04 09:29:32 UTC ---
*** Bug 56507 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/53265] Warn when undefined behavior implies smaller iteration count

2013-03-11 Thread amonakov at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53265

--- Comment #10 from Alexander Monakov  2013-03-11 16:15:36 UTC ---
(In reply to comment #8)
> Not sure about the warning wording

What about (... "iteration %E invokes undefined behavior", max)?

> plus no idea how to call the warning option (-Wnum-loop-iterations,
> -Wundefined-behavior-in-loop, something else?)

Can it be -Waggressive-loop-optimizations to follow existing pairs of
-{W,fno-}strict-{aliasing,overflow} for the recently added
-fno-aggressive-loop-optimizations?
