[Bug inline-asm/87733] local register variable not honored with earlyclobber
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87733 --- Comment #14 from Alexander Monakov --- Just to clarify, the two testcases added in the quoted commit don't try to catch the issue discussed here: that the operand is passed in a wrong register.
[Bug inline-asm/87733] local register variable not honored with earlyclobber
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87733 --- Comment #21 from Alexander Monakov --- > I could guess the compiler might ignore your inputs/outputs that you specify > if you don't have any % usages for them. Are you seriously suggesting that examples in the GCC manual are invalid and every such usage out there should go and add mentions of referenced registers in the comment in the inline asm template? https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html
[Bug rtl-optimization/94728] [haifa-sched][restore_pattern] recalculate INSN_TICK for the dependence type of REG_DEP_CONTROL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94728 Alexander Monakov changed: What|Removed |Added CC||abel at gcc dot gnu.org Resolution|--- |INVALID Status|UNCONFIRMED |RESOLVED --- Comment #3 from Alexander Monakov --- At a high level the analysis makes sense to me, but as this is predication in the Haifa scheduler, this is not really my domain :) The bug report is also missing a testcase and information about the target. I see the reporter has just sent an email to the gcc@ mailing list, so I'm closing the report: https://gcc.gnu.org/pipermail/gcc/2020-April/232192.html
[Bug bootstrap/91972] Bootstrap should use -Wmissing-declarations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91972 --- Comment #1 from Alexander Monakov --- Another reason to have -Wmissing-declarations is that otherwise mismatches of unused functions are not caught until it's too late (mismatching definition is assumed to be an overload of the function declared in the header file). For a recent example, see https://gcc.gnu.org/pipermail/gcc-patches/2020-May/545129.html which was necessary after a mismatch introduced in https://gcc.gnu.org/pipermail/gcc-patches/2020-May/545114.html
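A sketch of the failure mode described above, with hypothetical names (not the actual functions from the linked patches): in C++, a definition that drifted out of sync with its header declaration is silently taken as a brand-new overload, so nothing complains until link time - or never, if the function is unused. -Wmissing-declarations would flag the stray definition immediately.

// in a header:
void report_error(int code);

// in a .cc file, after the parameter type was changed in only one place:
void report_error(long code) { /* ... */ }  // declares a new overload rather
                                            // than defining the function
                                            // above; -Wmissing-declarations
                                            // would warn right here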
[Bug bootstrap/91972] Bootstrap should use -Wmissing-declarations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91972 --- Comment #4 from Alexander Monakov --- > Why is it missing the static keyword then? (Or alternatively, why isn't it in > an anonymous namespace?) Huh? Without the warning developers may simply forget to put the 'static' keyword. With the warning they would be reminded when bootstrapping the patch. > Ah, I like the namespace thing for target hooks (possibly langhooks as well). Sure, it's nice to have sensible namespace rules for future additions, but hopefully that's not a reason/excuse to never re-enable the warning.
[Bug c++/95103] Unexpected -Wclobbered in bits/vector.tcc with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95103 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #3 from Alexander Monakov --- Richard's explanation in comment #1 is correct. The compiler assumes any external call in the destructor can transfer control back to setjmp. In principle in this case the warning is avoidable by observing that jmp_buf is local and does not escape, but for any other returns_twice function the problem would remain, as there's no jmp_buf-like key to track (think vfork). (iow: solving this would need special-casing warning code for setjmp, which currently works the same for all functions with the returns_twice attribute) Let's close this?
[Bug rtl-optimization/95123] [10/11 Regression] Wrong code w/ -O2 -fselective-scheduling2 -funroll-loops --param early-inlining-insns=5 --param loop-invariant-max-bbs-in-loop=3 --param max-jump-thread
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95123 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #1 from Alexander Monakov --- This is probably due to sel-sched, and very sensitive to compiler revision: I tried checking with a 20200511 (one day difference) on Compiler Explorer, and could not reproduce the miscompilation. If you still have the compiler binary, you can help out by testing with sel-sched debug counters: if you append -fdbg-cnt=sel_sched_insn_cnt:0 to the "bad" command line, it should work again (as sel-sched will not move anything), with -fdbg-cnt=sel_sched_insn_cnt:9 it should fail. We use this for isolating a problematic transformation (by bisecting on the counter value). (other sel-sched debug counters are sel_sched_cnt and sel_sched_region_cnt, but they are more coarse-grained, by pass and region, instead of insn, respectively)
[Bug c++/95103] Unexpected -Wclobbered in bits/vector.tcc with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95103 --- Comment #5 from Alexander Monakov --- No, this analogy does not work. setjmp both sets up a buffer and receives control, so it corresponds to both try and catch together. A matching "C++" code would look like:

> void f3() {
>     std::vector<int> v;
>     for (int i = 0; i != 2; ++i) {
>         if (!f2("xx")) f1();
>         v.push_back(0);
>     }
>     try {
>     } catch (...) {
>     }
> }

where it's evident that v does not leave scope and its destructor cannot be reached. (comment #1 and #3 still stand)
[Bug rtl-optimization/95123] [10/11 Regression] Wrong code w/ -O2 -fselective-scheduling2 -funroll-loops --param early-inlining-insns=5 --param loop-invariant-max-bbs-in-loop=3 --param max-jump-thread
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95123 --- Comment #6 from Alexander Monakov --- Oh, you're probably configuring your compiler with --enable-default-pie. Please paste the entire gcc -v. I can reproduce the miscompile if I pass -fpie -pie.
[Bug c/95379] Don't warn about the universal zero initializer for a structure with the 'designated_init' attribute.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95379 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #4 from Alexander Monakov --- > does anyone know if it's part of C too? { } is valid C++, invalid C; GCC accepts it in C as an extension, and warns with -pedantic. I think this enhancement request is reasonable.
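To make the distinction concrete (a minimal sketch with a hypothetical struct; in C17 and earlier, before C23 adopted empty braces): GCC in C mode accepts the first initializer only as an extension and warns about it under -pedantic, while the second is the portable universal zero initializer this request is about.

struct point { int x, y; };
struct point a = { };   /* valid C++; GNU extension in C, -pedantic warns */
struct point b = { 0 }; /* valid in both C and C++ */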
[Bug target/95435] bad builtin memcpy performance with znver1/znver2 and 32bit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95435 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #5 from Alexander Monakov --- Ugh. Stringop tuning for Ryzens is terribly anachronistic: all AMD processors since K8 (!!) use the exact same tables, and 32-bit memset/memcpy don't use a libcall for large sizes:

static stringop_algs znver2_memcpy[2] = {
  {libcall, {{6, loop, false}, {14, unrolled_loop, false},
             {-1, rep_prefix_4_byte, false}}},
  {libcall, {{16, loop, false}, {64, rep_prefix_4_byte, false},
             {-1, libcall, false}}}};

(the first subarray is 32-bit tuning, the second is for 64-bit) Using the test_stringop microbenchmark from PR43052 it's easy to see that library memset/memcpy are fastest on sizes 256 and above. Below that, the result from the microbenchmark may be debatable; I think we should prefer the libcall almost always except for the tiniest sizes, for I-cache locality reasons. But anyway, the current tuning is completely inappropriate.
[Bug target/95435] bad builtin memcpy performance with znver1/znver2 and 32bit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95435 --- Comment #8 from Alexander Monakov --- There are no tuning tables for memcmp at all; the existing structs cover only memset and memcpy. So as far as I see, retuning memset/memcpy doesn't need to wait for [1], because there's no infrastructure in place for memcmp tuning, and adding that can be done independently. Updating the Ryzen tables would not touch any code updated by H.J. Lu's patchset at all.
[Bug ipa/95558] Invalid IPA optimizations based on weak definition
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95558 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org, ||marxin at gcc dot gnu.org Component|middle-end |ipa Keywords||wrong-code --- Comment #1 from Alexander Monakov --- All functions are incorrectly discovered to be pure, and then the loop that only makes calls to non-weak pure functions is eliminated. Minimal testcase for the root issue, wrong warning with -O2 -Wsuggest-attribute=pure:

static void dummy(){}
void weak() __attribute__((weak,alias("dummy")));

int foo()
{
  weak();
  return 0;
}
[Bug other/92396] -ftime-trace support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92396 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #6 from Alexander Monakov --- Raw data from timevars is not suitable to make a useful-for-users -ftime-trace report. The point of -ftime-trace is to present the person using the compiler with a breakdown on the level of their source files, functions, template instantiations, i.e. something they understand and can change. No need to show users any sort of breakdown by individual GIMPLE/RTL passes: as far as they are concerned it's one complex "code generation" phase they cannot substantially change. The original blog post by Aras Pranckevičius explains this well, contrasting against GCC's and LLVM's -ftime-report: https://aras-p.info/blog/2019/01/12/Investigating-compile-times-and-Clang-ftime-report/ (and part 2: https://aras-p.info/blog/2019/01/16/time-trace-timeline-flame-chart-profiler-for-Clang/ ). GCC simply doesn't measure time on the relevant "axes": we don't split preprocessing time by included files, nor do we split template instantiation time in the C++ frontend by template.
[Bug c/96420] -Wsign-extensions warnings are generated from system header macros
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96420 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #1 from Alexander Monakov --- Minimized standalone testcase:

# 1 "foo.c" 1
# 1 "foo.h" 1
# 1 "foo.h" 3
#define C(x) (0u+(x))
# 2 "foo.c" 2
unsigned f(int x) { return C(x); }
[Bug tree-optimization/96633] missed optimization?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96633 --- Comment #2 from Alexander Monakov --- Martin added me to CC so I assume he wants me to chime in.

First of all, I find Nathan's behavior in that gcc@ thread distasteful at best (but if you ask me, such responses are simply more harm than good; link: https://lwn.net/ml/gcc/1a363f89-6f98-f583-e22a-a7fc02efb...@acm.org/ ).

Next, statements like "I've determined the following is about 12% faster" don't carry weight without details such as the CPU family, the structure of the benchmark and the workload. Obviously, on input that lacks whitespace GCC's original code is faster, as the initial branch is 100% predictable. Likewise, if the input was taken from /dev/random, the 12% figure is irrelevant to real-world uses of such code. What the benchmark is doing with the return value of the function also matters a lot.

With that out of the way: striving to get efficient branchless code on this code is not very valuable in practice, because the caller is likely to perform a conditional branch on the result anyway. So making isWhitespace branchless simply moves the misprediction cost to the caller, making the overall code slower. (but of course such considerations are too complex for the compiler's limited brain)

In general such "bitmask tests" will benefit from the BT instruction on x86 (not an extension, was in the ISA since before I was born), plus CMOV to get the right mask if it doesn't fit in a register. For 100% branchless code we want to generate code similar to:

char is_ws(char c)
{
    unsigned long long mask = 1ll<<' ' | 1<<'\t' | 1<<'\r' | 1<<'\n';
    unsigned long long v = c;
    if (v > 32)
#if 1
        mask = 0;
#else
        return 0;
#endif
    char r;
    asm("bt %1, %2; setc %0" : "=r"(r) : "r"(v), "r"(mask));
    return r;
}

        movsbq  %dil, %rax
        movl    $0, %edx
        movabsq $4294977024, %rdi
        cmpq    $33, %rax
        cmovnb  %rdx, %rdi
        bt      %rax, %rdi; setc %al
        ret

(note we get the %edx zeroing suboptimal; it should have used xor %edx, %edx)

This is generalizable to any input type, not just char. We even already get the "test against a mask" part of the idea right ;)

Branchy testing is even cheaper with BT:

void is_ws_cb(unsigned char c, void f(void))
{
    unsigned long long mask = 1ll<<' ' | 1<<'\t' | 1<<'\r' | 1<<'\n';
    if (c <= 32 && (mask & (1ll << c)))
        f();
}
[Bug tree-optimization/96672] Missing -Wclobbered diagnostic, or: __attribute__((returns_twice)) does not inhibit constant folding across call site
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96672 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #1 from Alexander Monakov --- Looking at dumps, after expanding to RTL we do not have the abnormal edge from the longjmp BB. So while on GIMPLE we preserve modifications of 'x', on RTL we see the 'x = 6' write as dead, and the 'x = 5' write is propagated to the use. (the -Wclobbered warning happens after all the propagation is done) I am surprised the abnormal dispatcher block is not preserved on RTL.
[Bug middle-end/95189] [9/10 Regression] memcmp being wrongly stripped like strcmp
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95189 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #15 from Alexander Monakov --- Is the patch eligible for backporting? Users are hitting this as shown by dups and questions elsewhere like https://stackoverflow.com/questions/63724679/wrong-gcc-9-and-higher-optimization-of-memcmp-with-fno-inline
[Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #5 from Alexander Monakov --- You raise valid points (i.e. it would be good to understand why preallocation is not beneficial, or what's causing the performance gap w.r.t. malloc), but looking at the cache-misses counter does not make sense here (perf is not explicit about that, but it counts misses in L3, and as you see the count is three orders of magnitude lower than that of cycles & instructions, so it's not the main factor in the overall performance picture). As for the comparison against Rust, it spreads more work over available cores: you can see that its "user time" is higher, though "wall-clock time" is the same or lower. In other words, the C++ variant does not achieve good multicore scaling. The main gotcha here is that m_b_r does not allocate on construction, but rather allocates 2x the preallocation size on the first call to 'allocate', and then deallocates when 'release' is called. So it repeatedly calls malloc/free in the inner benchmark loop, whereas your custom allocator allocates on construction and deallocates on destruction, avoiding repeated malloc/free calls in the loop and the associated lock contention when multithreaded. (also obviously it simply does more work in 'allocate', which costs extra cycles)
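A minimal sketch of the gotcha and a workaround (illustrative sizes and loop, not the benchmark's actual code; standard C++17 std::pmr API): giving monotonic_buffer_resource a caller-owned buffer up front avoids the per-iteration malloc/free cycle, because release() rewinds to the initial buffer instead of returning memory to the upstream resource.

#include <cstddef>
#include <memory_resource>
#include <vector>

int main()
{
  std::vector<std::byte> buf(1 << 20);  // allocated once, up front
  std::pmr::monotonic_buffer_resource mbr(buf.data(), buf.size());
  for (int iter = 0; iter < 1000; ++iter) {
    std::pmr::vector<int> v(&mbr);
    v.assign(4096, iter);
    mbr.release();  // rewinds to 'buf'; no free()/malloc() per iteration
  }
}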
[Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942 --- Comment #9 from Alexander Monakov --- The most pronounced difference for depth=18 seems to be caused by m_b_r over-allocating by 2x: internally it mallocs 2x of the size given to the constructor, and then Linux pre-faults those extra pages, penalizing the benchmark. Dividing the estimated size by 2 to counter the over-allocation effect:

MemoryPool store (poolSize(stretch_depth) / 2);

substantially improves the benchmark for me. I think the rest of the slowdown can be attributed to m_b_r simply doing more work internally compared to your bare-bones malloc allocator (I'm seeing less pronounced differences though; I'm testing on a Sandybridge CPU with -O2).
[Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942 --- Comment #14 from Alexander Monakov --- > It adds 11 bytes to the size given to the constructor (for its internal > bookkeeping) and then rounds up to a power of two. What is the purpose of this rounding up?
[Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942 --- Comment #18 from Alexander Monakov --- Huh? malloc is capable of splitting the tail of the last page for reuse in subsequent small allocations, why not let it do it? It will not be "wasted".
[Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942 --- Comment #20 from Alexander Monakov --- Round up to 64 bytes (typical cache line size).
[Bug libstdc++/96942] std::pmr::monotonic_buffer_resource causes CPU cache misses
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942 --- Comment #23 from Alexander Monakov --- Are you benchmarking with bt_pmr_0thrd (attached in comment #3) with depth=18? On earlier tests there are other effects in play too.
[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #2 from Alexander Monakov --- Richard, though register moves are resolved by renaming, they still occupy a uop in all stages except execution, and since renaming is one of the narrowest points in the pipeline (only up to 4 uops/cycle on Intel), reducing the number of uops generally helps. In Michael's testcase the actual memory address has two operands:

< vmovapd      %ymm1, %ymm10
< vmovapd      %ymm1, %ymm11
< vfnmadd213pd (%rdx,%rax), %ymm9, %ymm10
< vfnmadd213pd (%rcx,%rax), %ymm7, %ymm11
---
> vmovupd      (%rdx,%rax), %ymm10
> vmovupd      (%rcx,%rax), %ymm11
> vfnmadd231pd %ymm1, %ymm9, %ymm10
> vfnmadd231pd %ymm1, %ymm7, %ymm11

The "uop" that carries the operands of vfnmadd213pd gets "unlaminated" before renaming (because otherwise there would be too many operands to handle). Hence the original code has 4 uops after decoding, 6 uops before renaming, and the transformed code has 4 uops before renaming. Execution handles 4 uops in both cases. FMA unlamination is mentioned in https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes Michael, you can probably measure it for yourself with

perf stat -e cycles,instructions,uops_retired.all,uops_retired.retire_slots
[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #4 from Alexander Monakov --- > More so, gcc variant occupies 2 reservation station entries (2 fused uOps) vs > 4 entries by de-transformed sequence. I don't think this is true for the test at hand? With base+offset memory operand the renaming stage already sees two separate uops for each fma, so reservation etc. should also see two for each fma, 4 uops in total. And they will not be fused. It would be true if memory operands required just one register (and then pressure on renaming stage would be the same for both variants). > For me it's enough to know that it *is* slower. Understood, but I hope GCC developers want to understand the nature of the slowdown before attempting to fix it.
[Bug inline-asm/92151] Spurious register copying
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92151 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #3 from Alexander Monakov --- >> (or write actual assembly rather than using inline-asm). > In this case, yes -- I now declare the function "naked" and avoid the issue. I think this solution (hand-writing asm for the entire function) is generally undesirable because you become responsible for the ABI/calling convention, the compiler won't help you with things like properly restoring callee-saved registers, and violations may stay unnoticed as long as callers don't try to use a particular callee-saved reg. Here's a manually reduced variant that exhibits a similar issue at -O1:

void foo(int num, int c)
{
  asm("# %0" : "+r"(num));
  while (--c)
    asm goto("# %0" :: "r"(num) :: l2);
l2:
  asm("# %0" :: "r"(num));
}

The main issue seems to be our 'asmcons' pass transforming RTL in such a way that REG_DEAD notes are "behind" the actual death, so if the RA takes them literally it operates on wrong (too conservative) lifetime information; e.g., for the first asm, just before IRA we have:

(insn 29 4 8 2 (set (reg:SI 84 [ num ])
        (reg:SI 85)) "./example.c":3:5 -1
     (nil))
(insn 8 29 7 2 (parallel [
            (set (reg:SI 84 [ num ])
                (asm_operands:SI ("# %0") ("=r") 0 [
                        (reg:SI 84 [ num ])
                    ]
                    [
                        (asm_input:SI ("0") ./example.c:3)
                    ] [] ./example.c:3))
            (clobber (reg:CC 17 flags))
        ]) "./example.c":3:5 -1
     (expr_list:REG_DEAD (reg:SI 85)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))

but register 85 actually dies in insn 29, not in insn 8.
[Bug middle-end/92250] valgrind: ira_traverse_loop_tree – Conditional jump or move depends on uninitialised value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92250 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #1 from Alexander Monakov --- Be sure to enable Valgrind annotations (configure with --enable-valgrind-annotations), otherwise false positives on sparseset functions are expected: sparse set algorithm accesses uninitialized memory by design (an explanation is available at e.g. https://research.swtch.com/sparse ).
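A minimal sketch of the sparse-set scheme (after Briggs & Torczon; see the link above) showing why the false positives arise -- names here are illustrative, not GCC's sparseset API. The 'sparse' array is deliberately left uninitialized: membership tests read it before it is ever written, yet still return the correct answer, and this is exactly the read Valgrind reports unless the annotations are compiled in.

struct sparse_set {
  unsigned *sparse, *dense, n;
  explicit sparse_set(unsigned universe)
    : sparse(new unsigned[universe]),  // not zeroed, on purpose
      dense(new unsigned[universe]), n(0) {}
  ~sparse_set() { delete[] sparse; delete[] dense; }
  bool member(unsigned v) const {
    unsigned i = sparse[v];            // possibly uninitialized read...
    return i < n && dense[i] == v;     // ...but the answer is still correct
  }
  void insert(unsigned v) {
    if (!member(v)) { dense[n] = v; sparse[v] = n++; }
  }
};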
[Bug rtl-optimization/87047] [7/8/9 Regression] performance regression because of if-conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87047 --- Comment #16 from Alexander Monakov --- I'd like to backport this to gcc-9 branch and then close this bug (Richi already indicated that further backports are not desirable). Thoughts?
[Bug rtl-optimization/87047] [7/8/9 Regression] performance regression because of if-conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87047 Alexander Monakov changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #19 from Alexander Monakov --- Nothing left to do then, closing.
[Bug tree-optimization/92283] [10 Regression] 454.calculix miscomparison since r276645 with -O2 -march=znver2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92283 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #17 from Alexander Monakov --- (In reply to Richard Biener from comment #16)
> interestingly 66:66 and 67:67 generate exactly the same code and
> 66:67 add a single loop. That's totally odd but probably an
> artifact of a bug in dbg_cnt_is_enabled which does
>
> bool
> dbg_cnt_is_enabled (enum debug_counter index)
> {
>   unsigned v = count[index];
>   return v > limit_low[index] && v <= limit_high[index];
> }
>
> where it should be v >= limit_low[index].

This is intentionally like that, the idea is that a:b makes a half-open interval with the right bound (b) not included. So 66:66 and 67:67 are both simply empty intervals. dbg_cnt_is_enabled tests the left bound with '>' and the right bound with '<=' because its caller (dbg_cnt) incremented the counter before the call.
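A sketch of that increment-before-test calling pattern (not GCC's exact code): because dbg_cnt bumps the counter before consulting the bounds, '>' on the low bound and '<=' on the high bound make "low:high" a half-open interval over events numbered from 0 -- events low .. high-1 fire, and N:N is empty.

enum debug_counter { dbg_cnt_example, debug_counter_number };

static unsigned count[debug_counter_number];
static unsigned limit_low[debug_counter_number];
static unsigned limit_high[debug_counter_number];

static bool dbg_cnt_is_enabled(debug_counter index)
{
  unsigned v = count[index];
  return v > limit_low[index] && v <= limit_high[index];
}

bool dbg_cnt(debug_counter index)
{
  count[index]++;                    // counter incremented first...
  return dbg_cnt_is_enabled(index);  // ...then tested against the bounds
}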
[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #8 from Alexander Monakov --- The full preprocessed source is provided and it clearly says typedef unsigned char uint8_t; in line 10, so it is in fact a character type. It's suspicious that cmpxchg_using_helper does not return a value (incorrectly reduced testcase?) and there's still an aliasing violation when atomic_cmpxchg_func tries to cast 'dest' from uint8_t* to int*. I think the report was closed prematurely. Aleksei - always provide output of 'gcc -v' when reporting such bugs, otherwise people may be unable to reproduce it when there's really a problem (no way to tell how your compiler was configured or even its exact version).
[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462 --- Comment #10 from Alexander Monakov ---
> atomic_cmpxchg_func tries to cast 'dest' from uint8_t* to int*

I made a typo here: I meant uint32_t rather than uint8_t, and there's no aliasing violation here, as a signedness difference is explicitly OK. It doesn't matter if the function in user code is named cmpxchg or dsjfhg; whether gcc can emit a more efficient bytewise CAS is irrelevant when the user complains that PRE is miscompiling their code. uint8_t is obviously a character type in this particular testcase (as well as, fwiw, on all Glibc targets). OTOH, that cmpxchg_using_helper does not return a value is a serious problem: that is undefined behavior in C++. You'll need to submit a valid testcase without that issue.
[Bug rtl-optimization/91161] [9/10 Regression] ICE in begin_move_insn, at sched-ebb.c:175
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91161 --- Comment #3 from Alexander Monakov --- With -fno-dce, a NOTE_INSN_DELETED_LABEL appears between the last "real" insn in the basic block (a sibcall) and a barrier rtx:

(call_insn/u/c 20 19 12 3 (call (mem:QI (symbol_ref:DI ("ni") [flags 0x3] ) [0 ni S1 A8])
        (const_int 0 [0])) "pr91161.c":23:7 679 {*call}
     (expr_list:REG_DEAD (reg:DI 5 di)
        (expr_list:REG_DEAD (reg:QI 0 ax)
            (expr_list:REG_CALL_DECL (symbol_ref:DI ("ni") [flags 0x3] )
                (expr_list:REG_ARGS_SIZE (const_int 0 [0])
                    (expr_list:REG_NORETURN (const_int 0 [0])
                        (expr_list:REG_EH_REGION (const_int 0 [0])
                            (nil)))))))
    (expr_list (use (reg:QI 0 ax))
        (expr_list:DI (use (reg:DI 5 di))
            (nil))))
(note 12 20 21 ("x6") NOTE_INSN_DELETED_LABEL 5)
(barrier 21 12 22)

Is this valid? I assume NOTE_INSN_DELETED can appear in that position as well? If so, shouldn't begin_move_insn use next_nonnote_insn rather than plain NEXT_INSN to find either the barrier or the label of the next bb?
[Bug c++/92597] std::fma gives nan using -march=sandybridge+ with asm volatile
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92597 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #1 from Alexander Monakov --- Testcase isolating what appears to be the biggest issue in the original:

double f()
{
  double d = -1;
  asm("" : "+m,r"(d));
  return d;
}

long double g()
{
  long double d = -1;
  asm("" : "+m,r"(d));
  return d;
}
[Bug c++/92572] Vague linkage does not work reliably when a matching segment is in a dynamically linked libarary on Linux
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92572 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #2 from Alexander Monakov --- Please show output of 'cc -v' and attach assembly for main.cc. Providing vague linkage semantics with dynamic linking is a tricky area, especially when dlopen is in play, more so with RTLD_LOCAL as in this example. For example, if you wanted vague-linkage objects to be unified across multiple dlopen'ed libraries (each with RTLD_LOCAL), you'd need special support from the toolchain and the dynamic linker. At some point the GNU toolchain invented a new special ELF symbol binding type, STB_GNU_UNIQUE, but it turned out to cause other issues. It can be disabled in the compiler with --disable-gnu-unique-object, in which case the outcome you show here is expected. I think on non-GNU systems you'll likely get "1" rather than "2".
[Bug tree-optimization/92645] Hand written vector code is 450 times slower when compiled with GCC compared to Clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92645 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #8 from Alexander Monakov --- (In reply to Richard Biener from comment #5) > > "extracting" the actual loops (inlined and all) in intrinsic form as a C > testcase would be really really nice. Something like the following? Enjoy!

typedef unsigned int u32v4 __attribute__((vector_size(16)));
typedef unsigned short u16v16 __attribute__((vector_size(32)));
typedef unsigned char u8v16 __attribute__((vector_size(16)));

union vec128 {
  u8v16 u8;
  u32v4 u32;
};

#define memcpy __builtin_memcpy

u16v16 zxt(u8v16 x)
{
  return (u16v16) {
    x[0], x[1], x[2],  x[3],  x[4],  x[5],  x[6],  x[7],
    x[8], x[9], x[10], x[11], x[12], x[13], x[14], x[15]
  };
}

u8v16 narrow(u16v16 x)
{
  return (u8v16) {
    x[0], x[1], x[2],  x[3],  x[4],  x[5],  x[6],  x[7],
    x[8], x[9], x[10], x[11], x[12], x[13], x[14], x[15]
  };
}

void f(char *dst, char *src, unsigned long n, unsigned c)
{
  unsigned ia = 255 - (c >> 24);
  ia += ia >> 7;

  union vec128 c4 = {0}, ia16 = {0};
  c4.u32 += c;
  ia16.u8 += (unsigned char)ia;

  u16v16 c16 = (zxt(c4.u8) << 8) + 128;

  for (; n; src += 16, dst += 16, n -= 4) {
    union vec128 s;
    memcpy(&s, src, sizeof s);
    s.u8 = narrow((zxt(s.u8)*zxt(ia16.u8) + c16) >> 8);
    memcpy(dst, &s, sizeof s);
  }
}
[Bug tree-optimization/92768] [8/9/10 Regression] Maybe a wrong code for vector constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92768 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #8 from Alexander Monakov --- Previously, in PR 86999 I pointed out to the reporter that it was okay for gcc to turn a vector constructor with negative zeros to a trivial all-positive-zeros constructor under -fno-signed-zeros, and nobody contradicted me at the time. I think the documentation needs to be clarified if that's not the intent, right now I cannot for sure deduce from the manual what exactly the optimizations may or may not do when constant propagation or such produces a "negative zero" value.
[Bug c++/92855] -fvisibility-inlines-hidden failing to hide out-of-line copies of certain inline member functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92855 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #4 from Alexander Monakov --- (FWIW, making 'f' a template in your example makes it non-hidden) Can you explain why you expect the command-line option to override the attribute on the namespace? GCC usually implements the opposite, i.e. attributes prevail over the defaults specified on the command line. In your sample on Godbolt, Clang also appears to honour the attribute rather than the option.
[Bug c++/92855] -fvisibility-inlines-hidden failing to hide out-of-line copies of certain inline member functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92855 Alexander Monakov changed: What|Removed |Added Resolution|INVALID |DUPLICATE --- Comment #6 from Alexander Monakov --- Thanks. PR 47877 is definitely related but not an exact duplicate. Here we have the visibility attribute on the enclosing namespace, and even though the documentation does not spell out what should happen, it appears the intent is that the option should prevail (so inline functions in the namespace would need to be decorated with the visibility attribute individually to make them non-hidden). I'll close this as duplicate and add an example with a namespace to the older PR. *** This bug has been marked as a duplicate of bug 47877 ***
[Bug c++/47877] -fvisibility-inlines-hidden does not hide member template functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47877 Alexander Monakov changed: What|Removed |Added CC||thiago at kde dot org --- Comment #4 from Alexander Monakov --- *** Bug 92855 has been marked as a duplicate of this bug. ***
[Bug c++/47877] -fvisibility-inlines-hidden does not hide member template functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47877 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #5 from Alexander Monakov --- In PR 92855 we have a similar situation where an inline template function inherits visibility from the enclosing namespace, while a non-template function becomes hidden as requested by -fvisibility-inlines-hidden:

namespace N __attribute__((visibility("default")))
{
  inline void foo() {};
  template <class T> inline void bar() {};
}

int main()
{
  N::foo();
  N::bar<int>();
}
[Bug rtl-optimization/92905] New: [10 Regression] Spills float-int union to memory
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92905

Bug ID: 92905
Summary: [10 Regression] Spills float-int union to memory
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Keywords: missed-optimization, ra
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---

gcc-10 branch regressed for code that needs bitwise operations on floats:

float f(float x)
{
  union {float f; unsigned i;} u = {x};
  u.i |= 0x80000000;
  return u.f;
}

float my_copysign(float x, float y)
{
  union {float f; unsigned i;} ux = {x}, uy = {y};
  ux.i &= 0x7fffffff;
  ux.i |= 0x80000000 & uy.i;
  return ux.f;
}

For function 'f' gcc-10 -O2 -mtune=intel generates

f:
        movd    %xmm0, -4(%rsp)
        movl    $-2147483648, %eax
        orl     -4(%rsp), %eax
        movd    %eax, %xmm0
        ret

while gcc-9 and earlier generate code without stack use, even without -mtune=intel:

f:
        movd    %xmm0, %eax
        orl     $-2147483648, %eax
        movd    %eax, %xmm0
        ret

Likewise for the more realistic my_copysign, where ux is spilled, but uy is not. Eventually it would be nicer to use SSE bitwise operations for this; for example, LLVM already generates

f:
        orps    .LCPI0_0(%rip), %xmm0
[Bug target/92905] [10 Regression] Spills float-int union to memory
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92905 --- Comment #4 from Alexander Monakov --- Perhaps only xmm0 is problematic, as making xmm0 unused by adding a dummy argument brings back the old spill-free result:

float my_copysign(float dummy, float x, float y)
{
  union {float f; unsigned i;} ux = {x}, uy = {y};
  ux.i &= 0x7fffffff;
  ux.i |= 0x80000000 & uy.i;
  return ux.f;
}

float f(float dummy, float x)
{
  union {float f; unsigned i;} u = {x};
  u.i |= 0x80000000;
  return u.f;
}
[Bug rtl-optimization/92953] New: Undesired if-conversion with overflow builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92953

Bug ID: 92953
Summary: Undesired if-conversion with overflow builtins
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---

Consider:

/* Return 0 if a==b, any positive value if a>b, any negative value otherwise. */
int foo(int a, int b)
{
  int c;
  if (__builtin_sub_overflow(a, b, &c))
    c = 1 | ~c;
  return c;
}

(suggestions for implementations that would be more efficient on x86 welcome)

On x86, -Os gives the expected

foo:
        subl    %esi, %edi
        movl    %edi, %eax
        jno     .L1
        notl    %eax
        orl     $1, %eax
.L1:
        ret

but with -O2 there's if-conversion despite internal-fn.c marking the branch as "very_unlikely":

foo:
        xorl    %edx, %edx
        subl    %esi, %edi
        movl    %edi, %eax
        seto    %dl
        notl    %eax
        orl     $1, %eax
        testl   %edx, %edx
        cmove   %edi, %eax
        ret

Adding __builtin_expect to the source doesn't help. Adding __builtin_expect_with_probability helps when the specified probability is very low (<3%), but I feel that shouldn't be required here.

Looking at the expand dump, on RTL we start with two branches: the first from expanding the internal fn to calculate a 0/1 predicate value, the second corresponding to the "if" in the source, branching on a test of that predicate against 0. At -Os, we rely on the first if-conversion pass to eliminate the first branch, and then on combine to optimize the second branch. Is it possible to expand straight to one branch by noticing that the predicate is only used in the gimple conditional that follows immediately?
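For reference, the __builtin_expect_with_probability workaround mentioned above, spelled out as a sketch (the 2% figure is just an example value below the ~3% threshold observed above):

int foo(int a, int b)
{
  int c;
  if (__builtin_expect_with_probability(
          __builtin_sub_overflow(a, b, &c), 1, 0.02))
    c = 1 | ~c;
  return c;
}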
[Bug target/92953] Undesired if-conversion with overflow builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92953 --- Comment #2 from Alexander Monakov --- Well, the aarch64 backend does not implement the subv<mode>4 pattern in the first place, which would be required for efficient branchy code:

foo:
        subs    w0, w0, w1
        b.vc    .LBB0_2
        mvn     w0, w0
        orr     w0, w0, #0x1
.LBB0_2:
        ret

This is preferable when the branch is predictable, thanks to the shorter dependency chain.
[Bug target/66120] __builtin_add/sub_overflow for int32_t emit poor code on ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66120 Alexander Monakov changed: What|Removed |Added Status|NEW |RESOLVED CC||amonakov at gcc dot gnu.org Resolution|--- |FIXED --- Comment #5 from Alexander Monakov --- Looks like the documentation was added in r230651, overflow patterns for arm in r239739, and for arm64 in r262890.
[Bug target/92953] Undesired if-conversion with overflow builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92953 --- Comment #4 from Alexander Monakov --- At least then GCC should try to use cmovno instead of seto-test-cmove for if-conversion:

foo:
        movl    %edi, %eax
        subl    %esi, %eax
        notl    %eax
        orl     $1, %eax
        subl    %esi, %edi
        cmovno  %edi, %eax
        ret
[Bug c/93031] Wish: When the underlying ISA does not force pointer alignment, option to make GCC not assume it
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93031 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #2 from Alexander Monakov --- That must be the most well-written report I've seen so far sacrificed to the God of Unfairly Closed Bugreports. Note that GCC aims to allow partial overlap for situations when alignment
[Bug target/93039] New: Fails to use SSE bitwise ops for float-as-int manipulations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

Bug ID: 93039
Summary: Fails to use SSE bitwise ops for float-as-int manipulations
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---

(the non-regression part of PR 92905)

libm functions need to manipulate individual bits of float/double representations with good efficiency, but on x86 gcc typically does them on gprs even when it results in an sse-gpreg-sse move chain:

float foo(float x)
{
  union {float f; unsigned i;} u = {x};
  u.i &= ~0x80000000;
  return u.f;
}

foo:
        movd    eax, xmm0
        and     eax, 2147483647
        movd    xmm0, eax
        ret

It's good to use bitwise ops on general registers if the source or destination needs to be in a general register, but for cases like the above creating a roundtrip is not desirable. (GCC gets this example right on aarch64; LLVM on x86 compiles this to an SSE/AVX bitwise 'and', taking the immediate from memory)
[Bug target/92905] [10 Regression] Spills float-int union to memory
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92905 --- Comment #8 from Alexander Monakov --- (In reply to Alexander Monakov from comment #0)
> Eventually it would be nicer to use SSE bitwise operations for this, for
> example LLVM already generates
> f:
>         orps    .LCPI0_0(%rip), %xmm0

This is now reported separately as PR 93039.
[Bug tree-optimization/93055] accumulation loops in stepanov_vector benchmark use more instruction level parallelism
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93055 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #1 from Alexander Monakov --- GCC is missing a smarter unrolling that would factor dependency chains in such tiny loops. Also, how on Earth do we get this invariant computation inside the loop? > lea 0x2100(%rsp),%rdi That's probably a regression that could be investigated and fixed separately.
[Bug tree-optimization/93055] accumulation loops in stepanov_vector benchmark use more instruction level parallelism
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93055 --- Comment #2 from Alexander Monakov --- Can you attach preprocessed source and double-check command-line flags? I can't reproduce the problem with lea, and the code does not have explicit prefetch instructions that I get with -O3 -march=bdver1
[Bug tree-optimization/93056] Poor codegen for heapsort in stepanov_vector benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93056 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #1 from Alexander Monakov --- The benchmark sorts a 2000-entry random array, so GCC's version runs with a high branch misprediction rate. Clang's version is if-converted; it issues one extra load compared to gcc. PRE makes it very difficult to if-convert this on RTL; with -fno-tree-pre we even get nicer code, but still not if-converted, so slower than Clang.
[Bug tree-optimization/93055] accumulation loops in stepanov_vector benchmark use more instruction level parallelism
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93055 --- Comment #4 from Alexander Monakov --- The attachment is edited to test insertion_sort, and doesn't call accumulate_vector at all - looks like you attached a wrong file?
[Bug c/93072] [8/9/10 Regression] ICE: gimplifier segfault with undefined nested function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93072 Alexander Monakov changed: What|Removed |Added Keywords||ice-on-invalid-code Status|UNCONFIRMED |NEW Last reconfirmed||2019-12-25 CC||amonakov at gcc dot gnu.org Summary|ICE: Segmentation fault |[8/9/10 Regression] ICE: ||gimplifier segfault with ||undefined nested function Ever confirmed|0 |1 --- Comment #1 from Alexander Monakov --- ICEs since gcc-7; gcc-6 just diagnosed a nested function with no body as invalid.
[Bug target/93078] Missing fma and round functions auto-vectorization with x86-64 (sse2)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93078 Alexander Monakov changed: What|Removed |Added Keywords||missed-optimization Status|UNCONFIRMED |NEW Last reconfirmed||2019-12-27 CC||amonakov at gcc dot gnu.org Component|tree-optimization |target Ever confirmed|0 |1 --- Comment #1 from Alexander Monakov --- > [...] not sure why dont auto-vectorize the function round directly to > "roundps xmm0, XMMWORD PTR a[rip], 0" This is because the C round function is specified to use a non-standard rounding, with halfway cases away from zero, not to nearest even. Hence a few extra instructions to fix up halfway arguments. For nearbyint, looks like CASE_CFN_NEARBYINT is not handled in ix86_builtin_vectorized_function. Confirming for this case.
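The semantic difference called out above, in a scalar sketch: round() ties away from zero, while nearbyint() honors the current rounding mode (round-to-nearest-even by default), which is what the ROUNDPS immediate mode 0 implements directly.

#include <cmath>
#include <cstdio>

int main()
{
  std::printf("%g %g\n", std::round(0.5), std::nearbyint(0.5));  // prints: 1 0
  std::printf("%g %g\n", std::round(2.5), std::nearbyint(2.5));  // prints: 3 2
}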
[Bug rtl-optimization/49330] Integer arithmetic on addresses optimised with pointer arithmetic rules
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49330 --- Comment #29 from Alexander Monakov --- (In reply to Alexander Cherepanov from comment #28) > I see the same even with pure pointers. I guess RTL doesn't care about such > differences but it means the problem could bite a relatively innocent code. Can you please open a separate bugreport for this and reference the new bug # here? It's a separate issue, and it's also a regression, gcc-4.7 did not miscompile this. The responsible pass seems to be RTL DSE.
[Bug target/29776] result of ffs/clz/ctz/popcount/parity are already sign-extended
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29776 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #23 from Alexander Monakov --- In libcpp, search_line_fast implementations suffer from this, and what is worse, attempts to workaround it by explicitly requesting zero extension don't work: char *foo(char *p, int x) { return p + (unsigned)__builtin_ctz(x); } The above code is deliberately asking for zero extension, and yet various optimizations in GCC transform it back to costlier form with sign extension. (FWIW, LLVM gets this right)
[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #3 from Alexander Monakov --- The compiler has no way of knowing ahead of time that you will be evaluating the result on random data; for mostly-sorted arrays branching is arguably preferable. __builtin_expect_with_probability is a poor proxy for unpredictability: a condition that is true every other time leads to a branch that is both very predictable and has probability 0.5. I think what you really need is a way to express branchless selection in the source code when you know you need it but the compiler cannot see that on its own. Other algorithms like constant-time checks for security-sensitive applications probably also need such computational primitive. So perhaps an unpopular opinion, but I'd say a __builtin_branchless_select(c, a, b) (guaranteed to live throughout optimization pipeline as a non-branchy COND_EXPR) is badly missing.
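As a sketch of the computational primitive being proposed -- and of what users write today, with no guarantee the optimizer keeps it branchless -- a mask-based select:

static inline int select_branchless(int c, int a, int b)
{
  int mask = -(c != 0);              // all-ones when c is true, zero otherwise
  return (a & mask) | (b & ~mask);   // picks a or b with no branch
}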
[Bug target/93039] Fails to use SSE bitwise ops for float-as-int manipulations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039 --- Comment #3 from Alexander Monakov --- > The question is for which CPUs is it actually faster to use SSE? In the context of chains where the source and the destination need to be SSE registers, pretty much all CPUs? Inter-unit moves typically have some latency, e.g. recent AMD (since Zen) and Intel (Skylake) have latency 3 for sse<->gpr moves (surprisingly though four generations prior to Skylake had latency 1). Older AMDs with shared fpu had even worse latencies. At the same time SSE integer ops have comparable latencies and throughput to gpr ones, so generally moving a chain to SSE ops isn't making it slower. Plus it helps with register pressure. When either the source or the destination of a chain is bound to a general register or memory, it's ok to continue doing it on general regs.
[Bug target/93039] Fails to use SSE bitwise ops for float-as-int manipulations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039 --- Comment #5 from Alexander Monakov --- Ah, in that sense. The extra load is problematic in cold code where it's likely a TLB miss. For hot code: the load does not depend on any previous computations and so does not increase dependency chains. So it's ok from a latency point of view; from a throughput point of view there's a tradeoff: one extra load per chain may be ok, but if every other instruction in a chain needs a different load, that's probably excessive. So it needs to be costed somehow. That said, sufficiently simple constants can be synthesized with SSE in place without loading them from memory, for example the constant in the opening example:

pcmpeqd %xmm1, %xmm1   // xmm1 = ~0
pslld   $31, %xmm1     // xmm1 <<= 31

(again, if we need to synthesize just one constant per chain that's preferable; if we need many, the extra work would need to be costed against the latency improvement of keeping the chain on SSE)
[Bug target/93274] target_clones produces symbols with random digits with -fPIC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93274 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #4 from Alexander Monakov --- IFUNCs don't have to be somehow "globally visible". The comment is there from day 1, but it's not clear why - possibly a misunderstanding? GCC happily accepts a static ifunc, and the rest of the toolchain should have no problem either:

__attribute__((used,ifunc("r_f")))
static void f();
static void *r_f() { return 0; }

        .type   r_f, @function
r_f:
        xorl    %eax, %eax
        ret
        .size   r_f, .-r_f
        .type   f, @gnu_indirect_function
        .set    f, r_f
[Bug testsuite/90565] [10 regression] test cases gcc.dg/uninit-18.c and uninit-pr90394-1-gimple.c broken as of r271460
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90565 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #4 from Alexander Monakov --- I think this should have been fully resolved by PR90587 fix as indicated by Vlad in his gcc-patches message: https://gcc.gnu.org/ml/gcc-patches/2019-05/msg01600.html
[Bug c/93278] huge almost empty array takes huge time to compile and produces huge object file
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93278 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #8 from Alexander Monakov --- (Jakub - the assembler could emit a file with holes by lseek()'ing over zeroed areas instead of write()'ing literal zeroes to the file) I see the bug is closed, but for the sake of adding some clarity: if bin/gcc by default produces a file named "a.exe", that suggests you're on Windows. There's a good reason why you're asked to show output of 'gcc -v' (not 'gcc --version'!): it has configuration info including compiler host system. If you're really on Windows the slowness is probably explained by Windows-specific I/O overheads, e.g. an antivirus intercepting and blocking writes. Your timing info amounts to 1 millisecond per 4KB chunk of output. On Linux the assembler needs 0.007 seconds on my machine, amounting to microseconds per 4KB chunk.
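A sketch of the lseek() idea from the parenthetical above (plain POSIX calls, not the assembler's actual code): seeking past a run of zeros instead of writing them produces a sparse file on filesystems that support holes, so the huge all-zero middle of the object costs neither I/O nor disk blocks.

#include <fcntl.h>
#include <unistd.h>

int main()
{
  int fd = open("huge.o", O_CREAT | O_WRONLY | O_TRUNC, 0644);
  write(fd, "data", 4);            // real bytes are written normally
  lseek(fd, 1 << 30, SEEK_CUR);    // skip ~1 GiB of zeros: no write() calls
  write(fd, "data", 4);            // the hole reads back as zeros
  return close(fd);
}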
[Bug target/91838] [8/9 Regression] incorrect use of shr and shrx to shift by 64, missed optimization of vector shift
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91838 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #3 from Alexander Monakov --- For unsigned int this would be undefined behavior via an attempt to shift by 32 (but we fail to emit the corresponding warning). For narrower types (char, short) this seems well-defined.
[Bug target/91824] unnecessary sign-extension after _mm_movemask_epi8 or __builtin_popcount
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91824 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #1 from Alexander Monakov --- A related problem was previously discussed in PR 29776 but this is not a duplicate. In that PR part of the problem was with UB of clz/ctz for zero inputs, none of that poses a problem for popcount and pmovmskb.
[Bug rtl-optimization/93402] [8/9/10 Regression] Wrong code when returning padded struct
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93402 Alexander Monakov changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2020-01-23 CC||amonakov at gcc dot gnu.org Summary|Wrong code when returning |[8/9/10 Regression] Wrong |padded struct |code when returning padded ||struct Ever confirmed|0 |1 --- Comment #1 from Alexander Monakov --- Broken by postreload-cse, -fdbg-cnt=postreload_cse:0 gives good code.
[Bug middle-end/90348] [8/9/10 Regression] Partition of char arrays is incorrect in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90348 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #17 from Alexander Monakov --- I think part of the problem is trying to make "deaths" explicit via CLOBBERs without making "births" also explicit in the IR. Doing both would have allowed a lifetime verifier that checks that births dominate all references to a variable and likewise deaths/clobbers postdominate all references, which would presumably catch this early and make the IR more rigorous overall.
[Bug middle-end/90348] [8/9/10 Regression] Partition of char arrays is incorrect in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90348 --- Comment #19 from Alexander Monakov --- (In reply to Michael Matz from comment #18) > represent all accesses indirectly via pointers Would that be necessary in presence of a verifier that ensures that all references are dominated by births?
[Bug target/91838] [8/9 Regression] incorrect use of shr and shrx to shift by 64, missed optimization of vector shift
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91838 --- Comment #5 from Alexander Monakov --- Ah, indeed, it should be explicitly UB, and the documentation should mention that, as well as the fact that implicit integer promotion does not happen for vector shifts and other operations.
[Bug tree-optimization/93301] Wrong optimization: instability of uninitialized variables leads to nonsense
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93301 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #7 from Alexander Monakov --- (you mean unreachable code, not dead code) Nice find, this is definitely a bug. As you say, loop unswitching introduces an unconditional use of an uninitialized variable which otherwise is conditional and (might be) never executed. The testcase hits a problematic early-out in is_maybe_undefined:

      /* Uses in stmts always executed when the region header executes
         are fine.  */
      if (dominated_by_p (CDI_DOMINATORS, loop->header, gimple_bb (def)))
        continue;

the code does not match the comment: checking postdominators might be correct, but not dominators. This was introduced by r245057 for PR71691, so technically an 8/9/10 regression. Probably worth splitting into a separate PR, as this is more serious and might be more straightforward to fix than the earlier testcases.
[Bug tree-optimization/93444] New: [8/9/10 Regression] unswitching introduces unconditional use of uninitialized variable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93444

Bug ID: 93444
Summary: [8/9/10 Regression] unswitching introduces unconditional use of uninitialized variable
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Keywords: wrong-code
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
CC: ch3root at openwall dot com
Target Milestone: ---

Splitting out bug 93301 comments 6 and 7.

__attribute__((noipa)) static int opaque(int i) { return i; }

int main()
{
  short x = opaque(1);
  short y;
  opaque(x - 1);
  while (opaque(1)) {
    __builtin_printf("x = %d; x - 1 = %d\n", x, opaque(1) ? x - 1 : 5);
    if (opaque(1))
      break;
    if (x - 1 == y)
      opaque(y);
  }
}

Prints "x = 1; x - 1 = 5" at -O3. Loop unswitching introduces an unconditional use of an uninitialized variable which otherwise is conditional and never executed. The testcase hits a problematic early-out in is_maybe_undefined:

      /* Uses in stmts always executed when the region header executes
         are fine.  */
      if (dominated_by_p (CDI_DOMINATORS, loop->header, gimple_bb (def)))
        continue;

the code does not match the comment: checking postdominators might be correct, but not dominators. This was introduced by r245057 for PR71691.
[Bug tree-optimization/93301] Wrong optimization: instability of uninitialized variables leads to nonsense
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93301 --- Comment #8 from Alexander Monakov --- Pasted that to new PR 93444 (should have done that right away, sorry).
[Bug tree-optimization/93444] [8/9/10 Regression] ssa-loop-im introduces unconditional use of uninitialized variable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93444 Alexander Monakov changed: What|Removed |Added Summary|[8/9/10 Regression] |[8/9/10 Regression] |unswitching introduces |ssa-loop-im introduces |unconditional use of|unconditional use of |uninitialized variable |uninitialized variable --- Comment #1 from Alexander Monakov --- Actually, scratch that. I should have double-checked the order of arguments to dominated_by_p (and the tree dump). This code is not to blame. The problem starts earlier when tree-ssa-loop-im moves _7 = (int) y_21(D); out of the loop, making the access unconditional when it was conditional in the loop and actually unreachable at runtime. (editing the subject to reflect this, but keeping the regression marker)
[Bug tree-optimization/93444] [8/9/10 Regression] ssa-loop-im introduces unconditional use of uninitialized variable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93444 --- Comment #5 from Alexander Monakov --- The problem is lifting a conditional access. We don't have an example where lifting an invariant from an always-executed block in a loop to its preheader poses a problem. LLVM adopted an approach where hoisting must "freeze" the "poisoned" values resulting from uninitialized access so they acquire a concrete unpredictable value: https://www.cs.utah.edu/~regehr/papers/undef-pldi17.pdf
[Bug tree-optimization/93491] Wrong optimization: const-function moved over control flow leading to crashes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93491 Alexander Monakov changed: What|Removed |Added Status|WAITING |NEW CC||amonakov at gcc dot gnu.org --- Comment #4 from Alexander Monakov --- (In reply to Alexander Cherepanov from comment #2) > > Do you have a testcase were gcc does this optimize without the user adding > > const and still traps? > > No. I'll file a separate bug if I stumble upon one, so please disregard this > possibility for now. GCC will deduce that g is const just fine; it even tells you so with -Wsuggest-attribute=const:

__attribute__((noipa)) void f(int i) { __builtin_exit(i); }

__attribute__((noinline)) int g(int i) { return 1 / i; }

int main()
{
  while (1) {
    f(0);
    f(g(0));
  }
}

Thus removing WAITING and confirming.
[Bug tree-optimization/93521] 40% slower in O2 than O1 (tree-pre)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93521 Alexander Monakov changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED CC||amonakov at gcc dot gnu.org Resolution|--- |DUPLICATE --- Comment #1 from Alexander Monakov --- Dup. *** This bug has been marked as a duplicate of bug 93056 ***
[Bug tree-optimization/93056] Poor codegen for heapsort in stepanov_vector benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93056 Alexander Monakov changed: What|Removed |Added CC||hehaochen at hotmail dot com --- Comment #2 from Alexander Monakov --- *** Bug 93521 has been marked as a duplicate of this bug. ***
[Bug c++/92572] Vague linkage does not work reliably when a matching segment is in a dynamically linked libarary on Linux
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92572 --- Comment #5 from Alexander Monakov --- GCC is emitting static_local as @gnu_unique_object, so it should be unified by the glibc dynamic linker. You can run 'nm -CD' on the main executable and the library after linking to check the symbol's type and make sure ld keeps it unique, and use LD_DEBUG=all (see 'man ld.so') to watch how it gets resolved at runtime.
[Bug rtl-optimization/88879] [9 Regression] ICE in sel_target_adjust_priority, at sel-sched.c:3332
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88879 --- Comment #15 from Alexander Monakov --- This should not be reproducible with current HEAD because the assert was simply eliminated. If GCC master definitely fails, can you please provide the exact diagnostic? As for 9.2, this is sadly expected because the patch was not backported; I will backport it soon for the next release from the gcc-9 branch (but if you're building GCC yourself, you can easily do it on your end, as the patch simply removes the offending assert). Sorry about the trouble.
[Bug tree-optimization/93734] [8/9/10 Regression] Invalid code generated with -O2 -march=haswell -ftree-vectorize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93734 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #3 from Alexander Monakov --- I tried to make an equivalent C testcase, but complex ops don't map 1:1 from Fortran, so it's a bit difficult. Nevertheless, here's a somewhat similar testcase that aborts on 8/9 and works on trunk, though the IR and resulting assembly look quite different (needs -O2 -ftree-vectorize -mfma -fcx-limited-range):

__attribute__((noipa)) static _Complex double
test(_Complex double * __restrict a, _Complex double * __restrict x,
     _Complex double t, long jx)
{
    long i, j;
    for (j = 6, i = 3; i >= 0; i--, j -= jx)
        x[j] -= t * a[i];
    return x[4];
}

int main()
{
    _Complex double a[5] = {1, 1, 1, 1, 10};
    _Complex double x[9] = {1, 1, 1, 1, 1, 1, 1, 1, 1};
    if (test(a, x, 1, 2))
        __builtin_abort();
}
[Bug rtl-optimization/88879] [9 Regression] ICE in sel_target_adjust_priority, at sel-sched.c:3332
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88879 Alexander Monakov changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #18 from Alexander Monakov --- Fixed.
[Bug rtl-optimization/93743] [9/10 Regression] swapped arguments in atan2l
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93743 Alexander Monakov changed: What|Removed |Added Target||i?86-*-*, x86_64-*-* Status|UNCONFIRMED |NEW Keywords||wrong-code Last reconfirmed||2020-02-14 Component|c |rtl-optimization CC||amonakov at gcc dot gnu.org Ever confirmed|0 |1 Summary|swapped arguments in atan2l |[9/10 Regression] swapped arguments in atan2l --- Comment #1 from Alexander Monakov ---

Minimal testcase (needs -ffast-math or -Ofast):

long double f(long double y, long double x)
{
    return __builtin_atan2l(y, x);
}

reg-stack seems to get the order of x87 stack pushes wrong.

Correct output in gcc-8:

        fldt    8(%rsp)
        fldt    24(%rsp)
        fpatan
        ret

Wrong output in gcc-9/trunk:

        fldt    24(%rsp)
        fldt    8(%rsp)
        fpatan
        ret
[Bug rtl-optimization/93743] [9/10 Regression] swapped arguments in atan2l
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93743 Alexander Monakov changed: What|Removed |Added CC||uros at gcc dot gnu.org Component|target |rtl-optimization Target Milestone|9.3 |--- --- Comment #2 from Alexander Monakov --- Looks related to Uros' svn r264648.
[Bug tree-optimization/93745] Redundant store not eliminated with intermediate instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93745 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #1 from Alexander Monakov --- The store cannot be eliminated under GCC's memory model, where stores through a pointer can change the dynamic type of the storage. The extra load you mention can be optimized because, if *p and d overlapped, an attempt to load the stored 'double' value via an 'int *' would have invoked undefined behavior.
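A hedged sketch of both halves of that reasoning (invented names, not necessarily the PR's testcase):

/* Under GCC's memory model the second store cannot be removed:
   if p and d alias, '*d = 2.0' changes the dynamic type of the
   storage, and the final store re-establishes it as int.  */
void stores(int *p, double *d)
{
    *p = 1;
    *d = 2.0;
    *p = 1;    /* not redundant */
}

/* A load, by contrast, may be CSEd across the double store: if p
   and d overlapped, reading the stored double through an 'int *'
   would be undefined, so the compiler may assume no overlap.  */
int loads(int *p, double *d)
{
    int a = *p;
    *d = 2.0;
    return a + *p;   /* *p may be folded to 'a' */
}

The asymmetry is deliberate: stores may install a new dynamic type, while loads must respect the type of what was stored.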
[Bug middle-end/93744] [8/9/10 Regression] Different results between gcc-9 and gcc-7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93744 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #4 from Alexander Monakov --- Note that the pattern will get the sign of floating-point zero wrong if it ever triggers for an fp type. It is also too specific: it at least misses eq/ne comparisons, and more generally it doesn't need to be tied to comparisons in the first place, since any multiplication by a boolean value can be converted to a select.
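A minimal illustration of the signed-zero hazard (a hypothetical fold, not the PR's testcase): for negative x and a false condition, the multiplication yields -0.0 while the select form yields +0.0.

#include <stdio.h>

int main(void)
{
    double x = -3.0;
    int c = 0;                   /* a comparison that came out false */
    double mul = x * c;          /* -3.0 * 0.0 == -0.0 */
    double sel = c ? x : 0.0;    /* select form gives +0.0 */
    printf("%g %g\n", mul, sel); /* prints "-0 0" */
    return 0;
}

So the multiply-to-select rewrite is only sound for integer types, or for fp under flags that license ignoring the sign of zero.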
[Bug tree-optimization/93745] Redundant store not eliminated with intermediate instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93745 --- Comment #4 from Alexander Monakov --- Placement new is translated to a plain pointer assignment on GIMPLE, so optimizers cannot distinguish programs that had placement new from programs that did not. (In C we need memory from malloc to be reusable, so imagine that instead of 'double d' the example had a store via a 'double *'.)
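A short sketch of that C analogy (not the PR's testcase): effective-type rules require malloc'ed storage to be reusable, so a store must be allowed to change the dynamic type of the storage, exactly as placement new does in C++.

#include <stdlib.h>

int main(void)
{
    void *p = malloc(sizeof(double));
    if (!p)
        return 1;
    *(double *)p = 1.0;   /* storage now has effective type double */
    *(int *)p = 2;        /* valid: the store changes it to int */
    free(p);
    return 0;
}

Since GIMPLE sees the same pointer assignments in both languages, the optimizer must honor the type-changing store in C++ as well.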
[Bug gcov-profile/93623] No need to dump gcdas when forking
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93623 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #5 from Alexander Monakov ---

(In reply to calixte from comment #2)
> I think the reset is useless in the case of exec** functions since the
> counters are lost when an exec** is called. So it can probably be removed
> too.

exec can fail; resetting only after an (unsuccessful) exec may be ok, but eliding the reset entirely does not seem so.
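A sketch of why (hypothetical wrapper; gcov's real hooks differ): exec replaces the process image only on success, so on failure the instrumented code keeps running and would double-count anything recorded before the dump.

#include <unistd.h>

/* What an instrumented program effectively does around exec.  If
   the counters were dumped but not reset, the failure path below
   would keep accumulating on top of already-dumped values.  */
void run(char *const argv[])
{
    /* ... profiled work, counters accumulate ... */
    /* gcov dumps (and currently resets) counters here */
    execv(argv[0], argv);
    /* reached only if execv failed: this code is still
       instrumented and still counting, so the reset must have
       happened (or must happen now) */
    _exit(127);
}

Hence deferring the reset to an exec-failure path is defensible, but dropping it outright is not.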
[Bug c/93848] missing -Warray-bounds warning for array subscript 1 is outside array bounds
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93848 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #3 from Alexander Monakov --- Note that 6.5.6 takes care to allow an unevaluated * operator:

  If the result points one past the last element of the array object, it
  shall not be used as the operand of a unary * operator that is evaluated.

So, for example, there's no UB in:

void bar_aux (int *);

void foo (void)
{
    int i;
    int *p = &i;
    bar_aux (&p[1]);
}

In your example with 'bar', the formal evaluation in the expression 'p[1]' does not create a copy of the array; it simply strips off one array dimension in the pointed-to type. So I am pretty sure it was not the intention of the standard to make that undefined. Perhaps the standard could be edited to make that clearer, but there's no need to issue a warning here.
[Bug target/93934] Unnecessary fld of uninitialized float stack variable results in ub of valid C++ code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93934 Alexander Monakov changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED CC||amonakov at gcc dot gnu.org Resolution|--- |INVALID --- Comment #2 from Alexander Monakov --- fcmov can only raise an x87 fpu exception on x87 stack underflow, which cannot happen here. Even if it did raise FE_INVALID for SNaNs, note that GCC does not support SNaNs by default; -fsignaling-nans can be specified to request that, but the documentation says the support is incomplete. No bug here afaict.
[Bug target/93934] Unnecessary fld of uninitialized float stack variable results in ub of valid C++ code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93934 --- Comment #5 from Alexander Monakov --- Ah, indeed. fld won't raise FE_INVALID for 80-bit long double, but here 'result' is stored on the stack in 64-bit format. So: fcmov and 80-bit fldt don't trap, while 32-bit flds and 64-bit fldl do. Somehow RTL if-conversion would have to check "-fsignaling-nans is requested and the target may raise FE_INVALID on loads", among other reasons to reject a speculative load. I am afraid, though, that several other optimizations do not anticipate that x87 fp loads can raise exceptions on SNaNs either, making -fsignaling-nans difficult to implement in full.
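If useful, a hedged demonstration of that distinction (x86-specific; build at -O0 so the conversion isn't constant-folded, and link with -lm if your libc needs it): converting a 64-bit SNaN bit pattern to long double goes through an x87 fldl, which sets FE_INVALID.

#include <fenv.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint64_t bits = 0x7ff0000000000001ULL;  /* one double SNaN pattern */
    double d;
    memcpy(&d, &bits, sizeof d);            /* place SNaN bits in a double */
    feclearexcept(FE_ALL_EXCEPT);
    volatile long double ld = d;            /* fldl: converts to 80-bit,
                                               quiets the NaN, raises IE */
    (void) ld;
    printf("FE_INVALID %s\n",
           fetestexcept(FE_INVALID) ? "raised" : "not raised");
    return 0;
}

An 80-bit fldt of the same value, by contrast, is a raw copy and leaves the status word untouched, which is exactly why speculating the narrower loads is the problematic case.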
[Bug target/93934] Unnecessary fld of uninitialized float stack variable results in ub of valid C++ code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93934 --- Comment #8 from Alexander Monakov --- I think regstack is fine, as x87 only supports computations in its native 80-bit format and conversions to/from ieee float/double happen only on memory loads/stores.

> I suppose a fldt followed by "truncation" to 32/64 bit would then trap at
> the truncation step?

Such "truncation" can only be implemented via a spill/reload on x87, so, yes.

> We'd have to mark all loads from not must-initialized memory as possibly
> trapping and thus not eligible for if-conversion.

(except long double)

> And this applies to possibly uninitialized registers as well which might
> be spilled or allocated to the stack.

Ideally registers should always be spilled in their native 80-bit format, for which the problem does not arise. For C with -fexcess-precision=standard this should already be the case.
[Bug middle-end/56077] [4.6/4.7/4.8 Regression] volatile ignored when function inlined
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56077 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #8 from Alexander Monakov 2013-02-04 17:25:05 UTC --- The difference in behaviour is due to this change in sched_analyze_insn, inside "if (reg_pending_barrier)":

+      /* Flush pending lists on jumps, but not on speculative checks.  */
+      if (JUMP_P (insn) && !(sel_sched_p ()
+                             && sel_insn_is_speculation_check (insn)))
         flush_pending_lists (deps, insn, true, true);

The "JUMP_P (insn) && " part in the condition seems to be an unintended change.
[Bug target/56200] queens benchmark is faster with -O0 than with any other optimization level
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56200 --- Comment #2 from Alexander Monakov 2013-02-04 21:36:38 UTC ---

(In reply to comment #1)
> What happens if you also use -fno-ivopts ?

For me, -fno-ivopts gives a small improvement, but it is still slower than -O0. I think the slowdown is related to code layout in the I-cache and branch predictors. There is a hot region composed of three consecutive conditional branches (cmp-jg-cmp-jg-cmp-jg in optimized code and mov-cmp-jl-mov-cmp-jl-mov-cmp-jl at -O0). If I align the first _and_ the second to a 16-byte boundary, I get better performance than -O0, but aligning only one of them is still slower than -O0:

--- o1.s        2013-02-05 00:04:44.405072150 +0400
+++ o1h.s       2013-02-05 01:17:43.648014420 +0400
@@ -119,9 +119,11 @@ find:
        movq    %rdx, %rbp
        leal    1(%r14), %eax
        movl    %eax, 12(%rsp)
+       .p2align 4,,7
 .L18:
        cmpl    file(%r12), %r14d
        jg      .L17
+       .p2align 4,,7
        cmpl    (%r15,%r12), %r14d
        jg      .L17
        cmpl    (%rbx), %r14d
[Bug target/56200] queens benchmark is faster with -O0 than with any other optimization level
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56200 Alexander Monakov changed: What|Removed |Added CC||hjl.tools at gmail dot com, ||ubizjak at gmail dot com --- Comment #4 from Alexander Monakov 2013-02-05 09:46:13 UTC --- The need for the first alignment is clear: it aligns the loop to a 16-byte boundary, and gcc does set that alignment at -O2. Uros, H.J., any idea why separating the first conditional jump from the rest by additional alignment is crucial for performance in this case? Is there anything that can be improved in GCC here?
[Bug sanitizer/56393] SIGSEGV when -fsanitize=address and dynamic lib with global objects
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56393 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #14 from Alexander Monakov 2013-02-21 10:54:13 UTC ---

(In reply to comment #13)
> We've got this problem on Android, where an instrumented JNI library is
> loaded into Dalvik VM, which is outside of user control. We "solve" it by
> requiring that the runtime library is LD_PRELOAD-ed into the DVM (Android
> has a mechanism to do this on an individual app basis on rooted devices).

OT, but what is this mechanism you speak of? Currently this bug is the top Google hit for "Dalvik sanitizer LD_PRELOAD", and I don't see how it might work if the VM only forks, not execs.
[Bug c/56507] GCC -march=native for Core2Duo
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56507 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org Resolution|INVALID |DUPLICATE --- Comment #5 from Alexander Monakov 2013-03-04 09:29:32 UTC --- Looks like a duplicate of PR 39851 then. *** This bug has been marked as a duplicate of bug 39851 ***
[Bug other/39851] gcc -Q --help=target does not list extensions selected by -march=
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39851 Alexander Monakov changed: What|Removed |Added CC||bratsinot at gmail dot com --- Comment #4 from Alexander Monakov 2013-03-04 09:29:32 UTC --- *** Bug 56507 has been marked as a duplicate of this bug. ***
[Bug tree-optimization/53265] Warn when undefined behavior implies smaller iteration count
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53265 --- Comment #10 from Alexander Monakov 2013-03-11 16:15:36 UTC ---

(In reply to comment #8)
> Not sure about the warning wording

What about (... "iteration %E invokes undefined behavior", max)?

> plus no idea how to call the warning option (-Wnum-loop-iterations,
> -Wundefined-behavior-in-loop, something else?)

Can it be -Waggressive-loop-optimizations, following the existing pairs -{W,fno-}strict-{aliasing,overflow}, for the recently added -fno-aggressive-loop-optimizations?