[llvm-bugs] [Bug 40678] New: llvm-config generates options that are incompatible with clang++
https://bugs.llvm.org/show_bug.cgi?id=40678 Bug ID: 40678 Summary: llvm-config generates options that are incompatible with clang++ Product: clang Version: 7.0 Hardware: PC OS: Linux Status: NEW Severity: normal Priority: P Component: Driver Assignee: unassignedclangb...@nondot.org Reporter: michael.finn.jorgen...@gmail.com CC: llvm-bugs@lists.llvm.org, neeil...@live.com, richard-l...@metafoo.co.uk Created attachment 21460 --> https://bugs.llvm.org/attachment.cgi?id=21460&action=edit Output from "llvm-config --cxxflags" I have a minimal C++ file (named test.cc), and I try to compile it with clang++ and llvm-config. I run the command: clang++ `llvm-config --cxxflags` test.cc This command fails with the error: clang-7: error: unknown argument: '-fstack-clash-protection' If I run "llvm-config --cxxflags" alone I see that the problematic option is indeed part of the output. The full output is attached. I believe this is an error, and that the options generated by the llvm-config command should all be accepted by the clang++ command. Additional information: I can compile the given file test.cc without problems if I don't invoke llvm-config, i.e. the following command works without error: clang++ test.cc The versions I'm using are: 7.0.1 for both clang++ and llvm-config. I'm using Fedora 29 (64-bit). P.S. This is all part of a large project (not my own) on github (https://github.com/ghdl/ghdl). -- You are receiving this mail because: You are on the CC list for the bug.___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] [Bug 40679] New: Merge r351322 into the 8.0 branch : [MSan] Apply the ctor creation scheme of TSan
https://bugs.llvm.org/show_bug.cgi?id=40679 Bug ID: 40679 Summary: Merge r351322 into the 8.0 branch : [MSan] Apply the ctor creation scheme of TSan Product: new-bugs Version: 8.0 Hardware: All OS: All Status: NEW Severity: enhancement Priority: P Component: new bugs Assignee: unassignedb...@nondot.org Reporter: philip.pfa...@gmail.com CC: htmldevelo...@gmail.com, llvm-bugs@lists.llvm.org Blocks: 40331 Is it OK to merge the following revision(s) to the 8.0 branch? Referenced Bugs: https://bugs.llvm.org/show_bug.cgi?id=40331 [Bug 40331] [meta] 8.0.0 Release Blockers -- You are receiving this mail because: You are on the CC list for the bug.___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] Issue 10250 in oss-fuzz: llvm: Build failure
Comment #36 on issue 10250 by ClusterFuzz-External: llvm: Build failure https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=10250#c36 Friendly reminder that the build is still failing. Please try to fix this failure to ensure that fuzzing remains productive. Latest build log: https://oss-fuzz-build-logs.storage.googleapis.com/log-e51246c5-5dd0-434c-bc0f-00e151c28048.txt -- You received this message because: 1. You were specifically CC'd on the issue You may adjust your notification preferences at: https://bugs.chromium.org/hosting/settings Reply to this email to add a comment. ___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] [Bug 30690] error in backend: Cannot select: masked_gather
https://bugs.llvm.org/show_bug.cgi?id=30690 Simon Pilgrim changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #7 from Simon Pilgrim --- Fixed in trunk -- You are receiving this mail because: You are on the CC list for the bug.___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] [Bug 40657] IR programs printing incorrect results after being compiled with -O0
https://bugs.llvm.org/show_bug.cgi?id=40657 Sanjay Patel changed: What|Removed |Added Status|CONFIRMED |RESOLVED Fixed By Commit(s)||r353615, r353639 Resolution|--- |FIXED --- Comment #10 from Sanjay Patel ---

All attached examples should be fixed after: https://reviews.llvm.org/rL353639
Feel free to reopen if that's not correct.

Side note about fuzzing for LLVM bugs: If you're looking for backend bugs, it might be worth trying something like this:

$ clang -O2 test.c -S -emit-llvm -Xclang -disable-llvm-optzns -o unoptimized_ir.ll
$ llc -O2 unoptimized_ir.ll -o unoptimized_ir_optimized_asm.s
$ clang unoptimized_ir_optimized_asm.s -o maybe_buggy_executable
$ clang test.c -o reference_executable
$ { compare output of } reference_executable maybe_buggy_executable

A lot of people get that 1st step wrong: you want clang to create an IR file that allows optimization by the backend, but skip intermediate optimization. That doesn't happen if you use "-O0 -emit-llvm" with clang.

-- You are receiving this mail because: You are on the CC list for the bug.___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] [Bug 38971] [X86] Reductions should use smaller vector types later on
https://bugs.llvm.org/show_bug.cgi?id=38971 Sanjay Patel changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED Fixed By Commit(s)||r353641 --- Comment #6 from Sanjay Patel --- We should get the ideal output for this example with or without -fast-hops after: https://reviews.llvm.org/rL353641 https://reviews.llvm.org/D57841 -- You are receiving this mail because: You are on the CC list for the bug.___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] [Bug 40680] New: clang: error: unable to execute command: Segmentation fault when compiling Apache Qpid Proton
https://bugs.llvm.org/show_bug.cgi?id=40680 Bug ID: 40680 Summary: clang: error: unable to execute command: Segmentation fault when compiling Apache Qpid Proton Product: clang Version: trunk Hardware: PC OS: Linux Status: NEW Severity: release blocker Priority: P Component: -New Bugs Assignee: unassignedclangb...@nondot.org Reporter: jda...@redhat.com CC: htmldevelo...@gmail.com, llvm-bugs@lists.llvm.org, neeil...@live.com, richard-l...@metafoo.co.uk Created attachment 21461 --> https://bugs.llvm.org/attachment.cgi?id=21461&action=edit /tmp/bitmask-079a44.c -- Build files have been written to: /home/buildadm/qpid-dispatch/build + CC='clang-9 -fsanitize=thread' + CXX='clang++-9 -fsanitize=thread' + LDFLAGS=-fsanitize=thread + ninja -v install [1/90] cd /home/buildadm/qpid-dispatch/build/src && /usr/bin/python /home/buildadm/qpid-dispatch/build/tests/run.py -s /home/buildadm/qpid-dispatch/src/schema_c.py [2/90] /usr/bin/clang-9 -fsanitize=thread -Dqpid_dispatch_EXPORTS -I../include -Iinclude -I/home/buildadm/qpid-proton/build/install/include -I/usr/include/python2.7 -I../src -I../src/router_core -Isrc -fsanitize=thread -O2 -g -DNDEBUG -fPIC -g -fno-omit-frame-pointer -Werror -Wall -Wpedantic -std=gnu99 -pthread -Wno-gnu-statement-expression -MD -MT src/CMakeFiles/qpid-dispatch.dir/bitmask.c.o -MF src/CMakeFiles/qpid-dispatch.dir/bitmask.c.o.d -o src/CMakeFiles/qpid-dispatch.dir/bitmask.c.o -c /home/buildadm/qpid-dispatch/src/bitmask.c FAILED: src/CMakeFiles/qpid-dispatch.dir/bitmask.c.o /usr/bin/clang-9 -fsanitize=thread -Dqpid_dispatch_EXPORTS -I../include -Iinclude -I/home/buildadm/qpid-proton/build/install/include -I/usr/include/python2.7 -I../src -I../src/router_core -Isrc -fsanitize=thread -O2 -g -DNDEBUG -fPIC -g -fno-omit-frame-pointer -Werror -Wall -Wpedantic -std=gnu99 -pthread -Wno-gnu-statement-expression -MD -MT src/CMakeFiles/qpid-dispatch.dir/bitmask.c.o -MF src/CMakeFiles/qpid-dispatch.dir/bitmask.c.o.d -o src/CMakeFiles/qpid-dispatch.dir/bitmask.c.o 
-c /home/buildadm/qpid-dispatch/src/bitmask.c Stack dump: 0. Program arguments: /usr/lib/llvm-9/bin/clang -cc1 -triple x86_64-pc-linux-gnu -emit-obj -disable-free -disable-llvm-verifier -discard-value-names -main-file-name bitmask.c -mrelocation-model pic -pic-level 2 -mthread-model posix -mdisable-fp-elim -fmath-errno -masm-verbose -mconstructor-aliases -munwind-tables -fuse-init-array -target-cpu x86-64 -dwarf-column-info -debug-info-kind=limited -dwarf-version=4 -debugger-tuning=gdb -momit-leaf-frame-pointer -coverage-notes-file /home/buildadm/qpid-dispatch/build/src/CMakeFiles/qpid-dispatch.dir/bitmask.c.gcno -resource-dir /usr/lib/llvm-9/lib/clang/9.0.0 -dependency-file src/CMakeFiles/qpid-dispatch.dir/bitmask.c.o.d -sys-header-deps -MT src/CMakeFiles/qpid-dispatch.dir/bitmask.c.o -D qpid_dispatch_EXPORTS -I ../include -I include -I /home/buildadm/qpid-proton/build/install/include -I /usr/include/python2.7 -I ../src -I ../src/router_core -I src -D NDEBUG -internal-isystem /usr/local/include -internal-isystem /usr/lib/llvm-9/lib/clang/9.0.0/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -O2 -Werror -Wall -Wpedantic -Wno-gnu-statement-expression -std=gnu99 -fdebug-compilation-dir /home/buildadm/qpid-dispatch/build -ferror-limit 19 -fmessage-length 0 -fsanitize=thread -pthread -fobjc-runtime=gcc -fdiagnostics-show-option -vectorize-loops -vectorize-slp -o src/CMakeFiles/qpid-dispatch.dir/bitmask.c.o -x c /home/buildadm/qpid-dispatch/src/bitmask.c -faddrsig 1. 
clang: error: unable to execute command: Segmentation fault clang: error: clang frontend command failed due to signal (use -v to see invocation) clang version 9.0.0-svn353471-1~exp1+0~20190207214556.1054~1.gbp77e1bc (trunk) Target: x86_64-pc-linux-gnu Thread model: posix InstalledDir: /usr/bin clang: note: diagnostic msg: PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script. clang: note: diagnostic msg: PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT: Preprocessed source(s) and associated run script(s) are located at: clang: note: diagnostic msg: /tmp/bitmask-079a44.c clang: note: diagnostic msg: /tmp/bitmask-079a44.sh clang: note: diagnostic msg: 2019/02/10 16:21:20 Clang crash detected [3/90] /usr/bin/clang-9 -fsanitize=thread -Dqpid_dispatch_EXPORTS -I../include -Iinclude -I/home/buildadm/qpid-proton/build/install/include -I/usr/include/python2.7 -I../src -I../src/router_core -Isrc -fsanitize=thread -O2 -g -DNDEBUG -fPIC -g -fno-omit-frame-pointer -Werror -Wall -Wpedantic -std=gnu99 -pthread -Wno-gnu-statement-expression -MD -MT src/CMakeFiles/qpid-dispatch
[llvm-bugs] [Bug 40681] New: [X86] LLVM 7.0.x optimises out variable init at -O1
https://bugs.llvm.org/show_bug.cgi?id=40681 Bug ID: 40681 Summary: [X86] LLVM 7.0.x optimises out variable init at -O1 Product: libraries Version: 7.0 Hardware: PC OS: All Status: NEW Severity: enhancement Priority: P Component: Backend: X86 Assignee: unassignedb...@nondot.org Reporter: vit9...@avp.su CC: craig.top...@gmail.com, llvm-bugs@lists.llvm.org, llvm-...@redking.me.uk, spatel+l...@rotateright.com Created attachment 21463 --> https://bugs.llvm.org/attachment.cgi?id=21463&action=edit Test C file

LLVM 7.0 generates invalid code: it optimises out variable zeroing for 32-bit X86 at -O1 or higher when sanitizers are enabled. I was able to reproduce the issue with AddressSanitizer or UndefinedBehaviorSanitizer enabled, yet I believe they are just the trigger point. The IR looks fine, so most likely the issue lies in LLVM itself.

The bug is not reproducible on LLVM 8.0 or trunk. If the LLVM 7.1 release is abandoned, this should be closed; otherwise I believe it should be a release blocker.

A test example is provided in the attachment: both the C file and the generated .S file.

clang -S -c -target i386-gnu-linux -march=pentium2 -pipe -nostdinc -fno-asynchronous-unwind-tables -O1 -fno-builtin -I. -fno-omit-frame-pointer -m32 -fno-stack-protector -fsanitize=address -c d.c -o d.S

Relevant comments for generated asm:

pushl %esi
...
# implicit-def: $esi ; allocates r temporary in %esi, which is filled with random data
...
movl %esi, -16(%ebp)
...
calll func1
testl %eax, %eax
movl -16(%ebp), %ecx ; writes random data to %ecx
cmovsl %eax, %ecx ; if (%eax < 0) %ecx = %eax
movl %ecx, -16(%ebp) ; %ecx is returned back to stack
...
jns .LBB0_11 → if (%eax < 0) goto 11
jmp .LBB0_19
...
.LBB0_19:
...
movl -16(%ebp), %eax ; function returns random data when func1 returns >= 0
...
ret

-- You are receiving this mail because: You are on the CC list for the bug.___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] [Bug 40682] New: clang++ miscompilation while generating copy constructors of classes containing 0-length array
https://bugs.llvm.org/show_bug.cgi?id=40682 Bug ID: 40682 Summary: clang++ miscompilation while generating copy constructors of classes containing 0-length array Product: clang Version: trunk Hardware: PC OS: All Status: NEW Severity: normal Priority: P Component: C++ Assignee: unassignedclangb...@nondot.org Reporter: joran.biga...@gmail.com CC: blitzrak...@gmail.com, dgre...@apple.com, erik.pilking...@gmail.com, llvm-bugs@lists.llvm.org, richard-l...@metafoo.co.uk

The auto-generated copy constructor of a struct containing a 0-length array followed by exactly 1 trivially copyable field will not copy the latter. The same holds for copy assignment.

Here is a simple example:

===
#include <stdio.h>

struct NonTrivial {
    int n;
    NonTrivial& operator=(NonTrivial o) { this->n = o.n; return *this; }
};

struct S {
    NonTrivial _a; // to force clang to generate a copy assignment operator
    int ok;        // not mandatory, only to show other trivial fields are still copied
    int _b[0];
    int bugged;    // not copied by the auto-generated copy assignment operator, unless directly followed by another non-trivial field
};

int main() {
    S foo;
    foo.ok = 11;
    foo.bugged = 22;
    S bar;
    bar.ok = 9876;
    bar.bugged = 4321;
    bar = foo;
    printf("%d %d ; %d %d\n", foo.ok, foo.bugged, bar.ok, bar.bugged);
    // expected: 11 22 ; 11 22
    // output:   11 22 ; 11 4321
    return 0;
}
===

-- You are receiving this mail because: You are on the CC list for the bug.___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] Issue 11097 in oss-fuzz: llvm/llvm-isel-fuzzer--x86_64-O2: Timeout in llvm_llvm-isel-fuzzer--x86_64-O2
Updates: Labels: -Reproducible Unreproducible Comment #6 on issue 11097 by ClusterFuzz-External: llvm/llvm-isel-fuzzer--x86_64-O2: Timeout in llvm_llvm-isel-fuzzer--x86_64-O2 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=11097#c6 ClusterFuzz testcase 5642269969874944 appears to be flaky, updating reproducibility label. -- You received this message because: 1. You were specifically CC'd on the issue You may adjust your notification preferences at: https://bugs.chromium.org/hosting/settings Reply to this email to add a comment. ___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] [Bug 40681] [X86] LLVM 7.0.x optimises out variable init at -O1
https://bugs.llvm.org/show_bug.cgi?id=40681 vit9696 changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |INVALID --- Comment #3 from vit9696 --- Thanks for that catch and sorry. After a subsequent review, I discovered that the issue was lost during the minimisation. However, a second attempt to minimise it actually showed that the issue only exists in our local copy, and not in upstream. Closing this as invalid. -- You are receiving this mail because: You are on the CC list for the bug.___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] Issue 13044 in oss-fuzz: llvm/clang-fuzzer: Stack-overflow in clang::Parser::ParseConstantExpressionInExprEvalContext
Status: New Owner: CC: k...@google.com, masc...@google.com, jdevlieg...@apple.com, igm...@gmail.com, mit...@google.com, eney...@google.com, llvm-b...@lists.llvm.org, j...@chromium.org, v...@apple.com, mitchphi...@outlook.com, xpl...@gmail.com, akils...@apple.com Labels: ClusterFuzz Stability-Memory-AddressSanitizer Reproducible Engine-libfuzzer Proj-llvm Reported-2019-02-11 Type: Bug New issue 13044 by ClusterFuzz-External: llvm/clang-fuzzer: Stack-overflow in clang::Parser::ParseConstantExpressionInExprEvalContext https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13044 Detailed report: https://oss-fuzz.com/testcase?key=5687174530334720 Project: llvm Fuzzer: libFuzzer_llvm_clang-fuzzer Fuzz target binary: clang-fuzzer Job Type: libfuzzer_asan_llvm Platform Id: linux Crash Type: Stack-overflow Crash Address: 0x7ffc296c8f80 Crash State: clang::Parser::ParseConstantExpressionInExprEvalContext clang::Parser::ParseTemplateArgument clang::Parser::ParseTemplateArgumentList Sanitizer: address (ASAN) Reproducer Testcase: https://oss-fuzz.com/download?testcase_id=5687174530334720 Issue filed automatically. See https://github.com/google/oss-fuzz/blob/master/docs/reproducing.md for instructions to reproduce this bug locally. When you fix this bug, please * mention the fix revision(s). * state whether the bug was a short-lived regression or an old bug in any stable releases. * add any other useful information. This information can help downstream consumers. If you need to contact the OSS-Fuzz team with a question, concern, or any other feedback, please file an issue at https://github.com/google/oss-fuzz/issues. -- You received this message because: 1. You were specifically CC'd on the issue You may adjust your notification preferences at: https://bugs.chromium.org/hosting/settings Reply to this email to add a comment. ___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] [Bug 40683] New: [OptimizePHIs + Expensive Checks] Bad machine code: Virtual register killed in block, but needed live out.
https://bugs.llvm.org/show_bug.cgi?id=40683 Bug ID: 40683 Summary: [OptimizePHIs + Expensive Checks] Bad machine code: Virtual register killed in block, but needed live out. Product: libraries Version: trunk Hardware: PC OS: Linux Status: NEW Severity: enhancement Priority: P Component: Common Code Generator Code Assignee: unassignedb...@nondot.org Reporter: pauls...@linux.vnet.ibm.com CC: llvm-bugs@lists.llvm.org Created attachment 21465 --> https://bugs.llvm.org/attachment.cgi?id=21465&action=edit reduced testcase

bin/llc -mcpu=z10 -O3 tc_vregkill_liveout.ll -o -

*** Bad machine code: Virtual register killed in block, but needed live out. ***
- function:main
- basic block: %bb.2 (0x2aa66bceb88)
Virtual register %1 is used after the block.
LLVM ERROR: Found 1 machine code errors.

It seems that "Optimize machine instruction PHIs" is transforming this block, which is inside a loop:

bb.2 (%ir-block.3):
; predecessors: %bb.1, %bb.3
  successors: %bb.4(0x4000), %bb.5(0x4000); %bb.4(50.00%), %bb.5(50.00%)
  %3:gr32bit = PHI %1:gr32bit, %bb.1, %0:gr32bit, %bb.3
  %4:gr32bit = IMPLICIT_DEF
  CR killed %3:gr32bit, %4:gr32bit, implicit-def $cc
  %5:gr32bit = LHI 0
  %6:gr32bit = LHI -10
  BRC 14, 4, %bb.4, implicit $cc

to

bb.2 (%ir-block.3):
; predecessors: %bb.1, %bb.3
  successors: %bb.4(0x4000), %bb.5(0x4000); %bb.4(50.00%), %bb.5(50.00%)
  %4:gr32bit = IMPLICIT_DEF
  CR killed %1:gr32bit, %4:gr32bit, implicit-def $cc
  %5:gr32bit = LHI 0
  %6:gr32bit = LHI -10
  BRC 14, 4, %bb.4, implicit $cc

Both %1 and %0 are defined outside the loop, and %0 is a copy of %1. %1 should not have the kill flag since it is used in the next iteration as well.

-- You are receiving this mail because: You are on the CC list for the bug.___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] [Bug 40684] New: Merge r353656 to the 8.0 branch
https://bugs.llvm.org/show_bug.cgi?id=40684 Bug ID: 40684 Summary: Merge r353656 to the 8.0 branch Product: clang Version: 8.0 Hardware: Macintosh OS: OpenBSD Status: NEW Severity: normal Priority: P Component: LLVM Codegen Assignee: unassignedclangb...@nondot.org Reporter: b...@comstyle.com CC: llvm-bugs@lists.llvm.org, neeil...@live.com, richard-l...@metafoo.co.uk Blocks: 40331 Merge r353656 back to the 8.0 branch. OpenBSD/NetBSD/PPC diff for proper long double setting. Referenced Bugs: https://bugs.llvm.org/show_bug.cgi?id=40331 [Bug 40331] [meta] 8.0.0 Release Blockers -- You are receiving this mail because: You are on the CC list for the bug.___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] [Bug 40553] wrong sizeof(long double) in 32-bit PowerPC NetBSD, OpenBSD
https://bugs.llvm.org/show_bug.cgi?id=40553 Brad Smith changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED CC||b...@comstyle.com --- Comment #1 from Brad Smith --- Committed r353656. -- You are receiving this mail because: You are on the CC list for the bug.___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
[llvm-bugs] [Bug 40685] New: [5, 6, 7, 8, 9 regression] auto-vectorization unpacks, repacks, and unpacks to 32-bit again for count += (bool_arr[i]==0) for boolean array, using 3x the shuffles needed
https://bugs.llvm.org/show_bug.cgi?id=40685 Bug ID: 40685 Summary: [5,6,7,8,9 regression] auto-vectorization unpacks, repacks, and unpacks to 32-bit again for count += (bool_arr[i]==0) for boolean array, using 3x the shuffles needed Product: new-bugs Version: trunk Hardware: PC OS: Linux Status: NEW Keywords: performance, regression Severity: enhancement Priority: P Component: new bugs Assignee: unassignedb...@nondot.org Reporter: pe...@cordes.ca CC: htmldevelo...@gmail.com, llvm-bugs@lists.llvm.org

int count(const bool *visited, int len) {
    int counter = 0;
    for(int i=0;i<100;i++) { // len unused or not doesn't matter
        if (visited[i]==0)
            counter++;
    }
    return counter;
}

(adapted from: https://stackoverflow.com/questions/54618685/what-is-the-meaning-use-of-the-movzx-cdqe-instructions-in-this-code-output-by-a)

I expected compilers not to notice that byte elements wouldn't overflow (and make code that unpacks to dword inside the loop), and probably to fail to use psadbw to hsum bytes inside a loop. (ICC does that, gcc and MSVC just go scalar.)

But I didn't expect clang to pack back down to bytes with pshufb after PXOR, before redoing the expansion to dword with another PMOVZX. (This is a regression from clang4.0.1)

https://godbolt.org/z/1SEmTu

# clang version 9.0.0 (trunk 353629) on Godbolt
# -O3 -Wall -march=haswell -fno-unroll-loops -mno-avx
count(bool const*, int):
        pxor     xmm0, xmm0
        xor      eax, eax
        movdqa   xmm1, xmmword ptr [rip + .LCPI0_0] # xmm1 = [1,1,1,1]
        movdqa   xmm2, xmmword ptr [rip + .LCPI0_1] # xmm2 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        pmovzxbd xmm3, dword ptr [rdi + rax] # xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
        pxor     xmm3, xmm1
        pshufb   xmm3, xmm2
        pmovzxbd xmm3, xmm3 # xmm3 = xmm3[0],zero,zero,zero,xmm3[1],zero,zero,zero,xmm3[2],zero,zero,zero,xmm3[3],zero,zero,zero
        paddd    xmm0, xmm3
        add      rax, 4
        cmp      rax, 100
        jne      .LBB0_1
        ... horizontal sum

Unrolling just repeats this pattern.

-march=haswell -mno-avx is basically the same. -march=haswell *with* AVX2 does slightly better, only unpacking to 16-bit elements in an XMM before repacking, otherwise it would have needed a lane-crossing byte shuffle to pack back to bytes for vpmovzxbd ymm, xmm.

So it looks like something really wants to fill up a whole XMM before flipping bits with PXOR, instead of just flipping packed bits in an XMM with high garbage. If you're going to unpack though, you might as well just flip unpacked booleans so you can load with pmovzx. movd + pxor would be worse, especially on CPUs other than Intel SnB-family where an indexed addressing mode for pmovzx saves front-end bandwidth vs. a separate load.

The pshufb + 2nd pmovzxbd can literally be removed with zero change to the result, because xmm1 = set1_epi32(1).

        pmovzxbd xmm3, dword ptr [rdi + rax]    ; un-laminates on SnB including HSW/SKL
        pxor     xmm3, xmm1
        paddd    xmm0, xmm3

Of course, avoiding an indexed addressing mode would also be a good thing when tuning for Haswell. Clang/LLVM still use indexed for -march=haswell, costing an extra uop from un-lamination (pmovzx destination is write-only, so it always unlaminates an indexed addressing mode. vpmovzx can't micro-fuse with a ymm destination, but it can with an xmm destination.)

We could also consider unpacking against zero with punpcklbw / hbw to feed 2x punpcklwd / hwd, but that saves PXOR instructions and load uops at the cost of more shuffle uops (6 instead of 4 to get 4 dword vectors).

--

This changed between Clang 4.0.1 and clang 5.0:

# clang4.0.1 inner loop
        pmovzxbd xmm3, dword ptr [rdi + rax] # xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
        pxor     xmm3, xmm1 # ^= set1_epi32(1)
        pand     xmm3, xmm2 # &= set1_epi32(255)
        paddd    xmm0, xmm3

This is less bad (3x the shuffle-port bottleneck on Haswell/Skylake), so this is a regression.

## Other missed optimizations

reporting separately, will link the bug number here for LLVM's failure to efficiently sum 8-bit elements with PSADBW and so on.

-- You are receiving this mail because: You are on the CC list for the bug.___ llvm-bugs mailing list llvm-bugs@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
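The claim in Bug 40685 that the pshufb + second pmovzxbd pair changes nothing can be checked with a scalar model of one loop iteration. The following is only a sketch in portable C, not LLVM code: the function names are hypothetical, each array element stands in for one dword lane, and the pshufb mask <0,4,8,12,...> is modelled as "take the low byte of each dword".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of the clang 5+ inner loop body on 4 input bytes (hypothetical helper). */
void step_with_repack(const uint8_t in[4], uint32_t out[4]) {
    uint32_t wide[4];
    uint8_t packed[4];
    for (int i = 0; i < 4; i++) wide[i] = in[i];               /* pmovzxbd: zero-extend bytes */
    for (int i = 0; i < 4; i++) wide[i] ^= 1u;                 /* pxor with set1_epi32(1) */
    for (int i = 0; i < 4; i++) packed[i] = (uint8_t)wide[i];  /* pshufb: keep low byte of each dword */
    for (int i = 0; i < 4; i++) out[i] = packed[i];            /* second pmovzxbd */
}

/* Model of the shorter sequence: pmovzxbd + pxor only (hypothetical helper). */
void step_direct(const uint8_t in[4], uint32_t out[4]) {
    for (int i = 0; i < 4; i++) out[i] = (uint32_t)in[i] ^ 1u;
}

/* Compares both models for every possible byte value in every lane. */
void check_repack_is_redundant(void) {
    for (int v = 0; v < 256; v++) {
        uint8_t in[4] = { (uint8_t)v, (uint8_t)v, (uint8_t)v, (uint8_t)v };
        uint32_t a[4], b[4];
        step_with_repack(in, a);
        step_direct(in, b);
        assert(memcmp(a, b, sizeof a) == 0);
    }
}
```

The two models agree because the zero-extended value fits in the low byte and XOR with 1 only touches bit 0, so truncating to a byte and zero-extending again is the identity.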
[llvm-bugs] [Bug 40686] New: Use PSADBW for horizontal uint8_t byte sums (and accumulate multiple booleans before using it), instead of widening right away
https://bugs.llvm.org/show_bug.cgi?id=40686 Bug ID: 40686 Summary: Use PSADBW for horizontal uint8_t byte sums (and accumulate multiple booleans before using it), instead of widening right away Product: new-bugs Version: trunk Hardware: PC OS: Linux Status: NEW Keywords: performance Severity: enhancement Priority: P Component: new bugs Assignee: unassignedb...@nondot.org Reporter: pe...@cordes.ca CC: htmldevelo...@gmail.com, llvm-bugs@lists.llvm.org

Same code as PR40685, but this bug is about the other major optimizations that are possible, not the silly extra shuffles:

// https://godbolt.org/z/1SEmTu (same Godbolt link as the other PR)
int count(const bool *visited, int len) {
    int counter = 0;
    for(int i=0;i<100;i++) { // len unused or not doesn't matter
        if (visited[i]==0)
            counter++;
    }
    return counter;
}

(adapted from: https://stackoverflow.com/questions/54618685/what-is-the-meaning-use-of-the-movzx-cdqe-instructions-in-this-code-output-by-a)

At best, once Bug 40685 is fixed, clang / LLVM is probably doing something like

        pmovzxbd xmm3, dword ptr [rdi + rax]
        pxor     xmm3, xmm1        # flip the bits
        paddd    xmm0, xmm3

This costs us a shuffle and 2 other ALU instructions per 4 bools, regardless of unrolling.

## Other missed optimizations

For large arrays, we only need ~1.25 vector ALU instructions (and a pure load) plus loop overhead per 16 / 32 / 64 bools (1 per vector with an extra ALU amortized over a minor unroll by 4). See below.

Instead of flipping inside the loop, we can do

    for() notcount += arr[i];
    return 100-notcount;

We can prove that this can't overflow notcount, because len is small enough and bool->int can only be 0 or 1. Byte counters won't overflow for a compile-time-constant array size of 100, so we can simply sum outside the loop.
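The source-level rewrite described above (sum the true entries, then subtract from the fixed size) can be written out in plain C. This is a scalar illustration only, not what the vectorizer would emit; the names `count_zeros_ref` and `count_zeros_by_sum` and the constant `N` are introduced here for the example, and the single `uint8_t` accumulator stands in for one byte lane of a vector accumulator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { N = 100 };  /* the compile-time-constant trip count from the report */

/* Reference version: count the false entries directly, as in the report. */
int count_zeros_ref(const bool *visited) {
    int counter = 0;
    for (int i = 0; i < N; i++) {
        if (visited[i] == 0)
            counter++;
    }
    return counter;
}

/* Rewritten version: sum the true entries in a byte, invert once at the end. */
int count_zeros_by_sum(const bool *visited) {
    uint8_t notcount = 0;  /* each bool adds 0 or 1, so the sum is at most N <= 255 */
    for (int i = 0; i < N; i++)
        notcount = (uint8_t)(notcount + visited[i]);
    assert(notcount <= N);  /* holds because the byte counter cannot wrap */
    return N - notcount;
}
```

Because the sum never exceeds 100, the byte accumulator cannot wrap, which is the property that would let a vectorized version keep byte elements until a single horizontal sum after the loop.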
count:                          # for hard-coded size=100, fully unrolled
        vmovdqu  ymm0, [rdi]
        vpaddb   ymm0, [rdi+32]         # asm syntax leaving out first src = dst operand
        vpaddb   ymm0, [rdi+64]         ; 3 * 32 = 96 elements fully unrolled
        movd     xmm1, [rdi+96]
        vpaddb   ymm0, ymm1             # add the last 4 elements
        # then hsum the 32x byte accumulators
        vpxor    xmm1, xmm1
        vpsadbw  ymm0, ymm1             # hsum unsigned bytes into 64-bit elements
        vextracti128 xmm1, ymm0, 1
        vpaddd   xmm0, xmm1
        vpunpckhqdq xmm1, xmm0,xmm0     # saves an imm8 of code size vs. vpshufd
        vpaddd   xmm0, xmm1
        # and do 100 - that
        vmovd    edx, xmm0              # original source used int, so it's only a 32-bit result
        mov      eax, 100
        sub      eax, edx
        ret

Probably actually best to vextracti128 + paddb first, then psadbw xmm (if we care about Excavator / Ryzen), but that would make overflow of byte elements possible for sizes half as large. If we're using that hsum of bytes as a canned sequence, probably easiest to have one that works for all sizes up to 255 vectors. It's a minor difference, like 1 extra ALU uop.

32 * 255 = 8160 fits in 16 bits, so it really doesn't matter what SIMD element size we use on the result of psadbw. PADDQ is slower than PADDD/W/B on Atom/Silvermont including Goldmont Plus (non-AVX CPUs only), so we might as well avoid it so the same auto-vectorization pattern is good with just SSE2 / SSE4 on those CPUs.

Signed int overflow is undefined behaviour, but this is counting something different than the source; the C abstract machine never has a sum of the true elements in an `int`. So we'd better only do it in a way that can't overflow.

## For unknown array sizes, we can use psadbw inside the inner loop.

e.g. (Without AVX, we'd either need an alignment guarantee or separate movdqu loads, so for large len reaching an alignment boundary could become valuable to avoid front-end bottlenecks.)

        vpxor    ymm1, ymm1
.loop
        vmovdqu  ymm0, [rdi]
        vpaddb   ymm0, [rdi+32]         # leaving out first src = dst
        vpaddb   ymm0, [rdi+64]
        sub      rdi, -128
        vpaddb   ymm0, [rdi + 96 - 128] # still using a disp8 addressing mode
        vpsadbw  ymm0, ymm1             # hsum bytes to qword elements
        vpaddq   ymm2, ymm0             # accumulate into a qword vector
        cmp      rdi, rsi
        jb       .loop

        return len - hsum(ymm2)

Or if we can't / don't want to sink the boolean inversion out of the loop, we can start with a vector of set1_epi8( 4 ) and do

        vpsubb   ymm0, ymm3, [rdi]      ; 4 - v0. ymm3 = set1(4)
        vpsubb   ymm0, ymm0, [rdi+32]   ; . - v1
        ...

Or other even-less optimal ways of flipping inside the loop before summing. In the general case of counting compare results, vpcmpeqb / vpsubb is good.

---

For hot loops we can put off psadbw for up to 255 iterations (254 vpaddb) without