[Bug tree-optimization/52459] New: [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

Bug #: 52459
Summary: [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt
Classification: Unclassified
Product: gcc
Version: 4.6.3
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: m8r-ynb...@mailinator.com

Created attachment 26808
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26808
testcase

gcc 4.6.3 on x86_64-unknown-linux-gnu, running on Core i7 2600K (Sandy Bridge)

The attached testcase simply exercises the popcnt instruction over every unsigned int and creates a histogram. But with -O2 -ftree-vectorize or with -O3, the vectorizer adds two popcnt instructions per loop iteration, which makes performance worse than the unoptimized version, and about 3x slower than -Os. Here's the timings and the resulting asm of the loop:

With -O0 -m32 -msse4.2: [7.40 seconds]

.L2:
    mov     eax, DWORD PTR [ebp-12]
    add     DWORD PTR [ebp-12], 1
    popcnt  eax, eax
    mov     edx, DWORD PTR [ebp-144+eax*4]
    add     edx, 1
    mov     DWORD PTR [ebp-144+eax*4], edx
    cmp     DWORD PTR [ebp-12], 0
    jne     .L2

With -O1 -m32 -msse4.2: [2.90 seconds]

.L2:
    lea     edx, [eax+1]
    popcnt  eax, eax
    add     DWORD PTR [esp+12+eax*4], 1
    mov     eax, edx
    test    edx, edx
    jne     .L2

With -O2 -m32 -msse4.2: [2.91 seconds]

.L5:
    popcnt  edx, eax
    mov     ecx, DWORD PTR [esp+12+edx*4]
    add     eax, 1
.L3:
    add     ecx, 1
    test    eax, eax
    mov     DWORD PTR [esp+12+edx*4], ecx
    jne     .L5

With -Os -m32 -msse4.2: [2.82 seconds]

.L2:
    popcnt  edx, eax
    inc     DWORD PTR [ebp-136+edx*4]
    inc     eax
    jne     .L2

With -O3 -m32 -msse4.2: [8.45 seconds]

.L5:
    popcnt  edx, eax
    mov     edx, DWORD PTR [esp+edx*4]
.L3:
    popcnt  ecx, eax
    add     edx, 1
    add     eax, 1
    mov     DWORD PTR [esp+ecx*4], edx
    jne     .L5

Things are about the same (relatively) with -m64 but somewhat slower, I'm assuming due to the extra edx -> rdx sign extension step.
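The attachment is not reproduced in this message. As a rough sketch of the kind of loop described above (a popcount histogram), something like the following could be used. This is a hypothetical reconstruction, not the attached testcase: the function name `popcount_histogram` and the `limit` parameter are assumptions, added so that the loop can also be run over a small range.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reconstruction, NOT the attached testcase.
// Counts how many values in [0, limit) have each possible population count.
// hist must have room for 33 buckets (popcounts 0 through 32).
// limit is assumed to be at most 2^32, so every value fits in an unsigned int.
void popcount_histogram(std::uint64_t limit, unsigned hist[33])
{
    for (int i = 0; i < 33; ++i)
        hist[i] = 0;

    for (std::uint64_t v = 0; v < limit; ++v) {
        // GCC builtin; compiles to a single popcnt when built with -msse4.2,
        // otherwise to a call into libgcc.
        int bits = __builtin_popcount(static_cast<unsigned>(v));
        assert(bits >= 0 && bits <= 32);
        ++hist[bits];
    }
}
```

Calling it with a limit of `1ULL << 32` would cover every unsigned int, as the report describes; a small limit is enough to check that the buckets are filled correctly.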
[Bug tree-optimization/52459] [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

--- Comment #1 from M8R-ynb11d at mailinator dot com 2012-03-02 07:11:47 UTC ---
Similar (but much slower) results when not using SSE and using the libgcc library version of __builtin_popcount:

-O0: 22.55 secs
-O1: 20.57 secs
-O2: 22.48 secs
-Os: 22.81 secs
-O3: 45.17 secs
[Bug lto/79587] New: ICE in streamer_write_gcov_count_stream, at data-streamer-out.c:343 while building Python 3.6.0 with PGO and LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79587

Bug ID: 79587
Summary: ICE in streamer_write_gcov_count_stream, at data-streamer-out.c:343 while building Python 3.6.0 with PGO and LTO
Product: gcc
Version: 6.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: lto
Assignee: unassigned at gcc dot gnu.org
Reporter: M8R-ynb11d at mailinator dot com
Target Milestone: ---

Created attachment 40766
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40766&action=edit
preprocessed source

gcc version 6.3.0 on x86_64 linux

$ gcc-6.3 -v
Using built-in specs.
COLLECT_GCC=gcc-6.3
COLLECT_LTO_WRAPPER=/home/user/tools/gcc/inst-6.3.0/libexec/gcc/x86_64-pc-linux-gnu/6.3.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-6.3.0/configure --prefix=/home/user/tools/gcc/inst-6.3.0 --program-suffix=-6.3 --enable-languages=c,c++ --enable-checking=release --disable-werror --with-build-config=bootstrap-lto
Thread model: posix
gcc version 6.3.0 (GCC)

Failing command:

gcc-6.3 -pthread -fPIC -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -march=native -g0 -std=c99 -Wextra -Wno-unused-result -Wno-unused-parameter -Wno-missing-field-initializers -fprofile-use -fprofile-correction -flto -fuse-linker-plugin -ffat-lto-objects -flto-partition=none -DCONFIG_64=1 -DASM=1 -I/home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec -I./Include -I. -I/usr/include/x86_64-linux-gnu -I/usr/local/include -I/home/user/tools/python/Python-3.6.0/Include -I/home/user/tools/python/Python-3.6.0 -c /home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec/crt.c -o build/temp.linux-x86_64-3.6/home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec/crt.o
/home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec/crt.c: In function ‘crt3’:
/home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec/crt.c:177:1: internal compiler error: in streamer_write_gcov_count_stream, at data-streamer-out.c:343
 }
 ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.

The software being built is Python 3.6.0 with PGO and LTO enabled. Steps to reproduce:

wget https://www.python.org/ftp/python/3.6.0/Python-3.6.0.tar.xz
tar xf Python-3.6.0.tar.xz
cd Python-3.6.0
./configure --prefix=$HOME/tools/python/inst-3.6 --with-lto --enable-optimization CC=gcc-6.3 AR=gcc-ar-6.3 RANLIB=gcc-ranlib-6.3 CFLAGS_NODIST="-march=native -g0"
make profile-opt

Note that the problem only manifests when profile data is present. The Python build system does not stop after the above error; it continues and deletes the profile data, which makes it impossible to repeat the problem just by re-running the failing command. This can be worked around by running the following series of commands instead of 'make profile-opt':

make clean
make profile-removal
make build_all_generate_profile
make profile-removal
make run_profile_task
make build_all_merge_profile
make clean
make build_all_use_profile

The resulting .gcda file and .i preprocessed source are enough to reproduce the problem; here is the reduced invocation with crt.gcda and crt.i (attached):

gcc-6.3 -pthread -fPIC -fwrapv -O3 -std=c99 -fprofile-use -fprofile-correction -flto -c crt.i -o crt.o
/home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec/crt.c: In function ‘crt3’:
/home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec/crt.c:177:1: internal compiler error: in streamer_write_gcov_count_stream, at data-streamer-out.c:343
 }
 ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
[Bug lto/79587] ICE in streamer_write_gcov_count_stream, at data-streamer-out.c:343 while building Python 3.6.0 with PGO and LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79587

--- Comment #1 from M8R-ynb11d at mailinator dot com ---
Created attachment 40767
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40767&action=edit
profile data
[Bug target/62642] [4.8/4.9/5 Regression] x86 rdtsc is moved through barrier
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62642

--- Comment #5 from M8R-ynb11d at mailinator dot com ---
I originally put the barriers there in a futile attempt to work around the bug. Can anyone tell me whether I actually need them, or whether the intrinsic carries with it an implicit built-in barrier to prevent reordering? Ideally I'd like to write portable code using only intrinsics and not gcc-specific asm() stuff, so I hope that it's the latter.
[Bug target/62642] New: x86 rdtsc is moved through barrier
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62642

Bug ID: 62642
Summary: x86 rdtsc is moved through barrier
Product: gcc
Version: 4.8.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: M8R-ynb11d at mailinator dot com

given:

unsigned long long measure(void (*func)(void))
{
    unsigned long long before = __builtin_ia32_rdtsc();
    asm volatile("" ::: "memory");
    func();
    asm volatile("" ::: "memory");
    unsigned long long after = __builtin_ia32_rdtsc();
    return after - before;
}

On x86 linux with -O2, this results in the obviously useless:

measure:
    push    edi
    push    esi
    push    ebx
    call    [DWORD PTR [esp+16]]
    rdtsc
    mov     esi, eax
    mov     edi, edx
    rdtsc
    pop     ebx
    sub     esi, eax
    sbb     edi, edx
    mov     eax, esi
    mov     edx, edi
    pop     esi
    pop     edi
    ret

I can reproduce the problem on 32 bit x86 on Linux and MinGW with 4.8.2 and 4.9.1. (I guess 4.8.0 also exhibits the problem but I don't have that available for testing, so I set the version to 4.8.2 in the report.) 4.7.x and 4.6.x work correctly, as does 64 bit on both platforms.

If this is not the proper sanctioned way to write this function, I'm all ears to a better way. I've tried also adding calls to __builtin_ia32_mfence() which as I understand it should not be necessary, and it gets even more comical:

    ...
    mfence
    call    [DWORD PTR [esp+16]]
    mfence
    rdtsc
    mov     esi, eax
    mov     edi, edx
    rdtsc
    ...
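For comparison only, here is a sketch of a timing helper that avoids both the rdtsc builtin and the gcc-specific asm() barriers by swapping in std::chrono::steady_clock. This is not a fix for the reordering described above, and it is a different technique: it returns elapsed nanoseconds rather than TSC ticks, and it assumes C++11. The name `measure_ns` is hypothetical.

```cpp
#include <chrono>

// Hypothetical helper, not from the report: times one call of func with
// std::chrono::steady_clock and returns elapsed nanoseconds (not TSC ticks).
long long measure_ns(void (*func)(void))
{
    const auto before = std::chrono::steady_clock::now();
    func();  // call through a function pointer, as in the original measure()
    const auto after = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(after - before).count();
}
```

Because steady_clock is monotonic, the result should never be negative. Its resolution and overhead depend on the platform, so this may be too coarse for very short functions.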
[Bug tree-optimization/63446] New: dangling reference results in confusing diagnostic from -Wuninitialized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63446

Bug ID: 63446
Summary: dangling reference results in confusing diagnostic from -Wuninitialized
Product: gcc
Version: 4.6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: M8R-ynb11d at mailinator dot com

struct foo {
    int &ref;
    foo(int &i) : ref(i) {}
};

foo make_foo()
{
    int x = 42;
    return foo(x);
}

int func()
{
    foo f = make_foo();
    return f.ref;
}

This code is obviously broken due to the dangling reference, so I'm glad gcc gives a warning (clang is silent) but the warning is a bit confusing:

$ g++ -O2 -Wall -c wuninit.cpp
wuninit.cpp: In function ‘int func()’:
wuninit.cpp:15:14: warning: ‘x’ is used uninitialized in this function [-Wuninitialized]
    return f.ref;
             ^

I get that the diagnostic is generated after inlining has moved x into func(), but it's still rather confusing as the person that wrote func() might have no knowledge of the internals of make_foo(), and this would be a real head scratcher for them. Additionally, it mentions x being used uninitialized, but x is initialized. (I understand that the initialization becomes dead code and is removed, but that's not immediately obvious.)

In an ideal world gcc would warn about the last line of make_foo() instead of func(), and it would mention a dangling reference instead of an uninitialized use.
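To make the dangling reference concrete, here is a sketch of a variant in which the referenced int outlives the foo object. The changed signature of make_foo() (taking the storage from the caller) is a hypothetical rewrite for illustration, not something proposed in the report.

```cpp
struct foo {
    int &ref;
    foo(int &i) : ref(i) {}
};

// Hypothetical variant: the caller owns the int, so the reference stored in
// the returned foo stays valid after make_foo() returns.
foo make_foo(int &storage)
{
    storage = 42;
    return foo(storage);
}

int func()
{
    int x = 0;            // lives for the whole of func()
    foo f = make_foo(x);  // f.ref refers to func()'s x, not to a dead local
    return f.ref;
}
```

With this layout the referenced object lives in func()'s own frame, so reading f.ref is well defined and the situation the report describes (a reference to a destroyed local of make_foo()) does not arise.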