[Bug tree-optimization/52459] New: [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt

2012-03-01 Thread M8R-ynb11d at mailinator dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

 Bug #: 52459
   Summary: [x86] loop vectorization performance very bad (worse
than -O0) when using sse4.2 popcnt
Classification: Unclassified
   Product: gcc
   Version: 4.6.3
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: m8r-ynb...@mailinator.com


Created attachment 26808
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26808
testcase

gcc 4.6.3 on x86_64-unknown-linux-gnu, running on Core i7 2600K (Sandy Bridge)

The attached testcase simply exercises the popcnt instruction over every
unsigned int and creates a histogram.  But with -O2 -ftree-vectorize or with
-O3, the vectorizer adds two popcnt instructions per loop iteration, which
makes performance worse than the unoptimized version, and about 3x slower than
-Os.
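
The attachment itself is not reproduced in this thread, so the following is only a minimal sketch of the kind of loop described above. The function name `popcount_histogram`, its parameters, and the 33-entry histogram (one bin per possible bit count of a 32-bit value) are assumptions made for illustration, not the reporter's actual code.

```cpp
#include <cstdint>

// Hypothetical stand-in for the attached testcase (attachment 26808 is not
// shown here). Counts how many values in [first, last] have each bit count.
// hist must point to 33 entries, one for each possible result 0..32.
void popcount_histogram(std::uint32_t first, std::uint32_t last,
                        std::uint64_t *hist)
{
    for (int bin = 0; bin < 33; ++bin)
        hist[bin] = 0;

    std::uint32_t i = first;
    for (;;) {
        // With -msse4.2 (or -mpopcnt) GCC can expand this builtin to a
        // single popcnt instruction; otherwise it calls a library routine.
        hist[__builtin_popcount(i)]++;
        if (i == last)
            break;  // test before incrementing so last == UINT32_MAX works
        ++i;
    }
}
```

Calling `popcount_histogram(0, UINT32_MAX, hist)` would cover every unsigned int as the report describes; a smaller range is enough to check that the bins come out right.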

Here are the timings and the resulting asm of the loop:

With -O0 -m32 -msse4.2: [7.40 seconds]
.L2:
    mov     eax, DWORD PTR [ebp-12]
    add     DWORD PTR [ebp-12], 1
    popcnt  eax, eax
    mov     edx, DWORD PTR [ebp-144+eax*4]
    add     edx, 1
    mov     DWORD PTR [ebp-144+eax*4], edx
    cmp     DWORD PTR [ebp-12], 0
    jne     .L2


With -O1 -m32 -msse4.2: [2.90 seconds]
.L2:
    lea     edx, [eax+1]
    popcnt  eax, eax
    add     DWORD PTR [esp+12+eax*4], 1
    mov     eax, edx
    test    edx, edx
    jne     .L2


With -O2 -m32 -msse4.2: [2.91 seconds]
.L5:
    popcnt  edx, eax
    mov     ecx, DWORD PTR [esp+12+edx*4]
    add     eax, 1
.L3:
    add     ecx, 1
    test    eax, eax
    mov     DWORD PTR [esp+12+edx*4], ecx
    jne     .L5


With -Os -m32 -msse4.2: [2.82 seconds]
.L2:
    popcnt  edx, eax
    inc     DWORD PTR [ebp-136+edx*4]
    inc     eax
    jne     .L2


With -O3 -m32 -msse4.2: [8.45 seconds]
.L5:
    popcnt  edx, eax
    mov     edx, DWORD PTR [esp+edx*4]
.L3:
    popcnt  ecx, eax
    add     edx, 1
    add     eax, 1
    mov     DWORD PTR [esp+ecx*4], edx
    jne     .L5


Things are about the same (relatively) with -m64 but somewhat slower, I'm
assuming due to the extra edx -> rdx sign extension step.


[Bug tree-optimization/52459] [x86] loop vectorization performance very bad (worse than -O0) when using sse4.2 popcnt

2012-03-01 Thread M8R-ynb11d at mailinator dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52459

--- Comment #1 from M8R-ynb11d at mailinator dot com 2012-03-02 07:11:47 UTC ---
Similar (but much slower) results when not using SSE and using the libgcc
library version of __builtin_popcount:

-O0: 22.55 secs
-O1: 20.57 secs
-O2: 22.48 secs
-Os: 22.81 secs
-O3: 45.17 secs
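
For comparison, a software population count can be written without any library call. The sketch below uses the classic parallel bit-counting technique; it is shown only to illustrate what a non-popcnt fallback has to do and is not claimed to match libgcc's implementation. The name `soft_popcount32` is hypothetical.

```cpp
#include <cstdint>

// Illustrative software popcount (parallel bit counting). This is NOT
// libgcc's routine, just one well-known way to count bits without popcnt.
unsigned soft_popcount32(std::uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 // 2-bit partial sums
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); // 4-bit partial sums
    x = (x + (x >> 4)) & 0x0f0f0f0fu;                 // 8-bit partial sums
    return (x * 0x01010101u) >> 24;                   // add the four bytes
}
```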


[Bug lto/79587] New: ICE in streamer_write_gcov_count_stream, at data-streamer-out.c:343 while building Python 3.6.0 with PGO and LTO

2017-02-17 Thread M8R-ynb11d at mailinator dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79587

Bug ID: 79587
   Summary: ICE in streamer_write_gcov_count_stream, at
data-streamer-out.c:343 while building Python 3.6.0
with PGO and LTO
   Product: gcc
   Version: 6.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: lto
  Assignee: unassigned at gcc dot gnu.org
  Reporter: M8R-ynb11d at mailinator dot com
  Target Milestone: ---

Created attachment 40766
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40766&action=edit
preprocessed source

gcc version 6.3.0 on x86_64 linux

-
$ gcc-6.3 -v
Using built-in specs.
COLLECT_GCC=gcc-6.3
COLLECT_LTO_WRAPPER=/home/user/tools/gcc/inst-6.3.0/libexec/gcc/x86_64-pc-linux-gnu/6.3.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-6.3.0/configure
--prefix=/home/user/tools/gcc/inst-6.3.0 --program-suffix=-6.3
--enable-languages=c,c++ --enable-checking=release --disable-werror
--with-build-config=bootstrap-lto
Thread model: posix
gcc version 6.3.0 (GCC) 
-

Failing command:

-
gcc-6.3 -pthread -fPIC -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv
-O3 -Wall -Wstrict-prototypes -march=native -g0 -std=c99 -Wextra
-Wno-unused-result -Wno-unused-parameter -Wno-missing-field-initializers
-fprofile-use -fprofile-correction -flto -fuse-linker-plugin -ffat-lto-objects
-flto-partition=none -DCONFIG_64=1 -DASM=1
-I/home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec -I./Include
-I. -I/usr/include/x86_64-linux-gnu -I/usr/local/include
-I/home/user/tools/python/Python-3.6.0/Include
-I/home/user/tools/python/Python-3.6.0 -c
/home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec/crt.c -o
build/temp.linux-x86_64-3.6/home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec/crt.o
/home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec/crt.c: In
function ‘crt3’:
/home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec/crt.c:177:1:
internal compiler error: in streamer_write_gcov_count_stream, at
data-streamer-out.c:343
 }
 ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
-

The software being built is Python 3.6.0 with PGO and LTO enabled.  Steps to
reproduce:

-
wget https://www.python.org/ftp/python/3.6.0/Python-3.6.0.tar.xz
tar xf Python-3.6.0.tar.xz
cd Python-3.6.0
./configure --prefix=$HOME/tools/python/inst-3.6 --with-lto
--enable-optimization CC=gcc-6.3 AR=gcc-ar-6.3 RANLIB=gcc-ranlib-6.3
CFLAGS_NODIST="-march=native -g0"
make profile-opt
-

Note that the problem only manifests when profile data is present, and the
Python build system does not stop after the above error and continues to delete
the profile data, which makes it impossible to repeat the problem just by
re-running the failing command.  This can be worked around by running the
following series of commands instead of 'make profile-opt':

-
make clean
make profile-removal
make build_all_generate_profile
make profile-removal
make run_profile_task
make build_all_merge_profile
make clean
make build_all_use_profile
-

The resulting .gcda file and .i preprocessed source are enough to reproduce the
problem; here is the reduced invocation with crt.gcda and crt.i (attached):

-
gcc-6.3 -pthread -fPIC -fwrapv -O3 -std=c99 -fprofile-use -fprofile-correction
-flto -c crt.i -o crt.o
/home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec/crt.c: In
function ‘crt3’:
/home/user/tools/python/Python-3.6.0/Modules/_decimal/libmpdec/crt.c:177:1:
internal compiler error: in streamer_write_gcov_count_stream, at
data-streamer-out.c:343
 }
 ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
-

[Bug lto/79587] ICE in streamer_write_gcov_count_stream, at data-streamer-out.c:343 while building Python 3.6.0 with PGO and LTO

2017-02-17 Thread M8R-ynb11d at mailinator dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79587

--- Comment #1 from M8R-ynb11d at mailinator dot com ---
Created attachment 40767
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40767&action=edit
profile data

[Bug target/62642] [4.8/4.9/5 Regression] x86 rdtsc is moved through barrier

2014-12-17 Thread M8R-ynb11d at mailinator dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62642

--- Comment #5 from M8R-ynb11d at mailinator dot com ---
I originally put the barriers there in a futile attempt to work around the bug.
Can anyone tell me whether I actually need them, or whether the intrinsic
carries with it an implicit built-in barrier to prevent reordering?  Ideally
I'd like to write portable code using only intrinsics and not gcc-specific
asm() stuff, so I hope that it's the latter.


[Bug target/62642] New: x86 rdtsc is moved through barrier

2014-09-01 Thread M8R-ynb11d at mailinator dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62642

Bug ID: 62642
   Summary: x86 rdtsc is moved through barrier
   Product: gcc
   Version: 4.8.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: M8R-ynb11d at mailinator dot com

given:

unsigned long long measure(void (*func)(void))
{
    unsigned long long before = __builtin_ia32_rdtsc();
    asm volatile("" ::: "memory");
    func();
    asm volatile("" ::: "memory");
    unsigned long long after = __builtin_ia32_rdtsc();
    return after - before;
}

On x86 linux with -O2, this results in the obviously useless:

measure:
    push    edi
    push    esi
    push    ebx
    call    [DWORD PTR [esp+16]]
    rdtsc
    mov     esi, eax
    mov     edi, edx
    rdtsc
    pop     ebx
    sub     esi, eax
    sbb     edi, edx
    mov     eax, esi
    mov     edx, edi
    pop     esi
    pop     edi
    ret

I can reproduce the problem on 32 bit x86 on Linux and MinGW with 4.8.2 and
4.9.1. (I guess 4.8.0 also exhibits the problem but I don't have that available
for testing, so I set the version to 4.8.2 in the report.)

4.7.x and 4.6.x work correctly, as does 64 bit on both platforms.

If this is not the proper sanctioned way to write this function, I'm all ears
to a better way.  I've tried also adding calls to __builtin_ia32_mfence() which
as I understand it should not be necessary, and it gets even more comical:

    ...
    mfence
    call    [DWORD PTR [esp+16]]
    mfence
    rdtsc
    mov     esi, eax
    mov     edi, edx
    rdtsc
    ...
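
For reference, an intrinsics-only variant of the function might look like the sketch below. It assumes an x86 target where `<x86intrin.h>` provides `__rdtsc()` and `_mm_lfence()` (on 32-bit x86 this may need -msse2). Intel documents LFENCE before RDTSC as a way to keep the CPU from executing RDTSC ahead of earlier instructions; that is a hardware ordering property and does not by itself promise that the compiler keeps the statements in source order, which is the question this report raises.

```cpp
#include <x86intrin.h>

// Sketch only: times func() in TSC ticks using intrinsics instead of asm().
// Assumes an x86 target with SSE2 available for _mm_lfence().
unsigned long long measure(void (*func)())
{
    _mm_lfence();                           // let earlier instructions finish
    unsigned long long before = __rdtsc();
    _mm_lfence();                           // keep func() after the first read

    func();

    _mm_lfence();                           // let func() finish before reading
    unsigned long long after = __rdtsc();
    _mm_lfence();

    return after - before;
}
```

Checking the generated assembly (for example with -S -masm=intel) remains the only way to confirm that both rdtsc instructions ended up on the intended sides of the call.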


[Bug tree-optimization/63446] New: dangling reference results in confusing diagnostic from -Wuninitialized

2014-10-02 Thread M8R-ynb11d at mailinator dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63446

Bug ID: 63446
   Summary: dangling reference results in confusing diagnostic
from -Wuninitialized
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: M8R-ynb11d at mailinator dot com

struct foo {
    int &ref;
    foo(int &i) : ref(i) {}
};

foo make_foo()
{
    int x = 42;
    return foo(x);
}

int func()
{
    foo f = make_foo();
    return f.ref;
}

This code is obviously broken due to the dangling reference, so I'm glad gcc
gives a warning (clang is silent) but the warning is a bit confusing:

$ g++ -O2 -Wall -c wuninit.cpp
wuninit.cpp: In function ‘int func()’:
wuninit.cpp:15:14: warning: ‘x’ is used uninitialized in this function
[-Wuninitialized]
 return f.ref;
  ^

I get that the diagnostic is generated after inlining has moved x into func(),
but it's still rather confusing as the person that wrote func() might have no
knowledge of the internals of make_foo(), and this would be a real head
scratcher for them.  Additionally, it mentions x being used uninitialized, but
x is initialized.  (I understand that the initialization becomes dead code and
is removed, but that's not immediately obvious.)

In an ideal world gcc would warn about the last line of make_foo() instead of
func(), and it would mention a dangling reference instead of an uninitialized
use.
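
For contrast, one way to remove the dangling reference is to let the caller own the referenced object. This is only a sketch: the extra `storage` parameter is an assumption introduced here for illustration and changes the interface of `make_foo()` from the example above.

```cpp
struct foo {
    int &ref;
    foo(int &i) : ref(i) {}
};

// Hypothetical variant: the int lives in the caller, so it outlives the
// foo that refers to it.
foo make_foo(int &storage)
{
    storage = 42;
    return foo(storage);
}

int func()
{
    int x = 0;           // owned by func(), alive for the whole function
    foo f = make_foo(x); // f.ref refers to x, which is still in scope
    return f.ref;
}
```

With this shape no object is read after its lifetime ends, so the uninitialized-use reasoning described above no longer applies to `func()`.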