--- Comment #30 from hjl dot tools at gmail dot com 2009-03-12 20:21
---
Fixed.
--
hjl dot tools at gmail dot com changed:
What|Removed |Added
Status|REOPENE
--- Comment #29 from hjl at gcc dot gnu dot org 2009-03-12 16:08 ---
Subject: Bug 38824
Author: hjl
Date: Thu Mar 12 16:08:02 2009
New Revision: 144817
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=144817
Log:
2009-03-12 H.J. Lu
PR target/38824
* config/i386
--- Comment #28 from hjl dot tools at gmail dot com 2009-03-12 16:00
---
(In reply to comment #25)
> patch committed (the changelog was in gcc-patches :-).
>
This patch caused:
http://gcc.gnu.org/ml/gcc/2009-03/msg00340.html
--
hjl dot tools at gmail dot com changed:
--- Comment #27 from bonzini at gnu dot org 2009-02-16 09:14 ---
Added bugs corresponding to the patch fallout in case distros want to backport
it (it gave quite a nice boost and probably fixed PR21676 too)
--
bonzini at gnu dot org changed:
What|Removed
--- Comment #26 from hjl at gcc dot gnu dot org 2009-02-12 15:45 ---
Subject: Bug 38824
Author: hjl
Date: Thu Feb 12 15:45:20 2009
New Revision: 144129
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=144129
Log:
Mention PR target/38824 in ChangeLog entries.
Modified:
trunk/g
--- Comment #25 from bonzini at gnu dot org 2009-02-11 08:57 ---
patch committed (the changelog was in gcc-patches :-).
--
bonzini at gnu dot org changed:
What|Removed |Added
-
--- Comment #24 from ubizjak at gmail dot com 2009-02-11 08:14 ---
(In reply to comment #23)
> Even though you don't observe the reporter's slowdown from 4.2/4.3 to
> unpatched 4.4, I guess this makes a good case for the patch. Ok for trunk?
OK with a ChangeLog ;)
BTW: Please watch b
--- Comment #23 from bonzini at gnu dot org 2009-02-11 08:01 ---
Subject: Re: [4.4 Regression] performance regression of
sse code from 4.2/4.3
> [xg...@shgcc-9 38824]$ time ./gcc-42.out
> real 0m1.991s
>
> [xg...@shgcc-9 38824]$ time ./gcc-44.out
> real 0m1.880s
>
> [xg...@sh
--- Comment #22 from xuepeng dot guo at intel dot com 2009-02-11 07:37
---
(In reply to comment #18)
> Xuepeng, can you test with the loop as produced by my posted patch, that is:
> .L11:
> movaps (%rsi,%rax), %xmm0
> addps %xmm1, %xmm0
> movaps %xmm0, (%rdi,
--- Comment #21 from bonzini at gnu dot org 2009-02-10 16:39 ---
So my patch should be a uniform win.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
--- Comment #20 from dwarak dot rajagopal at amd dot com 2009-02-10 16:28
---
Paolo,
(a) movaps (%rax, %rsi), %xmm0
addps %xmm0, %xmm1
(b) movaps %xmm0, %xmm1
addps (%rax, %rsi), %xmm1
Yes, case (a) is slightly better than case (b). It shouldn't matter much though
--- Comment #19 from bonzini at gnu dot org 2009-02-09 13:37 ---
Also, Dwarak, here the change is not from
addps (%rax, %rsi), %xmm1
to
movaps (%rax, %rsi), %xmm0
addps %xmm0, %xmm1
but rather from
movaps %xmm0, %xmm1
addps (%rax, %rsi), %xmm1
to the second s
--- Comment #18 from bonzini at gnu dot org 2009-02-09 13:35 ---
Xuepeng, can you test with the loop as produced by my posted patch, that is:
.L11:
movaps (%rsi,%rax), %xmm0
addps %xmm1, %xmm0
movaps %xmm0, (%rdi,%rax)
addq $16, %rax
cmpq
--- Comment #17 from xuepeng dot guo at intel dot com 2009-02-09 09:16
---
Below is the loop from the test case in its original form (compiled by GCC 4.4):
_Z7bench_1PfS_fj:
.LFB2309:
shrl $2, %edx
shufps $0, %xmm0, %xmm0
subl $1, %edx
xorl %eax, %eax
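For orientation, the kernel behind this listing can be sketched in C++ with SSE intrinsics. This is only a guess at its shape, not the attached test case: the signature is inferred from demangling _Z7bench_1PfS_fj as bench_1(float*, float*, float, unsigned int), the destination/source roles of the two pointers are taken from the patched loop quoted in comment #18, and the parameter names dst, src, f and n are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics; x86/x86-64 only

// Hypothetical reconstruction of the benchmarked kernel, not the attachment.
// Adds the scalar f to every element of src and stores the result in dst.
// Both pointers must be 16-byte aligned, because movaps faults otherwise.
void bench_1(float* dst, float* src, float f, unsigned int n)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);

    const __m128 addend = _mm_set1_ps(f);  // broadcast, cf. shufps $0, %xmm0, %xmm0
    const unsigned int vectors = n / 4;    // four floats per vector, cf. shrl $2, %edx

    for (unsigned int i = 0; i < vectors; ++i) {
        const __m128 v = _mm_load_ps(src + 4 * i);         // aligned load (movaps)
        _mm_store_ps(dst + 4 * i, _mm_add_ps(v, addend));  // addps, then aligned store
    }
}
```

Whether the compiler emits a separate movaps load followed by a register-register addps, or a register copy followed by an addps with a memory operand, is the code-generation choice debated in the comments; the source above does not dictate either form.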
--- Comment #16 from hubicka at gcc dot gnu dot org 2009-02-08 12:40
---
Since the splitting peep2 doesn't seem to be a win in general (it wins only when
copy propagation takes place afterwards) and we don't seem to understand what
really makes the testcase faster, I am unassigning myself un
--- Comment #15 from hubicka at gcc dot gnu dot org 2009-02-08 12:36
---
I tested the patch on SPECfp and core and there is not much difference. I
guess without somehow tweaking regalloc there is not much to do about this
problem. Xuepeng, if the testcase is core2-variant sensitive, pe
--- Comment #14 from rob1weld at aol dot com 2009-02-07 16:18 ---
(In reply to comment #8)
> Created an attachment (id=17173)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17173&action=view)
> An extracted test case for this bug.
>
> Hi tim, I extracted this test case from
--- Comment #13 from dwarak dot rajagopal at amd dot com 2009-02-06 22:35
---
> The patch makes GCC generate a movaps load followed by addps. On Core 2 it
> speeds up the testcase from 7s to 6.2s so I guess it works as expected.
>
> The same however does not reproduce on AMD box and
--- Comment #12 from bonzini at gnu dot org 2009-02-06 09:16 ---
There's another peephole2, namely from
[(set (match_operand 0 "register_operand")
(match_operand 1 "register_operand"))
(set (match_operand 0 "register_operand")
(match_operator 3 "arith_or_logical_operator"
--- Comment #11 from rguenth at gcc dot gnu dot org 2009-01-25 17:56
---
We seem to have a lot of similar "sse performance regression" P2 bugs, can
someone make sure that there are no duplicates here?
--
rguenth at gcc dot gnu dot org changed:
What|Removed
--- Comment #10 from tim at klingt dot org 2009-01-24 13:14 ---
btw, i tried the proposed patch ssef, with no big performance difference:
t...@thinkpad:~/sandbox$ time ./a.out
real 0m2.494s
user 0m2.473s
sys 0m0.002s
t...@thinkpad:~/sandbox$ time ./a.out
real 0m2.479s
us
--- Comment #9 from tim at klingt dot org 2009-01-24 09:56 ---
> Hi tim, I extracted this test case from your website. But I can't exactly
> reproduce this bug on my machine with a Core 2 Quad microprocessor. Can you
> help me check whether my test case is valid first? Here I post
--- Comment #8 from xuepeng dot guo at intel dot com 2009-01-24 05:12
---
Created an attachment (id=17173)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17173&action=view)
An extracted test case for this bug.
Hi tim, I extracted this test case from your website. But I can't exact
--
rguenth at gcc dot gnu dot org changed:
What|Removed |Added
Keywords||missed-optimization
Summary|[4.4 regression] performance|[
--- Comment #7 from hubicka at ucw dot cz 2009-01-15 01:49 ---
Subject: Re: [4.4 regression] performance regression of sse code from 4.2/4.3
I guess the main difference here is that the load + addps pair generates 2
uops, while mov + loading addps generates 3 since the move has to go
through
--- Comment #6 from hjl dot tools at gmail dot com 2009-01-15 01:25 ---
(In reply to comment #5)
>
> H.J., perhaps you have some advice here? Or at least can we do some
> benchmarking?
>
Joey and Xuepeng are looking into it.
--- Comment #5 from hubicka at gcc dot gnu dot org 2009-01-15 00:30 ---
Created an attachment (id=17106)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17106&action=view)
Proposed patch
The patch makes GCC generate a movaps load followed by addps. On Core 2 it
speeds up the testc
--- Comment #4 from hubicka at gcc dot gnu dot org 2009-01-14 20:31 ---
Actually, perhaps in a simple case like this even peep2 will work, since
copyprop will fix it later. I am trying to add the peep
--
hubicka at gcc dot gnu dot org changed:
What|Removed
--- Comment #3 from hubicka at gcc dot gnu dot org 2009-01-14 20:20 ---
It might be the IRA change. Chips generally prefer separate load and execute
instructions, as in the old loop, over the combined load+execute form, since
they are easier to retire.
Splitting the instruction post reload probably won't d
--- Comment #2 from tim at klingt dot org 2009-01-13 16:22 ---
(In reply to comment #1)
> I don't see how these changes could cause more branch misses. If you do the
> same .palign for the 4.4 code does the regression vanish? I would suspect
> that the loop-stream detector catches one b
--- Comment #1 from rguenth at gcc dot gnu dot org 2009-01-13 15:07 ---
I don't see how these changes could cause more branch misses. If you do the
same .palign for the 4.4 code does the regression vanish? I would suspect
that the loop-stream detector catches one but not the other form