/pop in pro/epilogue for modern CPUs
> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
> SP adjustment instead of a sequence of pushes/pops. The preference for
> MOVs is good for old CPU micro-architectures (before pentium-4,
> K10), because it breaks the dat
Sharif
Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Ahmad has helped with some Atom performance testing (ChromeOS
benchmarks) with this patch. In summary, no statistically
significant regression is seen. There is one improvement of about +1.9%
(v8 benchmark) which looks real.
David
On Wed, Dec 12, 2012 at 9:24 AM, Xinliang David Li wrote:
On Thu, Dec 20, 2012 at 7:06 AM, Jan Hubicka wrote:
>> > Hi Areg,
>> >
>> > Did you mean inlined memcpy/memset are as fast as
>> > the ones in libc.so on both ia32 and Intel64?
>>
>> I would be interested in output of the stringop script.
>
> Also as far as I can remember, none of spec2k6 benchmar
> > Hi Areg,
> >
> > Did you mean inlined memcpy/memset are as fast as
> > the ones in libc.so on both ia32 and Intel64?
>
> I would be interested in output of the stringop script.
Also as far as I can remember, none of spec2k6 benchmarks is really stringop
bound. On Spec2k GCC was quite bound
> Hi Areg,
>
> Did you mean inlined memcpy/memset are as fast as
> the ones in libc.so on both ia32 and Intel64?
I would be interested in output of the stringop script.
>
> Please keep in mind that memcpy/memset in libc.a
> may not be optimized. You must not use -static for
> linking.
In my se
David Li; GCC Patches; Teresa Johnson;
> Melik-adamyan, Areg
> Subject: Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
>
> On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka wrote:
>>> > Here we speak about memcpy/memset only. I never got around to
>
push/pop in pro/epilogue for modern CPUs
On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka wrote:
>> > Here we speak about memcpy/memset only. I never got around to
>> > modernizing strlen and friends, unfortunately...
>> >
>> > memcmp and friends are differ
> > me libc starts to be win only for rather large blocks (i.e. >8KB)
> >
>
> Which glibc are you using?
2.15 as it comes with opensuse 12.2
Honza
>
> --
> H.J.
On Thu, Dec 13, 2012 at 12:40 PM, Jan Hubicka wrote:
>> > Here we speak about memcpy/memset only. I never got around to modernizing
>> > strlen and friends, unfortunately...
>> >
>> > memcmp and friends are different beasts. They really need some TLC...
>>
>> memcpy and memset in glibc are also extr
> > Here we speak about memcpy/memset only. I never got around to modernizing
> > strlen and friends, unfortunately...
> >
> > memcmp and friends are different beasts. They really need some TLC...
>
> memcpy and memset in glibc are also extremely fast.
The default strategy now is to inline only whe
On Thu, Dec 13, 2012 at 12:26 PM, Jan Hubicka wrote:
>> On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek wrote:
>> > On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> >> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote:
>> >> >> > libcall is not faster up to 8KB to rep seque
> On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek wrote:
> > On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
> >> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote:
> >> >> > libcall is not faster up to 8KB to rep sequence that is better for
> >> >> > regalloc/code
> >> >> >
On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote:
>> >> > libcall is not faster up to 8KB to rep sequence that is better for
>> >> > regalloc/code
>> >> > cache than f
> Try the following one. 1) -minline-all-stringops
> -mstringop-strategy=rep_8byte -O2 vs 2) -mstringop-strategy=libcall
> -O2.
>
> David
>
>
> #include
> #include
> #include
> #ifndef LEN
> #define LEN 16
> #endif
>
> void copy(char* s1, char* s2,int len) __attribute__((noinline));
> void c
On Thu, Dec 13, 2012 at 7:21 AM, Jakub Jelinek wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote:
>> >> > libcall is not faster up to 8KB to rep sequence that is better for
>> >> > regalloc/code
>> >> > cache than fu
Try the following one. 1) -minline-all-stringops
-mstringop-strategy=rep_8byte -O2 vs 2) -mstringop-strategy=libcall
-O2.
David
#include
#include
#include
#ifndef LEN
#define LEN 16
#endif
void copy(char* s1, char* s2,int len) __attribute__((noinline));
void copy(char* s1, char* s2,int len)
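The listing above is cut off in the archive. As a rough illustration only, and not the original test program, a comparable microbenchmark could look like the sketch below. The `copy` signature and the `LEN` macro follow the snippet; the header choice, the `time_copies` helper and its iteration count are assumptions made for this example.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

#ifndef LEN
#define LEN 16          /* block size in bytes, overridable with -DLEN=... */
#endif

/* Declared noinline, as in the snippet above, so the memcpy is expanded
   inside this function rather than in the timing loop. */
void copy(char *dst, const char *src, int len) __attribute__((noinline));
void copy(char *dst, const char *src, int len)
{
    memcpy(dst, src, (size_t)len);
}

/* Hypothetical helper: copies a LEN-byte buffer `iterations` times and
   returns the elapsed CPU time in seconds. */
double time_copies(long iterations)
{
    static char src[LEN], dst[LEN];
    memset(src, 'x', sizeof src);

    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        copy(dst, src, LEN);
    clock_t end = clock();

    /* After at least one copy, dst must match src. */
    if (iterations > 0)
        assert(memcmp(dst, src, LEN) == 0);

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

To compare the two strategies, one would add a small main that calls time_copies and build it twice, once with `-O2 -minline-all-stringops -mstringop-strategy=rep_8byte` and once with `-O2 -mstringop-strategy=libcall`, varying `-DLEN=...` to see where the libcall starts to win.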
On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote:
> >> > libcall is not faster up to 8KB to rep sequence that is better for
> >> > regalloc/code
> >> > cache than a fully blown function call.
> >>
> >> Be careful with this. My
On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka wrote:
>> > libcall is not faster up to 8KB to rep sequence that is better for
>> > regalloc/code
>> > cache than a fully blown function call.
>>
>> Be careful with this. My recollection is that REP sequence is good for
>> any size -- for smaller size,
> > libcall is not faster up to 8KB to rep sequence that is better for
> > regalloc/code
> > cache than a fully blown function call.
>
> Be careful with this. My recollection is that REP sequence is good for
> any size -- for smaller size, the REP initial set up cost is too high
> (10s of cycles),
On Wed, Dec 12, 2012 at 4:16 PM, Xinliang David Li wrote:
> On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka wrote:
>> Concerning 1 push per cycle, I think it is the same as what K7 hardware did, so the move
>> prologue should be a win.
>>> > Index: config/i386/i386.c
>>> > ===
On Wed, Dec 12, 2012 at 10:30 AM, Jan Hubicka wrote:
> Concerning 1 push per cycle, I think it is the same as what K7 hardware did, so the move
> prologue should be a win.
>> > Index: config/i386/i386.c
>> > ===
>> > --- config/i386/i386.c (revisi
Andi Kleen writes:
>
>>> >/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict
>>> > more
>>> > than 4 branch instructions in the 16 byte window. */
>>> > - m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
>>> > + m_PPRO | m_P4_NOCONA | m_ATOM |
> Jan Hubicka writes:
> >
> > libcall is not faster up to 8KB to rep sequence that is better for
> > regalloc/code
> > cache than a fully blown function call.
>
> I noticed btw that some of the generated string instructions are slower
> than just calling the C library.
>
> rep scasb etc. is rar
Jan Hubicka writes:
>
> libcall is not faster up to 8KB to rep sequence that is better for
> regalloc/code
> cache than a fully blown function call.
I noticed btw that some of the generated string instructions are slower
than just calling the C library.
rep scasb etc. is rarely a win over an op
Concerning 1 push per cycle, I think it is the same as what K7 hardware did, so the move
prologue should be a win.
> > Index: config/i386/i386.c
> > ===
> > --- config/i386/i386.c (revision 194452)
> > +++ config/i386/i386.c (working copy)
> > @@
Honza, can you explain each change and point to the reference?
thanks,
David
On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka wrote:
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference for
>> MOVs is good
On Wed, Dec 12, 2012 at 8:37 AM, Jan Hubicka wrote:
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference for
>> MOVs is good for old CPU micro-architectures (before pentium-4,
>> K10), because it breaks t
> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
> SP adjustment instead of a sequence of pushes/pops. The preference for
> MOVs is good for old CPU micro-architectures (before pentium-4,
> K10), because it breaks the data dependency. In modern
> micro-architecture, push
On Tue, Dec 11, 2012 at 11:53 PM, Xinliang David Li wrote:
> The following is the O2 size data from SPEC2k. Note that with push/pop,
> it is always a net win (negative delta) in terms of total binary or
> total loadable section size.
Thanks for the data!
Richard.
> thanks,
>
> David
>
>
Some SPEC2k performance numbers (with 3 runs on core2):
Push wins over move on 3 benchmarks. The others are noise.
perlbmk: ~+1.9%
gap: ~+1.4%
vortex: ~+0.7%
David
On Tue, Dec 11, 2012 at 2:53 PM, Xinliang David Li wrote:
> The following is the O2 size data from SPEC2k. Note that with pus
The following is the O2 size data from SPEC2k. Note that with push/pop,
it is always a net win (negative delta) in terms of total binary or
total loadable section size.
thanks,
David
             .text    .eh_frame  Total_binary
vortex-move  440252   40796      584066
vortex-push  415436   57452      5759
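The deltas for the two vortex rows can be checked with a few lines of C. This is only a sketch: it assumes the first two numeric columns are .text and .eh_frame, it leaves out Total_binary because the vortex-push value is cut off in the archive, and the struct and function names are made up for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Section sizes in bytes for one build of a benchmark binary. */
struct section_sizes {
    long text;      /* first numeric column, read here as .text */
    long eh_frame;  /* second numeric column, read here as .eh_frame */
};

/* Values copied from the two vortex rows of the table. */
static const struct section_sizes vortex_move = { 440252, 40796 };
static const struct section_sizes vortex_push = { 415436, 57452 };

/* Signed size change of .text when switching from base to other. */
long text_delta(struct section_sizes base, struct section_sizes other)
{
    return other.text - base.text;
}

/* Signed size change of .eh_frame when switching from base to other. */
long eh_frame_delta(struct section_sizes base, struct section_sizes other)
{
    return other.eh_frame - base.eh_frame;
}

/* Net change over both sections; negative means the build got smaller. */
long combined_delta(struct section_sizes base, struct section_sizes other)
{
    return text_delta(base, other) + eh_frame_delta(base, other);
}

/* Prints the three deltas and checks that the net change is a reduction. */
void report_vortex(void)
{
    printf(".text:     %+ld bytes\n", text_delta(vortex_move, vortex_push));
    printf(".eh_frame: %+ld bytes\n", eh_frame_delta(vortex_move, vortex_push));
    printf("combined:  %+ld bytes\n", combined_delta(vortex_move, vortex_push));
    assert(combined_delta(vortex_move, vortex_push) < 0);
}
```

For these rows .text shrinks by 24816 bytes and .eh_frame grows by 16656 bytes, so the two sections together shrink by 8160 bytes. The CFI growth is real but smaller than the code-size saving.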
On Tue, Dec 11, 2012 at 1:49 AM, Richard Biener
wrote:
> On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump wrote:
>> On Dec 10, 2012, at 12:42 PM, Xinliang David Li wrote:
>>> I have not measured the CFI size impact -- but conceivably it should
>>> be larger -- which is unfortunate.
>>
>> Code speed
On Mon, Dec 10, 2012 at 10:07 PM, Mike Stump wrote:
> On Dec 10, 2012, at 12:42 PM, Xinliang David Li wrote:
>> I have not measured the CFI size impact -- but conceivably it should
>> be larger -- which is unfortunate.
>
> Code speed and size are preferable to optimizing dwarf size… :-) I'd let
On Dec 10, 2012, at 12:42 PM, Xinliang David Li wrote:
> I have not measured the CFI size impact -- but conceivably it should
> be larger -- which is unfortunate.
Code speed and size are preferable to optimizing dwarf size… :-) I'd let
dwarf 5 fix it!
I have not measured the CFI size impact -- but conceivably it should
be larger -- which is unfortunate.
David
On Mon, Dec 10, 2012 at 1:23 AM, Richard Biener
wrote:
> On Sun, Dec 9, 2012 at 2:50 PM, Uros Bizjak wrote:
>> Hello!
>>
>>> I noticed in prologue/epilogue, GCC prefers to use MOVs foll
On Sun, Dec 9, 2012 at 2:50 PM, Uros Bizjak wrote:
> Hello!
>
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference for
>> MOVs is good for old CPU micro-architectures (before pentium-4,
>> K10), because i
s/Eanble/Enable/
Thanks,
Dmitry
2012/12/9 Uros Bizjak :
> Hello!
>
>> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
>> SP adjustment instead of a sequence of pushes/pops. The preference for
>> MOVs is good for old CPU micro-architectures (before pentium-4,
>> K10), be
Hello!
> I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
> SP adjustment instead of a sequence of pushes/pops. The preference for
> MOVs is good for old CPU micro-architectures (before pentium-4,
> K10), because it breaks the data dependency. In modern
> micro-architectu
I noticed in prologue/epilogue, GCC prefers to use MOVs followed by a
SP adjustment instead of a sequence of pushes/pops. The preference for
MOVs is good for old CPU micro-architectures (before pentium-4,
K10), because it breaks the data dependency. In modern
micro-architecture, push/pop is im