I can't remember whether I ran any special benchmarks other than bench.php when I introduced the fast math functions. At that time, I rearranged the code to allow inlining of the most probable paths and added assembler code to catch overflow (C can't do it in an optimal way). As I remember, bench.php showed some visible improvement.
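For readers following along, here is a minimal C-only sketch of the kind of increment fast path being discussed: increment while the value is below LONG_MAX, otherwise promote the value to a double. It is an illustration under assumptions, not the code in Zend/zend_operators.h; the sketch_val struct, the sketch_type enum and sketch_increment() are hypothetical names, and the struct layout is not the real zval layout.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for a zval; NOT the real Zend layout. */
enum sketch_type { SKETCH_LONG = 1, SKETCH_DOUBLE = 2 };

typedef struct {
    union {
        long lval;
        double dval;
    } value;
    enum sketch_type type;
} sketch_val;

/* Increment in place. On overflow, promote to double instead of wrapping. */
static void sketch_increment(sketch_val *v)
{
    assert(v != NULL);

    if (v->type == SKETCH_LONG) {
        if (v->value.lval != LONG_MAX) {
            /* Fast path: the increment cannot overflow. */
            v->value.lval++;
        } else {
            /* Slow path: store LONG_MAX + 1 as a double and retag the value. */
            v->value.dval = (double)LONG_MAX + 1.0;
            v->type = SKETCH_DOUBLE;
        }
    } else {
        /* Already a double: plain floating-point increment. */
        v->value.dval += 1.0;
    }
}
```

The compare against LONG_MAX is the extra instruction that the flag-based assembler version avoids; whether a compiler folds it away depends on the compiler and target.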
Even increment/decrement saves one CPU instruction on the fast path:

inc (%ecx)
jno FAST_PATH
...
FAST_PATH:

instead of

cmp (%ecx), $0x7fffffff
je SLOW_PATH
inc (%ecx)
FAST_PATH:

However, I'm not sure whether this saved instruction makes any visible speed difference by itself.

Thanks. Dmitry.

On Mon, Feb 4, 2013 at 2:38 PM, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote:

> Hi Dimitry,
>
> The main problem I have with this code is that most of it (the double
> handling) is outside the hot path, and that it is riddled with
> hardcoded constants, struct offsets etc. However, if it works then I
> am not necessarily in favour of making changes to it.
>
> So can you explain a little bit which benchmarks you used to prove
> that the inline assembly is faster than C? Especially in the
> increment/decrement cases, there is no real overflow detection
> necessary other than comparing with LONG_MIN/LONG_MAX, so I would
> expect the compiler to generate fairly optimal code in these cases.
>
> I am not trying to challenge these decisions, mind you. I am trying to
> decide whether ARM will require similar handling as x86 to obtain
> optimal performance.
>
> Thanks,
> Ard.
>
>
> On 4 February 2013 11:32, Dmitry Stogov <dmi...@zend.com> wrote:
> > Hi Ard,
> >
> > Actually with your patch the fast_increment_function() is going to be
> > compiled into something like this
> >
> > incl (%ecx)
> > seto %al
> > test %al,%al
> > jz .FLOAT
> > .END:
> > ...
> > .FLOAT:
> > movl $0x0, (%ecx)
> > movl $0x41e00000, 0x4(%ecx)
> > movb $0x2,0xc(%ecx)
> > jmp .END
> >
> > while before the patch it would be
> >
> > incl (%ecx)
> > jno .END
> > .FLOAT:
> > movl $0x0, (%ecx)
> > movl $0x41e00000, 0x4(%ecx)
> > movb $0x2,0xc(%ecx)
> > .END:
> > ...
> >
> > So the only advantage of your code is the eliminated static branch
> > misprediction, at the cost of two additional CPU instructions.
> > However, the CPU branch predictor should make this advantage unimportant.
> >
> > Thanks. Dmitry.
> >
> >
> > On Fri, Jan 18, 2013 at 10:08 PM, Ard Biesheuvel <ard.biesheu...@linaro.org>
> > wrote:
> >>
> >> Hello,
> >>
> >> Again, apologies for prematurely declaring someone else's code 'crap'.
> >> There are no bugs in the inline x86 assembler in Zend/zend_operators.h, as
> >> far as I can tell, only two kinds of issues that I still think should be
> >> addressed.
> >>
> >> First of all, from a maintenance pov, having field offsets (like the
> >> offset of zval.type) and constants (like $0x2 for IS_DOUBLE) hard coded
> >> inside the instructions is a bad idea.
> >>
> >> The other issue is the branching and the floating point instructions. The
> >> inline assembler addresses the common case, but also adds a bunch of
> >> instructions that address the corner case, and some branches to jump over
> >> them. As I indicated in my previous email, branching is relatively costly on
> >> a modern CPU with deep pipelines and having a bunch of FPU instructions in
> >> there that hardly ever get executed doesn't help either.
> >>
> >> The primary reason for having inline assembler at all is the ability to
> >> detect overflow. This mainly applies to multiplication, as in that case,
> >> detecting overflow in C code is much harder compared to reading a condition
> >> flag in the CPU (hence the various accelerated implementations in
> >> zend_signed_multiply.h). However, detecting overflow in addition/subtraction
> >> implemented in C is much easier, as the code in zend_operators.h proves:
> >> just a matter of checking the sign bits, or doing a simple compare with
> >> LONG_MIN/LONG_MAX.
> >>
> >> Therefore, I would be interested in finding out which benchmark was used
> >> to make the case for having these accelerated implementations in the first
> >> place. The differences in performance between various implementations are
> >> very small in the tests I have done.
> >>
> >> As for the code style/maintainability, I propose to apply the attached
> >> patch. The performance is on par, as far as I can tell, but it is arguably
> >> better code. I will also hook in the ARM versions once I manage to prove
> >> that the performance is affected favourably by them.
> >>
> >> Regards,
> >> Ard.
> >>
> >>
> >> Before
> >> -------
> >>
> >> $ time php -r 'for ($i = 0; $i < 0x7fffffff; $i++);'
> >>
> >> real 0m56.910s
> >> user 0m56.876s
> >> sys 0m0.008s
> >>
> >>
> >> $ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i--);'
> >>
> >> real 1m34.576s
> >> user 1m34.518s
> >> sys 0m0.020s
> >>
> >>
> >> $ time php -r 'for ($i = 0; $i < 0x7fffffff; $i += 3);'
> >>
> >> real 0m21.494s
> >> user 0m21.473s
> >> sys 0m0.008s
> >>
> >>
> >> $ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i -= 3);'
> >>
> >> real 0m19.879s
> >> user 0m19.865s
> >> sys 0m0.004s
> >>
> >>
> >> After
> >> -----
> >>
> >> $ time php -r 'for ($i = 0; $i < 0x7fffffff; $i++);'
> >>
> >> real 0m56.687s
> >> user 0m56.656s
> >> sys 0m0.004s
> >>
> >>
> >> $ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i--);'
> >>
> >> real 1m28.124s
> >> user 1m28.082s
> >> sys 0m0.004s
> >>
> >>
> >> $ time php -r 'for ($i = 0; $i < 0x7fffffff; $i += 3);'
> >>
> >> real 0m20.561s
> >> user 0m20.545s
> >> sys 0m0.004s
> >>
> >>
> >> $ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i -= 3);'
> >>
> >> real 0m20.524s
> >> user 0m20.509s
> >> sys 0m0.004s
> >>
> >>
> >> --
> >> PHP Internals - PHP Runtime Development Mailing List
> >> To unsubscribe, visit: http://www.php.net/unsub.php
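
For reference, the two portable C checks mentioned in the quoted message (checking the sign bits, or comparing against LONG_MIN/LONG_MAX) can be sketched roughly as below. This is a hedged illustration only, not the code from zend_operators.h; every function name here is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Sign-bit check: a + b overflows only if a and b share a sign and the
 * wrapped sum has the opposite sign. The sum is formed in unsigned
 * arithmetic, where wraparound is well defined. */
static int sketch_add_overflows_signbit(long a, long b)
{
    unsigned long ua = (unsigned long)a;
    unsigned long ub = (unsigned long)b;
    unsigned long sum = ua + ub;
    unsigned long sign_bit = ~(ULONG_MAX >> 1);

    return ((ua ^ sum) & (ub ^ sum) & sign_bit) != 0;
}

/* Range check: compare against LONG_MAX / LONG_MIN before adding. */
static int sketch_add_overflows_range(long a, long b)
{
    if (b > 0) {
        return a > LONG_MAX - b;
    }
    if (b < 0) {
        return a < LONG_MIN - b;
    }
    return 0;
}

/* Checked add: returns 1 and stores a + b on success,
 * returns 0 and leaves *result untouched on overflow. */
static int sketch_checked_add(long a, long b, long *result)
{
    assert(result != NULL);

    if (sketch_add_overflows_range(a, b)) {
        return 0;
    }
    *result = a + b;
    return 1;
}
```

A caller would take the double-precision slow path whenever sketch_checked_add() returns 0. Both checks give the same answer; which one compiles to tighter code is exactly the kind of question the timings above are meant to settle.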