I can't remember if I did any special benchmarks except for bench.php when
I introduced the fast math functions.
At that time, I rearranged the code to allow inlining of the most probable
paths and added assembler code to catch overflow (C can't do it in an
optimal way). As I remember, bench.php showed some visible improvement.

Even increment/decrement saves one CPU instruction on the fast path:

    inc (%ecx)
    jno FAST_PATH
    ...
FAST_PATH:

instead of

    cmpl $0x7fffffff, (%ecx)
    je SLOW_PATH
    inc (%ecx)
FAST_PATH:

However, I'm not sure if this saved instruction makes any visible speed
difference by itself.

Thanks. Dmitry.


On Mon, Feb 4, 2013 at 2:38 PM, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote:

> Hi Dmitry,
>
> The main problem I have with this code is that most of it (the double
> handling) is outside the hot path, and that it is riddled with
> hardcoded constants, struct offsets etc. However, if it works, then I
> am not necessarily in favour of making changes to it.
>
> So can you explain a little bit which benchmarks you used to prove
> that the inline assembly is faster than C? Especially in the
> increment/decrement cases, there is no real overflow detection
> necessary other than comparing with LONG_MIN/LONG_MAX, so I would
> expect the compiler to generate fairly optimal code in these cases.
>
> I am not trying to challenge these decisions, mind you. I am trying to
> decide whether ARM will require similar handling as x86 to obtain
> optimal performance.
>
> Thanks,
> Ard.
>
>
>
> On 4 February 2013 11:32, Dmitry Stogov <dmi...@zend.com> wrote:
> > Hi Ard,
> >
> > Actually, with your patch the fast_increment_function() is going to be
> > compiled into something like this:
> >
> >     incl (%ecx)
> >     seto %al
> >     test %al,%al
> >     jnz .FLOAT
> > .END:
> > ...
> > .FLOAT:
> >     movl $0x0, (%ecx)
> >     movl $0x41e00000, 0x4(%ecx)
> >     movb $0x2,0xc(%ecx)
> >     jmp .END
> >
> > while before the patch it would be
> >
> >     incl (%ecx)
> >     jno .END
> > .FLOAT:
> >     movl $0x0, (%ecx)
> >     movl $0x41e00000, 0x4(%ecx)
> >     movb $0x2,0xc(%ecx)
> > .END:
> > ...
> >
> > So the only advantage of your code is the eliminated static branch
> > misprediction, at the cost of two additional CPU instructions.
> > However, the CPU branch predictor should make this advantage unimportant.
> >
> > Thanks. Dmitry.
> >
> >
> > On Fri, Jan 18, 2013 at 10:08 PM, Ard Biesheuvel <ard.biesheu...@linaro.org>
> > wrote:
> >>
> >> Hello,
> >>
> >> Again, apologies for prematurely declaring someone else's code 'crap'.
> >> There are no bugs in the inline x86 assembler in Zend/zend_operators.h,
> >> as far as I can tell, only two kinds of issues that I still think
> >> should be addressed.
> >>
> >> First of all, from a maintenance pov, having field offsets (like the
> >> offset of zval.type) and constants (like $0x2 for IS_DOUBLE) hard coded
> >> inside the instructions is a bad idea.
> >>
> >> The other issue is the branching and the floating point instructions.
> >> The inline assembler addresses the common case, but also adds a bunch
> >> of instructions that address the corner case, and some branches to jump
> >> over them. As I indicated in my previous email, branching is relatively
> >> costly on a modern CPU with deep pipelines, and having a bunch of FPU
> >> instructions in there that hardly ever get executed doesn't help either.
> >>
> >> The primary reason for having inline assembler at all is the ability
> >> to detect overflow. This mainly applies to multiplication, as in that
> >> case, detecting overflow in C code is much harder compared to reading
> >> a condition flag in the CPU (hence the various accelerated
> >> implementations in zend_signed_multiply.h). However, detecting overflow
> >> in addition/subtraction implemented in C is much easier, as the code in
> >> zend_operators.h proves: it is just a matter of checking the sign bits,
> >> or doing a simple compare with LONG_MIN/LONG_MAX.
> >>
> >> Therefore, I would be interested in finding out which benchmark was
> >> used to make the case for having these accelerated implementations in
> >> the first place. The differences in performance between the various
> >> implementations are very small in the tests I have done.
> >>
> >> As for the code style/maintainability, I propose to apply the attached
> >> patch. The performance is on par, as far as I can tell, but it is
> >> arguably better code. I will also hook in the ARM versions once I
> >> manage to prove that the performance is affected favourably by them.
> >>
> >> Regards,
> >> Ard.
> >>
> >>
> >>
> >> Before
> >> ------
> >>
> >> $ time php -r 'for ($i = 0; $i < 0x7fffffff; $i++);'
> >>
> >> real    0m56.910s
> >> user    0m56.876s
> >> sys     0m0.008s
> >>
> >>
> >> $ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i--);'
> >>
> >> real    1m34.576s
> >> user    1m34.518s
> >> sys     0m0.020s
> >>
> >>
> >> $ time php -r 'for ($i = 0; $i < 0x7fffffff; $i += 3);'
> >>
> >> real    0m21.494s
> >> user    0m21.473s
> >> sys     0m0.008s
> >>
> >>
> >> $ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i -= 3);'
> >>
> >> real    0m19.879s
> >> user    0m19.865s
> >> sys     0m0.004s
> >>
> >>
> >> After
> >> -----
> >>
> >> $ time php -r 'for ($i = 0; $i < 0x7fffffff; $i++);'
> >>
> >> real    0m56.687s
> >> user    0m56.656s
> >> sys     0m0.004s
> >>
> >>
> >> $ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i--);'
> >>
> >> real    1m28.124s
> >> user    1m28.082s
> >> sys     0m0.004s
> >>
> >>
> >> $ time php -r 'for ($i = 0; $i < 0x7fffffff; $i += 3);'
> >>
> >> real    0m20.561s
> >> user    0m20.545s
> >> sys     0m0.004s
> >>
> >>
> >> $ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i -= 3);'
> >>
> >> real    0m20.524s
> >> user    0m20.509s
> >> sys     0m0.004s
> >>
> >>
> >> --
> >> PHP Internals - PHP Runtime Development Mailing List
> >> To unsubscribe, visit: http://www.php.net/unsub.php
> >
> >
>