Hi Dimitry,

The main problem I have with this code is that most of it (the double
handling) is outside the hot path, and that it is riddled with
hardcoded constants, struct offsets etc. However, if it works than I
am not necessarily in favour of making changes to it.

So can you explain a little bit which benchmarks you used to prove
that the inline assembly is faster than C? Especially in the
increment/decrement cases, there is no real overflow detection
necessary other than comparing with LONG_MIN/LONG_MAX, so I would
expect the compiler to generate fairly optimal code in these cases.

I am not trying to challenge these decisions, mind you. I am trying to
decide whether ARM will require similar handling as x86 to obtain
optimal performance.

Thanks,
Ard.



On 4 February 2013 11:32, Dmitry Stogov <dmi...@zend.com> wrote:
> Hi Ard,
>
> Actually with your patch the fast_increment_function() is going to be
> compile into something like this
>
>     incl (%ecx)
>     seto %al
>     test %al,%al
>     jz .FLOAT
> .END:
> ...
> .FLOAT:
>     movl $0x0, (%ecx)
>     movl $0x41e00000, 0x4(%ecx)
>     movb $0x2,0xc(%ecx)
>     jmp . END
>
> while before the patch it would
>
>     incl (%ecx)
>     jno .END
> .FLOATL
>     movl $0x0, (%ecx)
>     movl $0x41e00000, 0x4(%ecx)
>     movb $0x2,0xc(%ecx)
> .END:
> ...
>
> So the only advantage of your code is eliminated static branch misprediction
> in cost of two additional CPU instructions.
> However CPU branch predictor should make this advantage unimportant.
>
> Thanks. Dmitry.
>
>
> On Fri, Jan 18, 2013 at 10:08 PM, Ard Biesheuvel <ard.biesheu...@linaro.org>
> wrote:
>>
>> Hello,
>>
>> Again, apologies for prematurely declaring someone else's code 'crap'.
>> There are no bugs in the inline x86 assembler in Zend/zend_operators.h, as
>> far as I can tell, only two kinds of issues that I still think should be
>> addressed.
>>
>> First of all, from a maintenance pov, having field offsets (like the
>> offset of zval.type) and constants (like $0x2 for IS_DOUBLE) hard coded
>> inside the instructions is a bad idea.
>>
>> The other issue is the branching and the floating point instructions. The
>> inline assembler addresses the common case, but also adds a bunch of
>> instructions that address the corner case, and some branches to jump over
>> them. As I indicated in my previous email, branching is relatively costly on
>> a modern CPU with deep pipelines and having a bunch of FPU instructions in
>> there that hardly ever get executed doesn't help either.
>>
>> The primary reason for having inline assembler at all is the ability to
>> detect overflow. This mainly applies to multiplication, as in that case,
>> detecting overflow in C code is much harder compared to reading a condition
>> flag in the CPU (hence the various accelerated implementations in
>> zend_signed_multiply.h). However, detecting overflow in addition/subtraction
>> implemented in C is much easier, as the code in zend_operators.h proves:
>> just a matter of checking the sign bits, or doing a simple compare with
>> LONG_MIN/LONG_MAX.
>>
>> Therefore, I would be interested in finding out which benchmark was used
>> to make the case for having these accelerated implementations in the first
>> place. The differences in performance between various implementations are
>> very small in the tests I have done.
>>
>> As for the code style/maintainability, I propose to apply the attached
>> patch. The performance is on par, as far as I can tell, but it is arguably
>> better code. I will also hook in the ARM versions once I manage to prove
>> that the performance is affected favourably by them.
>>
>> Regards,
>> Ard.
>>
>>
>>
>> Before
>> -------
>>
>> $ time php -r 'for ($i = 0; $i < 0x7fffffff; $i++);'
>>
>> real    0m56.910s
>> user    0m56.876s
>> sys     0m0.008s
>>
>>
>> $ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i--);'
>>
>> real    1m34.576s
>> user    1m34.518s
>> sys     0m0.020s
>>
>>
>> $ time php -r 'for ($i = 0; $i < 0x7fffffff; $i += 3);'
>>
>> real    0m21.494s
>> user    0m21.473s
>> sys     0m0.008s
>>
>>
>> $ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i -= 3);'
>>
>> real    0m19.879s
>> user    0m19.865s
>> sys     0m0.004s
>>
>>
>> After
>> -----
>>
>> $ time php -r 'for ($i = 0; $i < 0x7fffffff; $i++);'
>>
>> real    0m56.687s
>> user    0m56.656s
>> sys     0m0.004s
>>
>>
>> $ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i--);'
>>
>> real    1m28.124s
>> user    1m28.082s
>> sys     0m0.004s
>>
>>
>> $ time php -r 'for ($i = 0; $i < 0x7fffffff; $i += 3);'
>>
>> real    0m20.561s
>> user    0m20.545s
>> sys     0m0.004s
>>
>>
>> $ time php -r 'for ($i = 0x7fffffff; $i >= 0; $i -= 3);'
>>
>> real    0m20.524s
>> user    0m20.509s
>> sys     0m0.004s
>>
>>
>> --
>> PHP Internals - PHP Runtime Development Mailing List
>> To unsubscribe, visit: http://www.php.net/unsub.php
>
>

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php

Reply via email to