Hi Anatol,

At first, I still saw the same big difference on Linux:
bench.php ZTS - 1.340 sec, native TLS - 1.785 sec.
As I understand it, this must be caused by incomplete changes in the build
scripts related to ZEND_ENABLE_STATIC_TSRMLS_CACHE. Right?

If I understand it properly, the main PHP binary should be compiled with
-DZEND_ENABLE_STATIC_TSRMLS_CACHE=1 and shared extensions without it. This
should lead to quite fast code in the main PHP binary and statically linked
extensions, but to slow code in shared extensions. Right?
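
To make the distinction concrete, here is a minimal sketch of the idea as I read it, not the actual TSRM code. All DEMO_/demo_ names are hypothetical, and it assumes a compiler that supports __thread (GCC or Clang). With the flag set, globals are reached through a pointer cached in a TLS variable local to the binary; without it, every access goes through a lookup function.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-thread globals struct (not the real TSRM layout). */
typedef struct { int request_count; } demo_globals;

/* Per-thread storage owned by the "core". */
static __thread demo_globals demo_storage;

/* Slow path: a function call on every access to the globals. */
demo_globals *demo_get_globals(void) { return &demo_storage; }

/* Hypothetical flag, playing the role a -D...=1 build option would play. */
#ifndef DEMO_ENABLE_STATIC_CACHE
# define DEMO_ENABLE_STATIC_CACHE 0
#endif

#if DEMO_ENABLE_STATIC_CACHE
/* Fast path: pointer cached in a TLS variable local to this binary. */
static __thread demo_globals *demo_cache = NULL;
# define DEMO_CACHE_UPDATE() (demo_cache = demo_get_globals())
# define DEMO_G(field)       (demo_cache->field)
#else
# define DEMO_CACHE_UPDATE() ((void)0)
# define DEMO_G(field)       (demo_get_globals()->field)
#endif

/* Code using the globals looks the same in both configurations. */
int demo_handle_request(void)
{
    DEMO_CACHE_UPDATE();          /* refresh the cached pointer, if any */
    DEMO_G(request_count)++;
    return DEMO_G(request_count);
}
```

Compiling with -DDEMO_ENABLE_STATIC_CACHE=1 selects the cached path; the calling code does not change.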

I built PHP this way with all extensions linked statically. Now I see a
small slowdown on bench.php (however, according to callgrind it executes
fewer instructions and should be faster). WordPress became 2% faster.

So the patch becomes interesting. :)
However, many distributions prefer shared extensions, and it would be great
to invent some trick to make them fast too.

I would also prefer to keep the semantic patch small and not delete all
FETCH_TSRM() in thousands of places (at this point).
Replacing the macro in one place must be easier.
It's not a problem to remove them in a second step if the PoC really
works.
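
A rough sketch of what I mean by replacing the macro in one place. The DEMO_/demo_ names are hypothetical and only illustrate the pattern; __thread support (GCC or Clang) is assumed. The call sites keep their fetch line, and only the macro definition decides whether it does anything.

```c
/* Thread-local state, standing in for the real per-thread data. */
__thread int demo_slot;

/* Old-style lookup: returns a pointer to the current thread's state. */
int *demo_lookup(void) { return &demo_slot; }

#ifdef DEMO_NATIVE_TLS
/* New behaviour: the fetch macro expands to nothing and the
 * accessor reads the TLS variable directly. */
# define DEMO_FETCH()
# define DEMO_SLOT     (demo_slot)
#else
/* Old behaviour: the fetch macro declares a local pointer. */
# define DEMO_FETCH()  int *demo_ls = demo_lookup()
# define DEMO_SLOT     (*demo_ls)
#endif

/* A call site: unchanged whichever definition is active. */
int demo_increment(void)
{
    DEMO_FETCH();
    DEMO_SLOT += 1;
    return DEMO_SLOT;
}
```

Once this is proven to work, the now-empty fetch lines can be deleted from the call sites in a separate, purely mechanical patch.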

Thanks. Dmitry.

On Wed, Oct 8, 2014 at 12:18 PM, Dmitry Stogov <dmi...@zend.com> wrote:

> Hi Anatol,
>
> Thanks for the update. I'll take a look a bit later, but the performance
> difference looks quite good now.
>
> Dmitry.
>
> On Wed, Oct 8, 2014 at 11:26 AM, Anatol Belski <anatol....@belski.net>
> wrote:
>
>> Moin Dmitry,
>>
>> On Mon, October 6, 2014 09:01, Anatol Belski wrote:
>> > On Sun, October 5, 2014 21:32, Anatol Belski wrote:
>> >
>> >> Hi Dmitry,
>> >>
>> >>
>> >>
>> >> On Wed, October 1, 2014 08:01, Dmitry Stogov wrote:
>> >>
>> >>
>> >>> Hi Anatol,
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> I know, TSRM uses TLS APIs internally.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> In my opinion, the simplest (and probably efficient) way to get rid
>> >>> of TSRMLS_DC arguments and TSRMLS_FETCH calls, would be introducing a
>> >>> global thread specific variable.
>> >>>
>> >>> __thread void ***tsrm_ls;
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> As I understand it, it won't work on Windows anyway, because the
>> >>> Windows linker is not smart enough to use TLS variables across
>> >>> different DLLs. Maybe it's possible to have a local thread-specific
>> >>> copy of tsrm_ls for each DLL, but then we would have to keep them
>> >>> consistent...
>> >>>
>> >>> Sorry, I can't give you any advice, and can't spend a lot of time on
>> >>> this topic.
>> >>>
>> >>> Maybe the description of TLS internals on ELF systems would give you
>> >>> some ideas.
>> >>>
>> >>> http://www.akkadia.org/drepper/tls.pdf
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> Thanks. Dmitry.
>> >>>
>> >>>
>> >>>
>> >>>
>> >> I've reworked this patch to take a pointer per one shared unit. Please
>> >> see here
>> >> http://git.php.net/?p=php-src.git;a=commitdiff;h=76081df168829a5cc0409fac47c217d4927ec6f6
>> >> (though this was just the first in the series). Afterwards I've adapted
>> >> ext/standard and also converted ext/sockets as an exemplary item,
>> >> because it's usually compiled shared.
>> >>
>> >> With this change I experience much better performance - the difference
>> >> is in the 50-100 ms range compared to the master TS build. Particular
>> >> positions in bench.php even show somewhat better results.
>> >>
>> >> However, this is not a global __thread variable, but one local to
>> >> every shared unit. That is, tsrm_ls will have to be declared in every
>> >> so, dll or exe and updated on request. For now I've put the update code
>> >> in MINIT and into the first ctor called (zmm is the one in the
>> >> php7ts.dll). The ctor seems to be the only reliable place (but maybe
>> >> I'm wrong); although it'll be called for every request instead of per
>> >> thread, that won't be very bad.
>> >>
>> >>
>> >> I'd suggest to go this way so we have the same flow everywhere.
>> >>
>> >>
>> >>
>> The perf issue is fixed now. So far only the core is converted, but here
>> are the Zend/bench.php results on 64 bit (times in seconds):
>>
>> master ts linux
>>
>> simple             0.158
>> simplecall         0.050
>> simpleucall        0.148
>> simpleudcall       0.151
>> mandel             0.310
>> mandel2            0.337
>> ackermann(7)       0.088
>> ary(50000)         0.010
>> ary2(50000)        0.009
>> ary3(2000)         0.154
>> fibo(30)           0.285
>> hash1(50000)       0.029
>> hash2(500)         0.023
>> heapsort(20000)    0.072
>> matrix(20)         0.082
>> nestedloop(12)     0.204
>> sieve(30)          0.062
>> strcat(200000)     0.014
>> ------------------------
>> Total              2.185
>>
>>
>> native-tls linux
>>
>> simple             0.072
>> simplecall         0.036
>> simpleucall        0.163
>> simpleudcall       0.169
>> mandel             0.297
>> mandel2            0.354
>> ackermann(7)       0.123
>> ary(50000)         0.010
>> ary2(50000)        0.009
>> ary3(2000)         0.158
>> fibo(30)           0.396
>> hash1(50000)       0.030
>> hash2(500)         0.024
>> heapsort(20000)    0.072
>> matrix(20)         0.069
>> nestedloop(12)     0.130
>> sieve(30)          0.054
>> strcat(200000)     0.011
>> ------------------------
>> Total              2.178
>>
>>
>> master ts windows
>>
>> simple             0.100
>> simplecall         0.048
>> simpleucall        0.146
>> simpleudcall       0.120
>> mandel             0.292
>> mandel2            0.364
>> ackermann(7)       0.091
>> ary(50000)         0.009
>> ary2(50000)        0.008
>> ary3(2000)         0.133
>> fibo(30)           0.238
>> hash1(50000)       0.025
>> hash2(500)         0.020
>> heapsort(20000)    0.076
>> matrix(20)         0.069
>> nestedloop(12)     0.168
>> sieve(30)          0.048
>> strcat(200000)     0.011
>> ------------------------
>> Total              1.965
>>
>>
>> native-tls windows
>>
>> simple             0.100
>> simplecall         0.050
>> simpleucall        0.108
>> simpleudcall       0.110
>> mandel             0.292
>> mandel2            0.347
>> ackermann(7)       0.097
>> ary(50000)         0.009
>> ary2(50000)        0.008
>> ary3(2000)         0.140
>> fibo(30)           0.280
>> hash1(50000)       0.025
>> hash2(500)         0.021
>> heapsort(20000)    0.075
>> matrix(20)         0.072
>> nestedloop(12)     0.176
>> sieve(30)          0.048
>> strcat(200000)     0.010
>> ------------------------
>> Total              1.969
>>
>>
>> There is still some room for improvement (for instance the fibo results),
>> but the overall result shows at least the same perf now. What do you
>> think, guys?
>>
>> Regards
>>
>> Anatol
>>
>>
>
