Hi Anatol,

Thanks for the update. I'll take a look a bit later, but the performance
difference looks quite good now.

Dmitry.

On Wed, Oct 8, 2014 at 11:26 AM, Anatol Belski <anatol....@belski.net>
wrote:

> Hi Dmitry,
>
> On Mon, October 6, 2014 09:01, Anatol Belski wrote:
> > On Sun, October 5, 2014 21:32, Anatol Belski wrote:
> >
> >> Hi Dmitry,
> >>
> >> On Wed, October 1, 2014 08:01, Dmitry Stogov wrote:
> >>
> >>> Hi Anatol,
> >>>
> >>> I know, TSRM uses TLS APIs internally.
> >>>
> >>> In my opinion, the simplest (and probably efficient) way to get rid
> >>> of the TSRMLS_DC arguments and TSRMLS_FETCH calls would be to introduce
> >>> a global thread-specific variable.
> >>>
> >>> __thread void ***tsrm_ls;
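
To make the idea concrete, here is a minimal sketch of a thread-specific global replacing an explicitly passed context argument. It assumes GCC or Clang (the __thread keyword); demo_globals, demo_ls and the function names are hypothetical and are not part of TSRM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread globals; not the real TSRM layout. */
typedef struct { int request_count; } demo_globals;

/* One pointer per thread, visible to every function in this binary
 * (GCC/Clang syntax; an assumption, MSVC spells this differently). */
static __thread demo_globals *demo_ls = NULL;

/* Bind the calling thread to its own globals block. */
void demo_bind(demo_globals *g) { demo_ls = g; }

/* Old style: the context travels through every call as an extra argument. */
int bump_with_arg(demo_globals *ls) { return ++ls->request_count; }

/* New style: no extra argument, the callee reads the thread-local pointer. */
int bump_with_tls(void)
{
    assert(demo_ls != NULL); /* the thread must have been bound first */
    return ++demo_ls->request_count;
}
```

Both functions update the same block once the thread is bound; the second one simply does not need the argument.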
> >>>
> >>> As I understand it, this won't work on Windows anyway, because the Windows
> >>> linker is not smart enough to use TLS variables across different DLLs.
> >>> Maybe it's possible to have a local thread-specific copy of tsrm_ls for
> >>> each DLL, but then we would have to keep them consistent...
> >>>
> >>> Sorry, I can't give you any advice, and can't spend a lot of time on
> >>> this topic.
> >>>
> >>> Maybe a description of TLS internals on ELF systems would give you
> >>> some ideas.
> >>>
> >>> http://www.akkadia.org/drepper/tls.pdf
> >>>
> >>> Thanks. Dmitry.
> >>>
> >> I've reworked this patch to use one pointer per shared unit. Please
> >> see here:
> >> http://git.php.net/?p=php-src.git;a=commitdiff;h=76081df168829a5cc0409fac47c217d4927ec6f6
> >> (though this is just the first commit in the series). Afterwards I adapted
> >> ext/standard and also converted ext/sockets as an example, because
> >> it's usually compiled shared.
> >>
> >> With this change I see much better performance: the difference is in the
> >> 50-100 ms range compared to the master TS build. Some individual tests in
> >> bench.php even show a better result.
> >>
> >> However, this is not a global __thread variable but one local to every
> >> shared unit. That is, tsrm_ls has to be declared in every .so, .dll
> >> or .exe and updated on request. For now I've put the update code in MINIT
> >> and into the first ctor called (zmm is the one in php7ts.dll). The
> >> ctor seems to be the only reliable place (but maybe I'm wrong); although
> >> it'll be called for every request instead of once per thread, that won't
> >> be very bad.
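
A rough sketch of the per-unit scheme described here, assuming GCC or Clang for __thread. Two cached pointers in one file stand in for two separate shared units; demo_ctx and every function name below are hypothetical, and demo_fetch_ctx only imitates the authoritative TSRM lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread context; the real one is TSRM's void ***tsrm_ls. */
typedef struct { int id; } demo_ctx;

/* Stand-in for the authoritative per-thread lookup (an assumption). */
static __thread demo_ctx *authoritative_ctx = NULL;
void demo_set_ctx(demo_ctx *c) { authoritative_ctx = c; }
demo_ctx *demo_fetch_ctx(void) { return authoritative_ctx; }

/* Each shared unit (.exe, .so, .dll) would hold its own cached copy.
 * Two copies in one file simulate two separate units here. */
static __thread demo_ctx *core_cache = NULL;
static __thread demo_ctx *ext_cache = NULL;

/* Refresh hooks, meant to run at a reliable point such as module init
 * or the first constructor called for a request. */
void core_update_cache(void) { core_cache = demo_fetch_ctx(); }
void ext_update_cache(void)  { ext_cache  = demo_fetch_ctx(); }

/* Hot-path accessors read the unit-local copy without a fetch call. */
int core_ctx_id(void) { assert(core_cache != NULL); return core_cache->id; }
int ext_ctx_id(void)  { assert(ext_cache != NULL);  return ext_cache->id; }
```

The cost is that every unit must run its refresh hook before its first access; a unit that skips the hook keeps a stale or NULL pointer, which is the consistency concern raised earlier in the thread.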
> >>
> >> I'd suggest going this way so we have the same flow everywhere.
> >>
> The perf issue is fixed now. So far only the core is converted, but here are
> the Zend/bench.php results on 64-bit:
>
> master ts linux
>
> simple             0.158
> simplecall         0.050
> simpleucall        0.148
> simpleudcall       0.151
> mandel             0.310
> mandel2            0.337
> ackermann(7)       0.088
> ary(50000)         0.010
> ary2(50000)        0.009
> ary3(2000)         0.154
> fibo(30)           0.285
> hash1(50000)       0.029
> hash2(500)         0.023
> heapsort(20000)    0.072
> matrix(20)         0.082
> nestedloop(12)     0.204
> sieve(30)          0.062
> strcat(200000)     0.014
> ------------------------
> Total              2.185
>
>
> native-tls linux
>
> simple             0.072
> simplecall         0.036
> simpleucall        0.163
> simpleudcall       0.169
> mandel             0.297
> mandel2            0.354
> ackermann(7)       0.123
> ary(50000)         0.010
> ary2(50000)        0.009
> ary3(2000)         0.158
> fibo(30)           0.396
> hash1(50000)       0.030
> hash2(500)         0.024
> heapsort(20000)    0.072
> matrix(20)         0.069
> nestedloop(12)     0.130
> sieve(30)          0.054
> strcat(200000)     0.011
> ------------------------
> Total              2.178
>
>
> master ts windows
>
> simple             0.100
> simplecall         0.048
> simpleucall        0.146
> simpleudcall       0.120
> mandel             0.292
> mandel2            0.364
> ackermann(7)       0.091
> ary(50000)         0.009
> ary2(50000)        0.008
> ary3(2000)         0.133
> fibo(30)           0.238
> hash1(50000)       0.025
> hash2(500)         0.020
> heapsort(20000)    0.076
> matrix(20)         0.069
> nestedloop(12)     0.168
> sieve(30)          0.048
> strcat(200000)     0.011
> ------------------------
> Total              1.965
>
>
> native-tls windows
>
> simple             0.100
> simplecall         0.050
> simpleucall        0.108
> simpleudcall       0.110
> mandel             0.292
> mandel2            0.347
> ackermann(7)       0.097
> ary(50000)         0.009
> ary2(50000)        0.008
> ary3(2000)         0.140
> fibo(30)           0.280
> hash1(50000)       0.025
> hash2(500)         0.021
> heapsort(20000)    0.075
> matrix(20)         0.072
> nestedloop(12)     0.176
> sieve(30)          0.048
> strcat(200000)     0.010
> ------------------------
> Total              1.969
>
>
> There is still some room for improvement (for instance the fibo results),
> but the overall result shows at least the same performance now. What do you
> think, guys?
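
For comparing the runs, a small helper like the following can express each native-tls figure relative to master. The function name is hypothetical; the inputs are the numbers quoted above.

```c
#include <assert.h>

/* Relative change of `candidate` against `baseline`, in percent.
 * A negative value means the candidate run was faster. */
double percent_change(double baseline, double candidate)
{
    assert(baseline > 0.0); /* avoid division by zero */
    return (candidate - baseline) / baseline * 100.0;
}
```

With the totals quoted above this gives roughly -0.3% on Linux (2.185 vs 2.178) and +0.2% on Windows (1.965 vs 1.969), while fibo(30) on Linux comes out at about +39% (0.285 vs 0.396).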
>
> Regards
>
> Anatol
>
>
