Hi Anatol,

Thanks for the update. I'll take a look a bit later, but the performance
difference looks quite good now.

Dmitry.

On Wed, Oct 8, 2014 at 11:26 AM, Anatol Belski <anatol....@belski.net> wrote:

> Moin Dmitry,
>
> On Mon, October 6, 2014 09:01, Anatol Belski wrote:
> > On Sun, October 5, 2014 21:32, Anatol Belski wrote:
> >
> >> Hi Dmitry,
> >>
> >> On Wed, October 1, 2014 08:01, Dmitry Stogov wrote:
> >>
> >>> Hi Anatol,
> >>>
> >>> I know, TSRM uses TLS APIs internally.
> >>>
> >>> In my opinion, the simplest (and probably most efficient) way to get
> >>> rid of TSRMLS_DC arguments and TSRMLS_FETCH calls would be to
> >>> introduce a global thread-specific variable.
> >>>
> >>> __thread void ***tsrm_ls;
> >>>
> >>> As I understand it, this won't work on Windows anyway, because the
> >>> Windows linker is not smart enough to use TLS variables across
> >>> different DLLs. Maybe it's possible to have a local thread-specific
> >>> copy of tsrm_ls for each DLL, but then we would have to keep them
> >>> consistent...
> >>>
> >>> Sorry, I can't give you any advice, and can't spend a lot of time on
> >>> this topic.
> >>>
> >>> Maybe a description of TLS internals on ELF systems would give you
> >>> some ideas.
> >>>
> >>> http://www.akkadia.org/drepper/tls.pdf
> >>>
> >>> Thanks. Dmitry.
> >>>
> >> I've reworked this patch to take one pointer per shared unit. Please
> >> see here
> >> http://git.php.net/?p=php-src.git;a=commitdiff;h=76081df168829a5cc0409fac47c217d4927ec6f6
> >> (though this was just the first in the series). Afterwards I adapted
> >> ext/standard and also converted ext/sockets as an exemplary item,
> >> because it's usually compiled shared.
> >>
> >> With this change I see much better performance - the diff is in the
> >> 100-50ms range compared to the master TS build. Particular positions
> >> in bench.php even show somewhat better results.
> >>
> >> However, this is not a global __thread variable, but one local to
> >> every shared unit. That is, tsrm_ls will have to be declared in every
> >> so, dll or exe and updated on request. For now I've put the update
> >> code in MINIT and into the first ctor called (zmm is the one in
> >> php7ts.dll). The ctor seems to be the only reliable place (but maybe
> >> I'm wrong); although it'll be called for every request instead of
> >> per thread, that won't be very bad.
> >>
> >> I'd suggest going this way so we have the same flow everywhere.
> >>
>
> The perf issue is fixed now. So far only the core is converted, but here
> are the Zend/bench.php results on 64 bit:
>
> master ts linux
>
> simple            0.158
> simplecall        0.050
> simpleucall       0.148
> simpleudcall      0.151
> mandel            0.310
> mandel2           0.337
> ackermann(7)      0.088
> ary(50000)        0.010
> ary2(50000)       0.009
> ary3(2000)        0.154
> fibo(30)          0.285
> hash1(50000)      0.029
> hash2(500)        0.023
> heapsort(20000)   0.072
> matrix(20)        0.082
> nestedloop(12)    0.204
> sieve(30)         0.062
> strcat(200000)    0.014
> ------------------------
> Total             2.185
>
>
> native-tls linux
>
> simple            0.072
> simplecall        0.036
> simpleucall       0.163
> simpleudcall      0.169
> mandel            0.297
> mandel2           0.354
> ackermann(7)      0.123
> ary(50000)        0.010
> ary2(50000)       0.009
> ary3(2000)        0.158
> fibo(30)          0.396
> hash1(50000)      0.030
> hash2(500)        0.024
> heapsort(20000)   0.072
> matrix(20)        0.069
> nestedloop(12)    0.130
> sieve(30)         0.054
> strcat(200000)    0.011
> ------------------------
> Total             2.178
>
>
> master ts windows
>
> simple            0.100
> simplecall        0.048
> simpleucall       0.146
> simpleudcall      0.120
> mandel            0.292
> mandel2           0.364
> ackermann(7)      0.091
> ary(50000)        0.009
> ary2(50000)       0.008
> ary3(2000)        0.133
> fibo(30)          0.238
> hash1(50000)      0.025
> hash2(500)        0.020
> heapsort(20000)   0.076
> matrix(20)        0.069
> nestedloop(12)    0.168
> sieve(30)         0.048
> strcat(200000)    0.011
> ------------------------
> Total             1.965
>
>
> native-tls windows
>
> simple            0.100
> simplecall        0.050
> simpleucall       0.108
> simpleudcall      0.110
> mandel            0.292
> mandel2           0.347
> ackermann(7)      0.097
> ary(50000)        0.009
> ary2(50000)       0.008
> ary3(2000)        0.140
> fibo(30)          0.280
> hash1(50000)      0.025
> hash2(500)        0.021
> heapsort(20000)   0.075
> matrix(20)        0.072
> nestedloop(12)    0.176
> sieve(30)         0.048
> strcat(200000)    0.010
> ------------------------
> Total             1.969
>
>
> There is still some room for improvement (for instance the fibo results),
> but the overall result shows at least the same perf now. What do you
> think, guys?
>
> Regards
>
> Anatol
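
For readers following the thread, the "one thread-local pointer per shared unit, refreshed at a known entry point" idea discussed above could look roughly like the C sketch below. This is only an illustration under stated assumptions, not the code from the patch: every demo_* name is hypothetical, the "slow fetch" merely stands in for whatever TSRM does through the platform TLS API, and the refresh call stands in for the MINIT/ctor hook mentioned in the quoted mail.

```c
#include <assert.h>
#include <stddef.h>

/* Compiler-specific thread-local storage keyword (sketch only). */
#if defined(_MSC_VER)
# define DEMO_TLS __declspec(thread)
#else
# define DEMO_TLS __thread
#endif

/* Hypothetical stand-in for the per-thread resource table. */
typedef struct demo_tsrm_table {
    int dummy;
} demo_tsrm_table;

/* Pretend this storage is owned by the core library. */
static DEMO_TLS demo_tsrm_table demo_core_table;

/* Counts slow-path lookups in the current thread, for illustration. */
static DEMO_TLS int demo_slow_lookups;

/* Stand-in for the expensive lookup that a fetch on every call would do. */
static demo_tsrm_table *demo_slow_fetch(void)
{
    demo_slow_lookups++;
    return &demo_core_table;
}

/* Each shared unit (so, dll or exe) would declare its OWN copy of this
 * cache, since a single TLS variable is not shared across DLLs. */
static DEMO_TLS demo_tsrm_table *demo_module_cache = NULL;

/* Called from a reliable entry point (assumed here: module init or the
 * first constructor of a request) to refresh this unit's pointer. */
static void demo_module_cache_update(void)
{
    demo_module_cache = demo_slow_fetch();
}

/* Fast path: what code inside the unit uses instead of a fetch per call. */
static demo_tsrm_table *demo_module_ls(void)
{
    assert(demo_module_cache != NULL); /* update must have run first */
    return demo_module_cache;
}
```

After one call to demo_module_cache_update() in a thread, repeated demo_module_ls() calls in that thread return the cached pointer without touching the slow path again, which is the effect the benchmark comparison above is measuring at a much larger scale.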