Hi Anatol,

At first, I still saw the same big difference on Linux:
bench.php ZTS - 1.340 sec, native TLS - 1.785 sec.

As I understand it, this must be caused by incomplete changes in the build
scripts related to ZEND_ENABLE_STATIC_TSRMLS_CACHE. Right?
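A minimal sketch of the idea behind a static cache, for illustration only. It assumes GCC/Clang `__thread`; every demo_* name is hypothetical and none of this is the actual TSRM code. It contrasts fetching the per-thread globals through a function call on each access with reading a module-local thread-specific pointer that is refreshed at a known point:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread data that TSRM manages. */
typedef struct {
    int request_count;
} demo_globals;

/* Owned by the "core": one instance per thread (GCC/Clang extension). */
static __thread demo_globals demo_core_globals;

/* Slow path: every access pays for a function call to find the globals,
 * roughly what a module without a static cache would have to do. */
demo_globals *demo_fetch_globals(void)
{
    return &demo_core_globals;
}

/* Fast path: the module keeps its own thread-local copy of the pointer. */
static __thread demo_globals *demo_cache = NULL;

/* The cached pointer must be refreshed at a well-defined point,
 * e.g. module startup or the start of a request (assumption). */
void demo_cache_update(void)
{
    demo_cache = demo_fetch_globals();
}

int demo_bump_slow(void)
{
    return ++demo_fetch_globals()->request_count;
}

int demo_bump_cached(void)
{
    assert(demo_cache != NULL); /* demo_cache_update() must run first */
    return ++demo_cache->request_count;
}
```

Both paths touch the same per-thread structure; the cached variant only skips the lookup call, which is why it matters where and when each module updates its own copy.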
If I get it right, the main PHP binary should be compiled with
-DZEND_ENABLE_STATIC_TSRMLS_CACHE=1 and shared extensions without it.
That should give quite fast code in the main PHP binary and in statically
linked extensions, but slow code in shared extensions. Right?

I built PHP this way, with all extensions linked statically.
Now I see a small slowdown on bench.php (however, according to callgrind it
executes fewer instructions and should be faster).
Wordpress became 2% faster.

So the patch becomes interesting. :)
However, many distributions prefer shared extensions, and it would be great
to invent some trick to make them fast too.

I would also prefer to keep the semantic patch small and not delete all
FETCH_TSRM() in thousands of places (at this point). Replacing the macro in
one place must be easier. It's not a problem to remove them in a second step
if the PoC really works.

Thanks. Dmitry.

On Wed, Oct 8, 2014 at 12:18 PM, Dmitry Stogov <dmi...@zend.com> wrote:

> Hi Anatol,
>
> Thanks for update, I'll take a look a bit later, but the performance
> difference looks quite good now.
>
> Dmitry.
>
> On Wed, Oct 8, 2014 at 11:26 AM, Anatol Belski <anatol....@belski.net>
> wrote:
>
>> Moin Dmitry,
>>
>> On Mon, October 6, 2014 09:01, Anatol Belski wrote:
>> > On Sun, October 5, 2014 21:32, Anatol Belski wrote:
>> >
>> >> Hi Dmitry,
>> >>
>> >> On Wed, October 1, 2014 08:01, Dmitry Stogov wrote:
>> >>
>> >>> Hi Anatol,
>> >>>
>> >>> I know, TSRM uses TLS APIs internally.
>> >>>
>> >>> In my opinion, the simplest (and probably efficient) way to get rid
>> >>> of TSRMLS_DC arguments and TSRMLS_FETCH calls, would be introducing a
>> >>> global thread specific variable.
>> >>>
>> >>> __thread void ***tsrm_ls;
>> >>>
>> >>> As I understood it won't work on Windows anyway, because windows
>> >>> linker is not smart enough to use TLS variables across different DLLs.
>> >>> Maybe it's possible to have a local thread specific copy of tsrm_ls
>> >>> for each DLL, but then we should make them to be consistent...
>> >>>
>> >>> Sorry, I can't give you any advice, and can't spend a lot of time on
>> >>> this topic.
>> >>>
>> >>> Maybe description of TLS internals on ELF systems would give you
>> >>> some ideas.
>> >>>
>> >>> http://www.akkadia.org/drepper/tls.pdf
>> >>>
>> >>> Thanks. Dmitry.
>> >>>
>> >> I've reworked this patch to take a pointer per one shared unit. Please
>> >> see here
>> >> http://git.php.net/?p=php-src.git;a=commitdiff;h=76081df168829a5cc0409fac47c217d4927ec6f6
>> >> (though this was just the first in the series). Afterwards I've adapted
>> >> ext/standard and also converted ext/sockets as an exemplary item because
>> >> it's usually compiled shared.
>> >>
>> >> With this change I experience much better performance - a diff is in
>> >> 100-50ms range compared to the master TS build. Particular positions in
>> >> bench.php show even some better result.
>> >>
>> >> However this is not a global __thread variable, but a local one to
>> >> every shared unit. Say tsrm_ls will have to be declared in every so, dll
>> >> or exe and updated on request. For now I've put the update code in MINIT
>> >> and into the first ctor (zmm is the one in the php7ts.dll) called. The
>> >> ctor seems to be the only reliable place (but maybe I'm wrong), despite
>> >> it'll be called for every request instead of per thread, that won't be
>> >> very bad.
>> >>
>> >> I'd suggest to go this way so we have the same flow everywhere.
>> >>
>>
>> the perf issue is fixed now, still yet core only converted, but here are
>> Zend/bench.php results on 64 bit
>>
>> master ts linux
>>
>> simple             0.158
>> simplecall         0.050
>> simpleucall        0.148
>> simpleudcall       0.151
>> mandel             0.310
>> mandel2            0.337
>> ackermann(7)       0.088
>> ary(50000)         0.010
>> ary2(50000)        0.009
>> ary3(2000)         0.154
>> fibo(30)           0.285
>> hash1(50000)       0.029
>> hash2(500)         0.023
>> heapsort(20000)    0.072
>> matrix(20)         0.082
>> nestedloop(12)     0.204
>> sieve(30)          0.062
>> strcat(200000)     0.014
>> ------------------------
>> Total              2.185
>>
>>
>> native-tls linux
>>
>> simple             0.072
>> simplecall         0.036
>> simpleucall        0.163
>> simpleudcall       0.169
>> mandel             0.297
>> mandel2            0.354
>> ackermann(7)       0.123
>> ary(50000)         0.010
>> ary2(50000)        0.009
>> ary3(2000)         0.158
>> fibo(30)           0.396
>> hash1(50000)       0.030
>> hash2(500)         0.024
>> heapsort(20000)    0.072
>> matrix(20)         0.069
>> nestedloop(12)     0.130
>> sieve(30)          0.054
>> strcat(200000)     0.011
>> ------------------------
>> Total              2.178
>>
>>
>> master ts windows
>>
>> simple             0.100
>> simplecall         0.048
>> simpleucall        0.146
>> simpleudcall       0.120
>> mandel             0.292
>> mandel2            0.364
>> ackermann(7)       0.091
>> ary(50000)         0.009
>> ary2(50000)        0.008
>> ary3(2000)         0.133
>> fibo(30)           0.238
>> hash1(50000)       0.025
>> hash2(500)         0.020
>> heapsort(20000)    0.076
>> matrix(20)         0.069
>> nestedloop(12)     0.168
>> sieve(30)          0.048
>> strcat(200000)     0.011
>> ------------------------
>> Total              1.965
>>
>>
>> native-tls windows
>>
>> simple             0.100
>> simplecall         0.050
>> simpleucall        0.108
>> simpleudcall       0.110
>> mandel             0.292
>> mandel2            0.347
>> ackermann(7)       0.097
>> ary(50000)         0.009
>> ary2(50000)        0.008
>> ary3(2000)         0.140
>> fibo(30)           0.280
>> hash1(50000)       0.025
>> hash2(500)         0.021
>> heapsort(20000)    0.075
>> matrix(20)         0.072
>> nestedloop(12)     0.176
>> sieve(30)          0.048
>> strcat(200000)     0.010
>> ------------------------
>> Total              1.969
>>
>>
>> Still there is some room for improvement (for instance the fibo results),
>> but the overall result shows at least same perf now. What do you think
>> guys?
>>
>> Regards
>>
>> Anatol