>>>> Running php-cgi -T10000 on WordPress 4.1/index.php I see ~1% performance
>>>> increase for the new version of fast_memcpy() compared with the generic
>>>> memcpy(). Same result using a full load test with http_load on a Haswell EP
>>>> with 18 cores.
>>>>
>>>
>>> 1% is a really big improvement.
>>> I'll be able to check this only next week (when back from vacation).
>>
>>
>> Well, he talks like he was comparing to *generic* memcpy(), so...? But not
>> sure how that would have been accomplished.
>>
>> BTW guys, I was wondering before why fast_memcpy() is only in this opcache
>> area? For the prefetch and/or cache pollution reasons?
> Just because, in this place, we may copy big blocks, and we may also align
> them properly, to use compact and fast inlined code.
Yeah... in fact all my numbers are against the current fast_memcpy() implementation, not against generic memcpy(). Sorry for the misleading information... :-/. I was playing in my corner with some SSE4.2 experiments and I wasn't aware that SSE2 is enabled by default, without any compiler switch.

Coming back to the issue, and also trying to answer laruence's request for more numbers: I am running php-cgi -T10000 on a Haswell with a 45MB L3 cache.

The improvement is visible in scenarios where the amount of data loaded via opcache is significant while the actual execution time is not so big; this is the case for real-life workloads:
- WordPress 4.1 & MediaWiki 1.24: ~1% performance increase
- Drupal 7.36: ~0.6% performance increase

The improvement is not visible in synthetic benchmarks (mandelbrot, micro_bench, ...), which load a small amount of bytecode and are compute intensive.

The explanation lies in data cache misses. I did a deeper analysis on WordPress 4.1 using the perf tool:
- _mm_stream based implementation: ~3x10^-4 misses/instruction => 1.023 instructions/cycle
- _mm_store based implementation: ~9x10^-6 misses/instruction (33x less) => 1.035 instructions/cycle

So the overall performance gain is fully explained by the increase in instructions/cycle due to fewer cache misses; copying the opcache data acts as a kind of "software prefetcher" for the execution that follows. This phenomenon is most visible on processors with big caches. If I move to a smaller L3 cache (45MB -> 6.75MB), the 1% WordPress gain becomes a 0.6% gain (as the cache's capacity to keep the "prefetched" opcache data without polluting the execution path becomes smaller).

Coming back to generic memcpy(): the fast_memcpy() implementation seems to be slightly smaller in terms of executed instructions (hard to measure the real IR data due to run-to-run variations). Averaging a couple of measurements to absorb run-to-run effects, I see a ~0.2% performance increase in favor of fast_memcpy() with _mm_store; it is the same increase I see for the implementation with SW prefetchers compared with the case where no SW prefetch is in place. So the gain we see might be explained by the fact that memcpy() does not use SW prefetching; just a guess...

Kind Regards,
Bogdan
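
P.S. In case it helps the discussion, here is a minimal sketch of the two variants I compared. This is not the actual Zend fast_memcpy() code; the function names are mine, and it assumes 16-byte aligned buffers and a size that is a multiple of 64 bytes:

#include <stddef.h>     /* size_t */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Non-temporal variant: _mm_stream_si128 bypasses the cache, so the
 * copied bytecode is NOT cache-resident when execution starts. */
static void copy_stream(void *dest, const void *src, size_t size)
{
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    const __m128i *end = (const __m128i *)((const char *)src + size);

    while (s < end) {
        _mm_prefetch((const char *)(s + 4), _MM_HINT_NTA); /* SW prefetch */
        _mm_stream_si128(d + 0, _mm_load_si128(s + 0));
        _mm_stream_si128(d + 1, _mm_load_si128(s + 1));
        _mm_stream_si128(d + 2, _mm_load_si128(s + 2));
        _mm_stream_si128(d + 3, _mm_load_si128(s + 3));
        s += 4;
        d += 4;
    }
    _mm_sfence(); /* order the non-temporal stores before proceeding */
}

/* Regular-store variant: _mm_store_si128 leaves the copied data in the
 * cache, acting as the "software prefetcher" for the execution that
 * follows -- this is where the lower miss rate comes from. */
static void copy_store(void *dest, const void *src, size_t size)
{
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    const __m128i *end = (const __m128i *)((const char *)src + size);

    while (s < end) {
        _mm_prefetch((const char *)(s + 4), _MM_HINT_T0); /* SW prefetch */
        _mm_store_si128(d + 0, _mm_load_si128(s + 0));
        _mm_store_si128(d + 1, _mm_load_si128(s + 1));
        _mm_store_si128(d + 2, _mm_load_si128(s + 2));
        _mm_store_si128(d + 3, _mm_load_si128(s + 3));
        s += 4;
        d += 4;
    }
}

The only difference between the two is the store instruction, which is exactly what the miss/instruction numbers above isolate.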