Hi Andone,

I'm not sure why nobody has replied to you yet. We've all looked at the PR and spent a lot of the day yesterday discussing it.
I've CC'd Dmitry. He doesn't always read internals, so this should get his attention.

Lastly, very cool ... I look forward to some more cleverness ...

Cheers
Joe

On Wed, Jul 29, 2015 at 3:22 PM, Andone, Bogdan <bogdan.and...@intel.com> wrote:

> Hi Guys,
>
> My name is Bogdan Andone and I work for Intel in the area of SW
> performance analysis and optimizations.
> We would like to actively contribute to the Zend PHP project and to involve
> ourselves in finding new performance improvement opportunities based on
> available and/or new hardware features.
> I am still in the source code digesting phase but I had a look at the
> fast_memcpy() implementation in the opcache extension which uses SSE intrinsics:
>
> If I am not wrong, the fast_memcpy() function is not currently used, as I
> didn't find the "-msse4.2" gcc flag in the Makefile. I assume you probably
> didn't see any performance benefit so you preserved generic memcpy() usage.
>
> I would like to propose a slightly different implementation which uses
> _mm_store_si128() instead of _mm_stream_si128(). This ensures that copied
> memory is preserved in data cache, which is not bad as the interpreter will
> start to use this data without the need to go back one more time to memory.
> _mm_stream_si128() in the current implementation is intended to be used for
> stores where we want to avoid reading data into the cache and the cache
> pollution; in the opcache scenario it seems that preserving the data in cache
> has a positive impact.
>
> Running php-cgi -T10000 on WordPress4.1/index.php I see ~1% performance
> increase for the new version of fast_memcpy() compared with the generic
> memcpy(). Same result using a full load test with http_load on a Haswell EP
> 18 cores.
>
> Here is the proposed pull request:
> https://github.com/php/php-src/pull/1446
>
> Related to the SW prefetching instructions in fast_memcpy()... they are
> not really useful in this place. Their benefit is almost negligible as the
> address requested for prefetch will be needed at the next iteration (a few
> cycles later), while the time needed to get data from RAM is usually >100
> cycles. Nevertheless... they don't hurt and it seems they still have a
> very small benefit, so I preserved the original instruction and I added a
> new prefetch request for the destination pointer.
>
> Hope it helps,
> Bogdan
>
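
For readers following the thread, the store-versus-stream distinction discussed in the quoted message can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the pull request or from opcache: the function names copy_cached() and copy_streaming() are hypothetical, both assume 16-byte-aligned pointers and a size that is a multiple of 64, and the prefetch distance and hints are arbitrary choices.

```c
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_store_si128, _mm_stream_si128 */
#include <stddef.h>

/* Hypothetical sketch: copy with regular stores, so the destination
 * lines stay in the data cache for whoever reads them next.
 * Assumes dest/src are 16-byte aligned and size is a multiple of 64. */
static void copy_cached(void *dest, const void *src, size_t size)
{
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    const __m128i *end = (const __m128i *)((const char *)src + size);

    while (s < end) {
        /* Hints for the next 64-byte block of source and destination.
         * A prefetch never faults, so hinting one block ahead is safe. */
        _mm_prefetch((const char *)(s + 4), _MM_HINT_NTA);
        _mm_prefetch((const char *)(d + 4), _MM_HINT_T0);

        __m128i x0 = _mm_load_si128(s + 0);
        __m128i x1 = _mm_load_si128(s + 1);
        __m128i x2 = _mm_load_si128(s + 2);
        __m128i x3 = _mm_load_si128(s + 3);

        _mm_store_si128(d + 0, x0);  /* cached store */
        _mm_store_si128(d + 1, x1);
        _mm_store_si128(d + 2, x2);
        _mm_store_si128(d + 3, x3);

        s += 4;
        d += 4;
    }
}

/* Hypothetical sketch: same loop with non-temporal stores, which
 * write around the cache to avoid polluting it. Same assumptions. */
static void copy_streaming(void *dest, const void *src, size_t size)
{
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    const __m128i *end = (const __m128i *)((const char *)src + size);

    while (s < end) {
        __m128i x0 = _mm_load_si128(s + 0);
        __m128i x1 = _mm_load_si128(s + 1);
        __m128i x2 = _mm_load_si128(s + 2);
        __m128i x3 = _mm_load_si128(s + 3);

        _mm_stream_si128(d + 0, x0);  /* non-temporal store */
        _mm_stream_si128(d + 1, x1);
        _mm_stream_si128(d + 2, x2);
        _mm_stream_si128(d + 3, x3);

        s += 4;
        d += 4;
    }
    _mm_sfence();  /* order the streaming stores before later accesses */
}
```

Both functions produce the same bytes in the destination; they differ only in whether the copied data is likely to be in cache afterwards, which is the effect the ~1% measurement above is attributed to. These intrinsics need only SSE2, which is part of the x86-64 baseline; on 32-bit x86 gcc would need -msse2.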