Hi Andone,

    I'm not sure why nobody has replied to you yet; we've all looked at the
PR and spent much of yesterday discussing it.

    I've CC'd Dmitry; he doesn't always read internals, so this should get
his attention.

    Lastly, very cool ... I look forward to some more cleverness ...

Cheers
Joe

On Wed, Jul 29, 2015 at 3:22 PM, Andone, Bogdan <bogdan.and...@intel.com>
wrote:

> Hi Guys,
>
> My name is Bogdan Andone and I work for Intel in the area of SW
> performance analysis and optimizations.
> We would like to actively contribute to the Zend PHP project and to involve
> ourselves in finding new performance improvement opportunities based on
> available and/or new hardware features.
> I am still in the source code digesting phase, but I had a look at the
> fast_memcpy() implementation in the opcache extension, which uses SSE intrinsics:
>
> If I am not wrong, the fast_memcpy() function is not currently used, as I
> didn't find the "-msse4.2" gcc flag in the Makefile. I assume you probably
> didn't see any performance benefit, so you kept the generic memcpy() usage.
>
> I would like to propose a slightly different implementation which uses
> _mm_store_si128() instead of _mm_stream_si128(). This ensures that the copied
> memory is kept in the data cache, which helps because the interpreter will
> start to use this data without needing to go back to memory one more time.
> _mm_stream_si128() in the current implementation is intended for stores
> where we want to avoid reading data into the cache and polluting it; in the
> opcache scenario it seems that keeping the data in cache has a positive
> impact.
>
> Running php-cgi -T10000 on WordPress4.1/index.php I see a ~1% performance
> increase for the new version of fast_memcpy() compared with the generic
> memcpy(). I get the same result using a full load test with http_load on an
> 18-core Haswell EP.
>
> Here is the proposed pull request:
> https://github.com/php/php-src/pull/1446
>
> Regarding the SW prefetching instructions in fast_memcpy(): they are not
> really useful in this place. Their benefit is almost negligible, as the
> address requested for prefetch will be needed at the next iteration (a few
> cycles later), while the time needed to get data from RAM is usually >100
> cycles. Nevertheless, they don't hurt and they still seem to have a very
> small benefit, so I preserved the original instruction and added a new
> prefetch request for the destination pointer.
>
> Hope it helps,
> Bogdan
>
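
For anyone following the thread, the store-versus-stream difference Bogdan
describes can be sketched roughly as below. This is not the code from the PR
or from opcache: copy_cached() and copy_streaming() are hypothetical names,
and the sketch assumes 16-byte-aligned buffers and a size that is a multiple
of 64 bytes. It falls back to memcpy() when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2: _mm_load_si128, _mm_store_si128, _mm_stream_si128 */
#endif

/* Hypothetical helper (not the opcache implementation).
 * Assumes dest/src are 16-byte aligned and size is a multiple of 64.
 * Uses regular aligned stores, so the copied data stays in the cache. */
static void copy_cached(void *dest, const void *src, size_t size)
{
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = size / 64; /* four 16-byte vectors per iteration */

    for (size_t i = 0; i < blocks; i++, s += 4, d += 4) {
        __m128i x0 = _mm_load_si128(s + 0);
        __m128i x1 = _mm_load_si128(s + 1);
        __m128i x2 = _mm_load_si128(s + 2);
        __m128i x3 = _mm_load_si128(s + 3);
        _mm_store_si128(d + 0, x0);
        _mm_store_si128(d + 1, x1);
        _mm_store_si128(d + 2, x2);
        _mm_store_si128(d + 3, x3);
    }
#else
    memcpy(dest, src, size); /* portable fallback */
#endif
}

/* Same loop with non-temporal stores, which bypass the cache.
 * Same alignment and size assumptions as copy_cached(). */
static void copy_streaming(void *dest, const void *src, size_t size)
{
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = size / 64;

    for (size_t i = 0; i < blocks; i++, s += 4, d += 4) {
        __m128i x0 = _mm_load_si128(s + 0);
        __m128i x1 = _mm_load_si128(s + 1);
        __m128i x2 = _mm_load_si128(s + 2);
        __m128i x3 = _mm_load_si128(s + 3);
        _mm_stream_si128(d + 0, x0);
        _mm_stream_si128(d + 1, x1);
        _mm_stream_si128(d + 2, x2);
        _mm_stream_si128(d + 3, x3);
    }
    _mm_sfence(); /* make the non-temporal stores globally visible */
#else
    memcpy(dest, src, size); /* portable fallback */
#endif
}
```

Both functions produce the same bytes in the destination; they differ only in
whether the destination lines are likely to be in cache afterwards, which is
the effect the ~1% measurement above is attributed to.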
