On Jul 31, 2015 2:12 AM, "Matt Wilmas" <php_li...@realplain.com> wrote:
>
> Hi Dmitry, Bogdan,
>
>
> ----- Original Message -----
> From: "Dmitry Stogov"
> Sent: Thursday, July 30, 2015
>
>> Hi Bogdan,
>>
>> On Wed, Jul 29, 2015 at 5:22 PM, Andone, Bogdan <bogdan.and...@intel.com>
>> wrote:
>>
>>> Hi Guys,
>>>
>>> My name is Bogdan Andone and I work for Intel in the area of SW
>>> performance analysis and optimizations.
>>> We would like to actively contribute to the Zend PHP project and to involve
>>> ourselves in finding new performance improvement opportunities based on
>>> available and/or new hardware features.
>>> I am still digesting the source code, but I had a look at the
>>> fast_memcpy() implementation in the opcache extension, which uses SSE
>>> intrinsics:
>>>
>>> If I am not wrong, the fast_memcpy() function is not currently used, as I
>>> didn't find the "-msse4.2" gcc flag in the Makefile. I assume you probably
>>> didn't see any performance benefit, so you kept the generic memcpy()
>>> usage.
>>>
>>
>> This is not SSE4.2, this is SSE2.
>> Any x86_64 target implements SSE2, so it's enabled by default on x86_64
>> systems (at least on Linux).
>> It may also be enabled on x86 targets by adding the "-msse2" option.
>
>
> Right, I was gonna say, I think that was a mistake, and all x86_64 should
> be using it at least...
>
> Of course, using anything newer that needs special options is nearly
> useless, since I guess the vast majority aren't building themselves, but
> using lowest-common-denominator repos.  I had been wondering about speeding
> up some other things, maybe taking advantage of SSE4.x (string stuff, I
> don't know), but... like I said.  Runtime checks would be awesome, but
> except for the recent GCC, the intrinsics aren't available unless the
> corresponding SSE option is enabled (lame!).  So it requires a separate
> compilation unit. :-/
>
> Of course I guess if the intrinsic maps simply to the instruction, one could
> just do it with inline asm, if one wanted to do runtime CPU checking.
>
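
A minimal sketch of the runtime-check idea discussed above, assuming GCC 4.9 or later (or clang) on x86, where the target attribute lets a single function use SSE4.2 intrinsics without building the whole file with -msse4.2. The function names are hypothetical and CRC-32C is only a stand-in workload; nothing here comes from php-src.

```c
#include <nmmintrin.h>  /* SSE4.2 intrinsics (_mm_crc32_u8) */
#include <stddef.h>
#include <stdint.h>

/* Portable fallback: bitwise CRC-32C (reflected polynomial 0x82F63B78). */
static uint32_t crc32c_scalar(const unsigned char *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Only this function is compiled with SSE4.2 enabled (hypothetical name). */
__attribute__((target("sse4.2")))
static uint32_t crc32c_sse42(const unsigned char *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc = _mm_crc32_u8(crc, *p++);
    }
    return ~crc;
}

/* Runtime dispatch: pick the SSE4.2 path only if the CPU reports it. */
static uint32_t crc32c(const unsigned char *p, size_t n)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.2")) {
        return crc32c_sse42(p, n);
    }
    return crc32c_scalar(p, n);
}
```

With older compilers the SSE4.2 function would still have to live in its own compilation unit built with -msse4.2, as noted above; only the dispatch function would stay in the common code.
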
>
>>> I would like to propose a slightly different implementation which uses
>>> _mm_store_si128() instead of _mm_stream_si128(). This ensures that the
>>> copied memory stays in the data cache, which is good because the
>>> interpreter will start using this data without having to go back to
>>> memory one more time. _mm_stream_si128() in the current implementation
>>> is intended for stores where we want to avoid reading data into the
>>> cache and polluting it; in the opcache scenario it seems that keeping
>>> the data in cache has a positive impact.
>>>
>>
>> _mm_stream_si128() was used on purpose, to avoid CPU cache pollution,
>> because data copied from SHM to process memory is not necessarily used
>> before eviction.
>> By the way, I'm not completely sure. Maybe _mm_store_si128() can provide
>> better results.
>
>
> Interesting (that _stream was used on purpose). :-)
>
>
>>> Running php-cgi -T10000 on WordPress4.1/index.php I see a ~1% performance
>>> increase for the new version of fast_memcpy() compared with the generic
>>> memcpy(). Same result using a full load test with http_load on an
>>> 18-core Haswell EP.
>>>
>>
>> 1% is a really big improvement.
>> I'll only be able to check this next week (when I'm back from vacation).
>
>
> Well, he talks like he was comparing to *generic* memcpy(), so...?  But
> not sure how that would have been accomplished.
>
> BTW guys, I was wondering before why fast_memcpy() is only in this opcache
> area?  For the prefetch and/or cache pollution reasons?

Just because in this place we may copy big blocks, and we may also align
them properly, so we can use compact and fast inlined code.

>
> Because shouldn't the library functions in glibc, etc. already be using
> versions optimized for the CPU at runtime?  So is generic memcpy() already
> "fast?"  (Other than overhead for a function call.)

glibc already uses an optimized memcpy(), but it is a universal function
that has to check for different conditions, like the alignment of source and
destination, and the length.
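
A rough sketch of what such a specialized copy can look like when alignment and size are guaranteed by the caller. This is not the opcache code: the function names are hypothetical, and it assumes 16-byte-aligned pointers and a size that is a multiple of 64. The two variants differ only in the store instruction discussed in this thread.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Regular stores: the copied data stays in the cache (hypothetical name). */
static void copy_aligned_store(void *dest, const void *src, size_t size)
{
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = size / 64;

    /* Assumed preconditions, so no runtime checks are needed in the loop. */
    assert(((uintptr_t)dest & 15) == 0 && ((uintptr_t)src & 15) == 0);
    assert((size & 63) == 0);

    for (size_t i = 0; i < blocks; i++) {
        __m128i x0 = _mm_load_si128(s + 0);
        __m128i x1 = _mm_load_si128(s + 1);
        __m128i x2 = _mm_load_si128(s + 2);
        __m128i x3 = _mm_load_si128(s + 3);
        _mm_store_si128(d + 0, x0);
        _mm_store_si128(d + 1, x1);
        _mm_store_si128(d + 2, x2);
        _mm_store_si128(d + 3, x3);
        s += 4;
        d += 4;
    }
}

/* Non-temporal stores: the destination bypasses the cache (hypothetical name). */
static void copy_aligned_stream(void *dest, const void *src, size_t size)
{
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = size / 64;

    assert(((uintptr_t)dest & 15) == 0 && ((uintptr_t)src & 15) == 0);
    assert((size & 63) == 0);

    for (size_t i = 0; i < blocks; i++) {
        __m128i x0 = _mm_load_si128(s + 0);
        __m128i x1 = _mm_load_si128(s + 1);
        __m128i x2 = _mm_load_si128(s + 2);
        __m128i x3 = _mm_load_si128(s + 3);
        _mm_stream_si128(d + 0, x0);
        _mm_stream_si128(d + 1, x1);
        _mm_stream_si128(d + 2, x2);
        _mm_stream_si128(d + 3, x3);
        s += 4;
        d += 4;
    }
    _mm_sfence();  /* make the streaming stores visible before returning */
}
```

Which variant wins depends on whether the copied data is read again before it would be evicted, so it has to be measured on the real workload.
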

>
>
>>> Here is the proposed pull request:
>>> https://github.com/php/php-src/pull/1446
>>>
>>> Related to the SW prefetching instructions in fast_memcpy()... they are
>>> not really useful in this place. Their benefit is almost negligible, as
>>> the address requested for prefetch will be needed at the next iteration
>>> (a few cycles later), while the time needed to get data from RAM is
>>> usually >100 cycles. Nevertheless, they don't hurt and they still seem
>>> to have a very small benefit, so I preserved the original instruction
>>> and added a new prefetch request for the destination pointer.
>>>
>>
>> I also didn't see a significant difference from software prefetching.
>
>
> So how about prefetching "further"/more iterations ahead...?

I tried, but didn't see a difference either.
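
For reference, a small sketch of what "prefetching further ahead" means in such a loop, assuming the same preconditions as before (16-byte-aligned pointers, size a multiple of 64). The function name and the blocks_ahead parameter are hypothetical, not part of fast_memcpy().

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_prefetch */
#include <stddef.h>
#include <stdint.h>

/* Copies 64-byte blocks and prefetches the source `blocks_ahead` blocks
 * in front of the one being copied (hypothetical helper). */
static void copy_with_prefetch(void *dest, const void *src, size_t size,
                               size_t blocks_ahead)
{
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = size / 64;

    assert(((uintptr_t)dest & 15) == 0 && ((uintptr_t)src & 15) == 0);
    assert((size & 63) == 0);

    for (size_t i = 0; i < blocks; i++) {
        /* Only prefetch addresses that are still inside the source buffer. */
        if (blocks_ahead < blocks - i) {
            _mm_prefetch((const char *)(s + 4 * blocks_ahead), _MM_HINT_NTA);
        }
        __m128i x0 = _mm_load_si128(s + 0);
        __m128i x1 = _mm_load_si128(s + 1);
        __m128i x2 = _mm_load_si128(s + 2);
        __m128i x3 = _mm_load_si128(s + 3);
        _mm_store_si128(d + 0, x0);
        _mm_store_si128(d + 1, x1);
        _mm_store_si128(d + 2, x2);
        _mm_store_si128(d + 3, x3);
        s += 4;
        d += 4;
    }
}
```

With blocks_ahead set to 1 the prefetched line is needed only a few cycles later, which is the situation described above; larger values give the memory system more time, though for a sequential copy the hardware prefetcher may already cover that.
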

Thanks. Dmitry.

>
>
>> Thanks. Dmitry.
>>
>>
>>>
>>> Hope it helps,
>>> Bogdan
>
>
> - Matt
