On Jul 31, 2015 2:12 AM, "Matt Wilmas" <php_li...@realplain.com> wrote:
>
> Hi Dmitry, Bogdan,
>
>
> ----- Original Message -----
> From: "Dmitry Stogov"
> Sent: Thursday, July 30, 2015
>
>> Hi Bogdan,
>>
>> On Wed, Jul 29, 2015 at 5:22 PM, Andone, Bogdan <bogdan.and...@intel.com>
>> wrote:
>>
>>> Hi Guys,
>>>
>>> My name is Bogdan Andone and I work for Intel in the area of SW
>>> performance analysis and optimizations.
>>> We would like to actively contribute to the Zend PHP project and to
>>> involve ourselves in finding new performance improvement opportunities
>>> based on available and/or new hardware features.
>>> I am still in the source-code-digesting phase, but I had a look at the
>>> fast_memcpy() implementation in the opcache extension, which uses SSE
>>> intrinsics.
>>>
>>> If I am not wrong, the fast_memcpy() function is not currently used, as
>>> I didn't find the "-msse4.2" gcc flag in the Makefile. I assume you
>>> probably didn't see any performance benefit, so you preserved the
>>> generic memcpy() usage.
>>>
>>
>> This is not SSE4.2; this is SSE2.
>> Any x86_64 target implements SSE2, so it's enabled by default on x86_64
>> systems (at least on Linux).
>> It may also be enabled on x86 targets by adding the "-msse2" option.
>
>
> Right, I was gonna say, I think that was a mistake, and all x86_64 should
> be using it at least...
>
> Of course, using anything newer that needs special options is nearly
> useless, since I guess the vast majority aren't building themselves, but
> using lowest-common-denominator repos. I had been wondering about speeding
> up some other things, maybe taking advantage of SSE4.x (string stuff, I
> don't know), but... like I said. Runtime checks would be awesome, but
> except for recent GCC, the intrinsics aren't available unless the
> corresponding SSE option is enabled (lame!). So it requires a separate
> compilation unit. :-/
>
> Of course, I guess if the intrinsic maps simply to the instruction, one
> could just do it with inline asm, if one wanted to do runtime CPU checking.
>
>
>>> I would like to propose a slightly different implementation which uses
>>> _mm_store_si128() instead of _mm_stream_si128(). This ensures that the
>>> copied memory is preserved in the data cache, which is not bad, as the
>>> interpreter will start to use this data without the need to go back one
>>> more time to memory. _mm_stream_si128() in the current implementation is
>>> intended for stores where we want to avoid reading data into the cache
>>> and polluting it; in the opcache scenario it seems that preserving the
>>> data in cache has a positive impact.
>>>
>>
>> _mm_stream_si128() was used on purpose, to avoid CPU cache pollution,
>> because data copied from SHM to process memory is not necessarily used
>> before eviction.
>> By the way, I'm not completely sure. Maybe _mm_store_si128() can provide
>> a better result.
>
>
> Interesting (that _stream was used on purpose). :-)
>
>
>>> Running php-cgi -T10000 on WordPress 4.1/index.php, I see a ~1%
>>> performance increase for the new version of fast_memcpy() compared with
>>> the generic memcpy(). Same result using a full load test with http_load
>>> on an 18-core Haswell-EP.
>>>
>>
>> 1% is a really big improvement.
>> I'll be able to check this only next week (when back from vacation).
>
>
> Well, he talks like he was comparing to *generic* memcpy(), so...? But I'm
> not sure how that would have been accomplished.
>
> BTW guys, I was wondering before why fast_memcpy() is only in this opcache
> area? For the prefetch and/or cache pollution reasons?
Just because, in this place, we may copy big blocks, and we may also align
them properly, to use compact and fast inlined code.

> Because shouldn't the library functions in glibc, etc. already be using
> versions optimized for the CPU at runtime? So is generic memcpy() already
> "fast"? (Other than the overhead of a function call.)

glibc already uses an optimized memcpy(), but it is a universal function
that has to check for different conditions, like the alignment of source
and destination, and the length.

>
>
>>> Here is the proposed pull request:
>>> https://github.com/php/php-src/pull/1446
>>>
>>> Related to the SW prefetching instructions in fast_memcpy()... they are
>>> not really useful in this place. Their benefit is almost negligible, as
>>> the address requested for prefetch will be needed at the next iteration
>>> (a few cycles later), while the time needed to get data from RAM is
>>> usually >100 cycles. Nevertheless... they don't hurt, and it seems they
>>> still have a very small benefit, so I preserved the original instruction
>>> and added a new prefetch request for the destination pointer.
>>>
>>
>> I also didn't see a significant difference from software prefetching.
>
>
> So how about prefetching "further"/more iterations ahead...?

I tried, but didn't see a difference either.

Thanks. Dmitry.

>
>> Thanks. Dmitry.
>>
>>
>>>
>>> Hope it helps,
>>> Bogdan
>
>
> - Matt