Title: s/thresh/thrash/

On Wed, 2017-08-23 at 13:55 +0100, Chris Wilson wrote:
> At the moment, the verify tests use an extremely brutal write-read of
> every dword, degrading performance to UC. If we break those up into
> cachelines, we can do a wcb write/read at a time instead, roughly 8x
> faster. We lose the accuracy of the forced wcb flushes around every dword,
> but we are retaining the overall behaviour of checking reads following
> writes instead. To compensate, we do check that a single dword write/read
> before using wcb aligned accesses.
>
> Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>

<SNIP>

> @@ -104,15 +109,78 @@ bo_copy (void *_arg)
>  	return NULL;
>  }
>
> +#if defined(__x86_64__) && !defined(__clang__)
> +#define MOVNT 512
> +
> +#pragma GCC push_options
> +#pragma GCC target("sse4.1")
> +
> +#include <smmintrin.h>
> +__attribute__((noinline))
> +static void copy_wc_page(void *dst, void *src)
> +{
> +	if (igt_x86_features() & SSE4_1) {
> +		__m128i *S = (__m128i *)src;
> +		__m128i *D = (__m128i *)dst;
> +
> +		for (int i = 0; i < PAGE_SIZE/CACHELINE; i++) {
> +			__m128i tmp[4];
> +
> +			tmp[0] = _mm_stream_load_si128(S++);
> +			tmp[1] = _mm_stream_load_si128(S++);
> +			tmp[2] = _mm_stream_load_si128(S++);
> +			tmp[3] = _mm_stream_load_si128(S++);
> +
> +			_mm_store_si128(D++, tmp[0]);
> +			_mm_store_si128(D++, tmp[1]);
> +			_mm_store_si128(D++, tmp[2]);
> +			_mm_store_si128(D++, tmp[3]);
> +		}
> +	} else
> +		memcpy(dst, src, PAGE_SIZE);
> +}

Not lib/ material? Add newline anyway.

Reviewed-by: Joonas Lahtinen <joonas.lahti...@linux.intel.com>

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation
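
For readers following the commit message, here is a rough sketch of the access pattern it describes: one single-dword write/read check first, then writes and read-backs batched a cacheline at a time. This is not code from the patch. The names verify_single_dword and verify_by_cacheline are hypothetical, the 64-byte CACHELINE is an assumption, and the sketch runs on ordinary memory, so it only illustrates the ordering of accesses and not the WC buffer behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64 /* assumed cacheline size in bytes */
#define DWORDS_PER_LINE (CACHELINE / sizeof(uint32_t))

/* Hypothetical helper: write one dword and read it straight back.
 * Returns the number of mismatches (0 or 1). */
static size_t verify_single_dword(volatile uint32_t *map, uint32_t value)
{
	assert(map != NULL);

	map[0] = value;
	return map[0] != value;
}

/* Hypothetical helper: fill one cacheline worth of dwords, then read
 * that cacheline back, instead of a write/read pair per dword.
 * Returns the number of mismatching dwords. */
static size_t verify_by_cacheline(volatile uint32_t *map, size_t ndwords,
				  uint32_t seed)
{
	size_t errors = 0;

	assert(map != NULL);

	for (size_t base = 0; base < ndwords; base += DWORDS_PER_LINE) {
		size_t n = ndwords - base;

		if (n > DWORDS_PER_LINE)
			n = DWORDS_PER_LINE;

		/* Write the whole line first... */
		for (size_t i = 0; i < n; i++)
			map[base + i] = seed + (uint32_t)(base + i);

		/* ...then check every dword of that line. */
		for (size_t i = 0; i < n; i++)
			if (map[base + i] != seed + (uint32_t)(base + i))
				errors++;
	}

	return errors;
}
```

A caller would run verify_single_dword() once on the mapping and then verify_by_cacheline() over the full range, treating any non-zero return as a failure.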