https://llvm.org/bugs/show_bug.cgi?id=26645
Bug ID: 26645
Summary: [LIR] Non-temporal aspect dropped via conversion to memset in some cases
Product: libraries
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P
Component: Loop Optimizer
Assignee: unassignedb...@nondot.org
Reporter: warren_ris...@playstation.sony.com
CC: llvm-bugs@lists.llvm.org
Classification: Unclassified

Created attachment 15914
  --> https://llvm.org/bugs/attachment.cgi?id=15914&action=edit
test.ll

In the C++ test-case below (the associated "test.ll" file is attached), a
loop that clears a block of memory does so with non-temporal stores via the
builtin:

    __builtin_ia32_movntps(__p, __a);

(This is from the _mm_stream_ps() intrinsic, originally via an include of
<x86intrin.h>.)

Prior to r258620, the stores were done via the non-temporal store
instruction 'movntps'. With r258620, the loop is transformed to a memset()
call, and so the non-temporal aspect is lost, causing a performance
regression relative to llvm 3.8. (Side note: r258620 was reverted at
r258703, and an updated version was re-submitted at r258777. The same
behavior happens at r258777, as well as current ToT (tested r261028).)

It's understood that this is a code-performance issue due to cache
pollution. That is, the correct answer is computed irrespective of whether
the non-temporal store instructions are generated, or whether the memset()
call is used. This is analogous to the situation of bug 19370, where the
non-temporal aspect was lost in some situations, resulting in a performance
loss due to cache pollution.

Note that the loop trip count in the case below is a constant. The loop is
of the form:

    for ( unsigned int index = 0; index < sizeof( bigBlock_t ); index += 32 ) {
      ..
    }

If the trip-count is a global variable, for example:

    unsigned int theSize = sizeof( bigBlock_t );

and the loop is changed to:

    for ( unsigned int index = 0; index < theSize; index += 32 ) {
      ..
    }

then the non-temporal store instructions are again produced.
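For illustration, the variable-trip-count variant described above can be sketched as a standalone file. This is only a sketch, not part of the attached test-case: it assumes an x86 target with SSE and uses <xmmintrin.h> instead of the hand-written wrappers in test.cpp. The names nontemporal_init_var, fill_bytes and block_is_zero are hypothetical helpers introduced here, and the trailing _mm_sfence() and the alignment assert are additions that the report does not use.

```cpp
// Sketch only: variable-trip-count form of the loop from the report.
// Assumes x86 with SSE. Helper names below are hypothetical.
#include <xmmintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct bigBlock_t {
  __m128 data[256];
} __attribute__((aligned(128)));

// Trip count held in a global (as in the report) instead of a constant.
unsigned int theSize = sizeof(bigBlock_t);

// Clears the block with non-temporal stores, 32 bytes per iteration.
void nontemporal_init_var(bigBlock_t *p) {
  // movntps requires a 16-byte aligned destination.
  assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
  float *dst = reinterpret_cast<float *>(p);
  __m128 src = _mm_setzero_ps();
  for (unsigned int index = 0; index < theSize; index += 32) {
    _mm_stream_ps(dst + 0, src);
    _mm_stream_ps(dst + 4, src);
    dst += 8;
  }
  // Not in the original test-case: order the streaming stores before
  // any later stores become visible to other threads.
  _mm_sfence();
}

// Hypothetical helper: fills every byte of the block with 'value'.
void fill_bytes(bigBlock_t *p, unsigned char value) {
  unsigned char *bytes = reinterpret_cast<unsigned char *>(p);
  for (std::size_t i = 0; i < sizeof(bigBlock_t); ++i)
    bytes[i] = value;
}

// Hypothetical helper: returns true if every byte of the block is zero.
bool block_is_zero(const bigBlock_t *p) {
  const unsigned char *bytes = reinterpret_cast<const unsigned char *>(p);
  for (std::size_t i = 0; i < sizeof(bigBlock_t); ++i)
    if (bytes[i] != 0)
      return false;
  return true;
}
```

As noted in the report, the stored result is the same whether the compiler emits 'movntps' or a memset() call, so a functional check like block_is_zero() cannot tell the two apart; only inspecting the generated assembly shows whether the non-temporal aspect survived.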
_____________________________________________________________________

$ cat test.cpp
typedef float __m128 __attribute__((__vector_size__(16)));

static __inline__ __m128 __attribute__((__always_inline__, __nodebug__,
                                        __target__("sse")))
_mm_setzero_ps(void)
{
  return (__m128){ 0, 0, 0, 0 };
}

static __inline__ void __attribute__((__always_inline__, __nodebug__,
                                      __target__("sse")))
_mm_stream_ps(float *__p, __m128 __a)
{
  __builtin_ia32_movntps(__p, __a);
}

struct bigBlock_t {
  __m128 data[256];
} __attribute__((aligned(128)));

extern void nontemporal_init( bigBlock_t *p );

void nontemporal_init( bigBlock_t *p )
{
  float *dst = reinterpret_cast< float * >( p );
  __m128 src = _mm_setzero_ps();

  for ( unsigned int index = 0; index < sizeof( bigBlock_t ); index += 32 ) {
    _mm_stream_ps( dst + 0, src );
    _mm_stream_ps( dst + 4, src );
    dst += 8;
  }
}
$

The "test.ll", generated from using the r258619 build as shown below, is
attached:

$ clang++ --version
clang version 3.9.0 (trunk 258619)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: ..../llvm/bin
$ clang++ -S -emit-llvm -O0 test.cpp  # test.ll created here is attached
$

Using opt/llc from r258619, the 'movntps' instructions can be seen:

$ opt test.ll -O2 -S -o opt.ll  # opt from r258619
$ llc opt.ll -o opt.s
$ grep movntps opt.s  # 16 'movntps' instructions, since loop is unrolled
        movntps %xmm0, (%rdi,%rax)
        movntps %xmm0, 16(%rdi,%rax)
        movntps %xmm0, 32(%rdi,%rax)
        movntps %xmm0, 48(%rdi,%rax)
        movntps %xmm0, 64(%rdi,%rax)
        movntps %xmm0, 80(%rdi,%rax)
        movntps %xmm0, 96(%rdi,%rax)
        movntps %xmm0, 112(%rdi,%rax)
        movntps %xmm0, 128(%rdi,%rax)
        movntps %xmm0, 144(%rdi,%rax)
        movntps %xmm0, 160(%rdi,%rax)
        movntps %xmm0, 176(%rdi,%rax)
        movntps %xmm0, 192(%rdi,%rax)
        movntps %xmm0, 208(%rdi,%rax)
        movntps %xmm0, 224(%rdi,%rax)
        movntps %xmm0, 240(%rdi,%rax)
$ grep memset opt.s
$

Using opt/llc from r258620, the 'movntps' instructions are no longer there,
and instead there is a call to memset():

$ opt test.ll -O2 -S -o opt.ll  # opt from r258620
$ llc opt.ll -o opt.s
$ grep movntps opt.s
$ grep memset opt.s
        callq   memset
$

-- 
You are receiving this mail because:
You are on the CC list for the bug.