To perform faster loads from an unaligned memory location into an SSE
register, a common trick is to replace the default unaligned load instruction
(e.g., MOVUPS for floats) with one MOVSD followed by one MOVHPS. Using
intrinsics, this can be implemented as follows:
inline __m128 ploadu(const float* from) {
  __m128 r;
  // Load the two low floats with a single MOVSD.
  r = _mm_castpd_ps(_mm_load_sd((const double*)(from)));
  // Load the two high floats with MOVHPS.
  r = _mm_loadh_pi(r, (const __m64*)(from+2));
  return r;
}
Unfortunately, when optimizations are enabled (-O2), I found that GCC can
incorrectly reorder the instructions, leading to invalid code. For instance,
with this example:
float data[4] = {1, 2, 3, 4};
__attribute__ ((aligned(16))) float aligned_data[4];
_mm_store_ps(aligned_data, ploadu(data));
std::cout << aligned_data[0] << " " << aligned_data[1] << " "
          << aligned_data[2] << " " << aligned_data[3] << "\n";
GCC generates the following ASM:
movsd 32(%rsp), %xmm0
movl $0x40400000, 40(%rsp)
movl $0x40800000, 44(%rsp)
movl $0x3f800000, 32(%rsp)
movhps 40(%rsp), %xmm0
movl $0x40000000, 36(%rsp)
movaps %xmm0, 16(%rsp)
where the MOVSD instruction is scheduled before the stores that initialize the
first two elements of the array "data", so it reads uninitialized stack memory.
If we use the standard _mm_loadu_ps intrinsic instead, then the generated ASM
is correct, as expected:
movl $0x3f800000, 32(%rsp)
movl $0x40000000, 36(%rsp)
movl $0x40400000, 40(%rsp)
movl $0x40800000, 44(%rsp)
movups 32(%rsp), %xmm0
movaps %xmm0, 16(%rsp)
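
For reference, the _mm_loadu_ps-based variant used to obtain the listing above
amounts to a direct call to the intrinsic (the name ploadu_ref is mine, for
illustration only):

// Unaligned load through the standard intrinsic; here GCC emits the
// initializing stores before the MOVUPS, as shown above.
inline __m128 ploadu_ref(const float* from) {
  return _mm_loadu_ps(from);
}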
Please see the attachment for a complete example.
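
Since the attachment is not inlined here, a rough self-contained test
reconstructed from the snippets above would look as follows (my own sketch,
built with something like g++ -O2 on x86_64; it is not the actual attachment):

#include <iostream>
#include <xmmintrin.h>  // SSE: _mm_loadh_pi, _mm_store_ps
#include <emmintrin.h>  // SSE2: _mm_load_sd, _mm_castpd_ps

// Unaligned load implemented as MOVSD + MOVHPS (the problematic version).
inline __m128 ploadu(const float* from) {
  __m128 r;
  r = _mm_castpd_ps(_mm_load_sd((const double*)(from)));
  r = _mm_loadh_pi(r, (const __m64*)(from + 2));
  return r;
}

int main() {
  float data[4] = {1, 2, 3, 4};
  __attribute__ ((aligned(16))) float aligned_data[4];
  _mm_store_ps(aligned_data, ploadu(data));
  // Expected output is "1 2 3 4"; with the misordered MOVSD the first
  // two values come from uninitialized stack memory.
  std::cout << aligned_data[0] << " " << aligned_data[1] << " "
            << aligned_data[2] << " " << aligned_data[3] << "\n";
  return 0;
}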
--
Summary: wrong instr. dependency with some SSE intrinsics
Product: gcc
Version: 4.3.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: gael dot guennebaud at gmail dot com
GCC build triplet: x86_64-pc-linux
GCC host triplet: x86_64-pc-linux
GCC target triplet: x86_64-pc-linux
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40537