Replying to myself: haypo found the origin of the problem. Apparently it stems from a GCC bug [1] (which should be fixed on x86 as of version 4.4). The bug is that GCC does not always keep the stack 16-byte aligned, so the "__m128 myvector" local variable in the previous code might not be aligned, and SSE instructions that require aligned operands can then fault. A workaround is to align the stack before calling the inner function, as done here:
http://www.bitbucket.org/ogrisel/ctypes_sse/changeset/dc27626824b8/

New version of the previous C code:

<quote>
#include <stdio.h>
#include <emmintrin.h>

void wrapped_dummy_sse()
{
    // allocate an aligned vector of 128 bits
    __m128 myvector;

    printf("[dummy_sse] before calling setzero\n");
    fflush(stdout);

    // initialize it to four 32-bit floats set to zero
    myvector = _mm_setzero_ps();

    printf("[dummysse] after calling setzero\n");
    fflush(stdout);

    // display the content of the vector
    float* part = (float*) &myvector;
    printf("[dummysse] myvector = {%f, %f, %f, %f}\n",
           part[0], part[1], part[2], part[3]);
}

void dummy_sse(void)
{
    (void)__builtin_return_address(1); // to force call frame
    // round %esp down to a 16-byte boundary (32-bit x86 only)
    asm volatile ("andl $-16, %%esp" ::: "%esp");
    wrapped_dummy_sse();
}

int main()
{
    dummy_sse();
    return 0;
}
</quote>

[1] For a nice summary of the issue, see e.g.
http://www.mail-archive.com/gcc%40gcc.gnu.org/msg33101.html

Another workaround would be to allocate myvector on the heap, using malloc / posix_memalign for instance.

Best,

--
Olivie
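
P.S. For the heap-based workaround, here is a minimal sketch that I have not tried against the affected GCC versions. The function name dummy_sse_heap and its out parameter are made up for illustration and are not part of the code above. It assumes a POSIX system where posix_memalign is available. The idea is that the 128-bit value lives in heap memory whose alignment is requested explicitly, so it no longer depends on how the caller's stack is aligned (the compiler could in principle still spill SSE temporaries to the stack, so take this as a sketch rather than a guaranteed fix):

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <xmmintrin.h>

/* Hypothetical helper: zero a 128-bit vector stored on the heap and
 * copy its four floats into out. Returns 0 on success, -1 on failure. */
int dummy_sse_heap(float out[4])
{
    void *mem = NULL;
    int i;

    /* request 16-byte aligned storage for four 32-bit floats */
    if (posix_memalign(&mem, 16, 4 * sizeof(float)) != 0)
        return -1;

    /* sanity check: the returned address is a multiple of 16 */
    assert(((uintptr_t) mem % 16) == 0);

    /* aligned store of {0, 0, 0, 0} directly into the heap buffer */
    float *part = (float *) mem;
    _mm_store_ps(part, _mm_setzero_ps());

    /* copy the result out with plain scalar loads and stores */
    for (i = 0; i < 4; i++)
        out[i] = part[i];

    free(mem);
    return 0;
}
```

A caller would pass a plain float[4] array and check the return value before printing its content. On 32-bit x86 this needs to be compiled with SSE enabled (e.g. -msse).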