Replying to myself:

haypo found the origin of the problem. It apparently stems from a GCC
bug [1] (which should be fixed on x86 as of version 4.4): GCC does not
always ensure that the stack is 16-byte aligned, hence the
"__m128 myvector" local variable in the previous code might not be
aligned, and SSE instructions that require aligned operands then
fault. A workaround is to align the stack before calling the inner
function, as done here:

http://www.bitbucket.org/ogrisel/ctypes_sse/changeset/dc27626824b8/

New version of the previous C code:

<quote>

#include <stdio.h>
#include <emmintrin.h>


void wrapped_dummy_sse()
{
        // allocate an aligned vector of 128 bits
        __m128 myvector;

        printf("[dummy_sse] before calling setzero\n");
        fflush(stdout);

        // initialize it to four 32-bit floats, all set to zero
        myvector = _mm_setzero_ps();

        printf("[dummy_sse] after calling setzero\n");
        fflush(stdout);

        // display the content of the vector
        float* part = (float*) &myvector;
        printf("[dummy_sse] myvector = {%f, %f, %f, %f}\n",
                        part[0], part[1], part[2], part[3]);
}

void dummy_sse(void)
{
        (void)__builtin_return_address(1); // force a call frame
        // round %esp down to a multiple of 16 (32-bit x86 only)
        asm volatile ("andl $-16, %%esp" ::: "%esp");
        wrapped_dummy_sse();
}

int main()
{
        dummy_sse();
        return 0;
}

</quote>

[1] For a nice summary of the issue, see e.g.:
http://www.mail-archive.com/gcc%40gcc.gnu.org/msg33101.html

Another workaround would be to allocate myvector on the heap, using
posix_memalign for instance (plain malloc does not guarantee 16-byte
alignment on every platform).
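
A minimal sketch of that heap workaround, untested against the original
ctypes setup. The helper names alloc_aligned_vector and zero_and_sum
are mine, not from the code above. Note that the compiler may still
spill SSE temporaries to the stack (for instance when building without
optimization), so this is not a guaranteed fix on affected GCC versions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>     /* posix_memalign, free; a strict -std=c99 build on
                           glibc may need -D_POSIX_C_SOURCE=200112L */
#include <xmmintrin.h>  /* __m128, _mm_setzero_ps, _mm_storeu_ps */

/* Hypothetical helper: allocate one 16-byte-aligned __m128 on the heap.
 * Returns NULL on failure; the caller releases it with free(). */
static __m128 *alloc_aligned_vector(void)
{
        void *mem = NULL;

        /* posix_memalign returns 0 on success, an error number otherwise */
        if (posix_memalign(&mem, 16, sizeof(__m128)) != 0)
                return NULL;

        /* the address does not depend on how the caller's stack is aligned */
        assert(((uintptr_t) mem % 16) == 0);
        return (__m128 *) mem;
}

/* Hypothetical helper: zero the heap vector, return the sum of its lanes */
static float zero_and_sum(__m128 *v)
{
        float lanes[4];

        *v = _mm_setzero_ps();     /* aligned store into heap memory */
        _mm_storeu_ps(lanes, *v);  /* unaligned store: lanes is on the stack */
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

A caller would do "__m128 *v = alloc_aligned_vector();", check v
against NULL, use it (zero_and_sum(v) should give 0.0f), and finally
call free(v).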

Best,

-- 
Olivie
--
http://mail.python.org/mailman/listinfo/python-list
