I have been working on slightly modifying a software package by Sean Eddy called HMMER 3. The hardware acceleration was originally SSE2, but since most of our compute nodes only have SSE1 and MMX I rewrote a few small sections to use just those instructions. (And yes, as far as I can tell it invokes emms after each MMX usage, before any floating point operations are run.) On top of that, each binary has 3 options for running the programs: single threaded, threaded, or MPI (using Ompi143). For all other programs in this package everything works everywhere. For one called "jackhmmer" the table below results (+ = runs correctly, - = problems), where the exact same problem is run in each test (theoretically exercising exactly the same routines, just under different threading control):
            SSE2  SSE1
Single       +     +
Threaded     +     +
Ompi143      +     -

The negative result for the SSE1/Ompi143 combination happens whether the worker nodes are Athlon MP (SSE1 only) or Athlon64. The test machine for the single and threaded runs is a two CPU Opteron 280 (4 cores total). Ompi143 is 32 bit everywhere (local copies though). There have been no modifications whatsoever made to the main jackhmmer.c file, which is where the various run methods are implemented.

Now if there were some intrinsic problem with my SSE1 code it should presumably manifest in the Single and Threaded versions as well (the thread control is different, but they all feed through the same underlying functions), or in one of the other programs, and that isn't seen. Running under valgrind using Single or Threaded produces no warnings. Using mpirun with valgrind on the SSE2 build produces 3 warnings: two related to OMPI itself, which are seen in every OMPI program run in valgrind, and one caused by an MPI_Send operation where the buffer contains some uninitialized data (this is nothing toxic, just bytes in fixed length fields which were never set because a shorter string is stored there).
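To illustrate what I mean by harmless uninitialized bytes in a fixed length field, here is a minimal sketch. It is not the HMMER code: FIELD_LEN and both function names are made up for the example. It assumes the sender simply wants every byte of the field defined before the buffer is handed to MPI_Send, which should be enough to silence that particular warning.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FIELD_LEN 32   /* hypothetical fixed field width, not HMMER's */

/* Copy s into a fixed length field, zeroing the whole field first so
 * that no byte is left undefined. Returns 0 on success, or -1 if s
 * (plus its terminator) does not fit, in which case field is untouched. */
int fill_fixed_field(char *field, size_t field_len, const char *s)
{
    size_t n;

    assert(field != NULL && s != NULL);
    n = strlen(s);
    if (n >= field_len) return -1;
    memset(field, 0, field_len);   /* define every byte of the field */
    memcpy(field, s, n);           /* terminator already in place */
    return 0;
}

/* Count the nonzero bytes that follow the stored string. After
 * fill_fixed_field() this is 0; after a plain strcpy() into a field
 * that was never cleared it is whatever junk was left behind. */
size_t nonzero_after_string(const char *field, size_t field_len)
{
    size_t i = 0, count = 0;

    while (i < field_len && field[i] != '\0') i++;   /* skip the string */
    for (; i < field_len; i++)
        if (field[i] != '\0') count++;
    return count;
}
```

With a plain strcpy() into an uncleared field the trailing bytes keep whatever was there before, and valgrind flags them when the whole field goes out over the wire, even though the receiver never looks past the terminator.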
The MPI_Send warning looks like this:

==19802== Syscall param writev(vector[...]) points to uninitialised byte(s)
==19802==    at 0x4C77AC1: writev (in /lib/libc-2.10.1.so)
==19802==    by 0x8A069B5: mca_btl_tcp_frag_send (in /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so)
==19802==    by 0x8A0626E: mca_btl_tcp_endpoint_send (in /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so)
==19802==    by 0x8A01ADC: mca_btl_tcp_send (in /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so)
==19802==    by 0x7FA24A9: mca_pml_ob1_send_request_start_prepare (in /opt/ompi143.X32/lib/openmpi/mca_pml_ob1.so)
==19802==    by 0x7F98443: mca_pml_ob1_send (in /opt/ompi143.X32/lib/openmpi/mca_pml_ob1.so)
==19802==    by 0x4A8530F: PMPI_Send (in /opt/ompi143.X32/lib/libmpi.so.0.0.2)
==19802==    by 0x808D5F2: p7_oprofile_MPISend (mpi.c:101)
==19802==    by 0x805762E: main (jackhmmer.c:1149)
==19802==  Address 0x770bc9d is 15,101 bytes inside a block of size 15,389 alloc'd
==19802==    at 0x49E3A12: realloc (vg_replace_malloc.c:476)
==19802==    by 0x808D4E3: p7_oprofile_MPISend (mpi.c:88)
==19802==    by 0x805762E: main (jackhmmer.c:1149)

Do that for the SSE1 version and the same 3 errors are seen, plus many more like the following:

==9416== Conditional jump or move depends on uninitialised value(s)
==9416==    at 0x807FE3E: forward_engine (fwdback.c:420)
==9416==    by 0x8080051: p7_ForwardParser (fwdback.c:143)
==9416==    by 0x806C3CC: p7_Pipeline (p7_pipeline.c:590)
==9416==    by 0x80564F0: main (jackhmmer.c:1426)

Unfortunately this makes absolutely no sense. Line 420 is

    if (xE > 1.0e4)

which tells us that xE wasn't set (fine), so assaying for uninitialized values with statements like:

    fprintf(stderr,"DEBUG xEv %lld\n",xEv);fflush(stderr);

(each of which generates its own uninitialized value message) the first uninitialized variable appears very early in the code, right after this _mm_setzero_ps:

    register __m128 xEv;
    //other stuff that does not touch xEv
    xEv = _mm_setzero_ps();

Now this is hair pulling for many reasons.
The first is that nothing of substance was changed in this file (just some #defines that resolve to the same values as they had originally). The second is that this is an SSE1 operation even in the original unmodified code. The third is that it just isn't possible for xEv to be uninitialized after that statement - yet it is. (Running valgrind with --smc-check=all turns up nothing more than running without that parameter.)

Here is the relevant section in xmmintrin.h:

    /* Create a vector of zeros. */
    extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
    _mm_setzero_ps (void)
    {
      return __extension__ (__m128){ 0.0f, 0.0f, 0.0f, 0.0f };
    }

Of course all of this nonsense is happening on a worker node, which isn't making it any easier to get to the root of the problem.

The module where these uninitialized variables are seen was compiled like this:

    mpicc -std=gnu99 -O1 -g -m32 -pthread -msse -mno-sse2 -DHAVE_CONFIG_H -I../../easel -I../../easel -I. -I.. -I. -I../../src -o fwdback.o -c fwdback.c

Building it on a 64 bit machine (that's why the -m32 is there) or a 32 bit machine gives the same result.

If any of you have seen something like this before and can suggest a way to proceed I would be very grateful.

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
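
P.S. One way I could imagine narrowing this down is a standalone test that does nothing but zero an SSE register, reduce it to a scalar, and compare it against the same threshold, using SSE1 intrinsics only. The sketch below is an assumption about what such a test might look like, not code taken from HMMER: hsum_ps, zero_then_sum, sum4 and check_setzero are hypothetical names, and the horizontal sum is only my stand-in for however xE is derived from xEv.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE1 intrinsics only, nothing from SSE2 */

/* Horizontal sum of the four lanes using only SSE1 operations.
 * Hypothetical helper standing in for the reduction from xEv to xE. */
float hsum_ps(__m128 v)
{
    float r;

    /* lanes (a,b,c,d) + (b,c,d,a) = (a+b, b+c, c+d, d+a) */
    v = _mm_add_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 3, 2, 1)));
    /* add the vector rotated by two lanes: lane 0 becomes a+b+c+d */
    v = _mm_add_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)));
    _mm_store_ss(&r, v);   /* write lane 0 to memory */
    return r;
}

/* Zero a register exactly as the problem code does, then reduce it. */
float zero_then_sum(void)
{
    __m128 xEv = _mm_setzero_ps();
    return hsum_ps(xEv);
}

/* Sanity check of the reduction itself on four known values. */
float sum4(float a, float b, float c, float d)
{
    return hsum_ps(_mm_set_ps(d, c, b, a));   /* lane 0 = a */
}

/* The check to call from a tiny main() when running under valgrind. */
void check_setzero(void)
{
    float xE = zero_then_sum();

    assert(xE == 0.0f);
    assert(!(xE > 1.0e4f));   /* same form as the test at fwdback.c:420 */
}
```

Called from a trivial main(), built with the same -O1 -g -m32 -msse -mno-sse2 flags, and run under valgrind both directly and under mpirun, this should show whether the warning follows the instruction sequence or the MPI environment.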