Method to test all sse2 calls?
Hi, I just finished coding a software implementation of emmintrin.h. It removes all of the builtins and uses inlined C functions instead. This is to allow SSE2 based code to run, albeit slowly, on machines without SSE2 support. I am looking for a program, script, or whatever that can be used to test all 200 odd _mm_* SSE2 functions to make sure that they actually work right. Presumably such a thing must be included in the gcc distribution, so that a new build can be verified to work properly. However I have no clue where it would be or how to use it. Can somebody please point me in the right direction? Thank you, David Mathog mat...@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech
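Independently of whatever the gcc testsuite provides, one way to approach this is a small table-driven check that runs each routine on fixed inputs and compares every lane with a scalar reference. The sketch below is only an illustration: `v4u32`, `soft_add_epi32`, `ref_add_epi32`, and `run_add_checks` are hypothetical names, the vector type is built with the GCC/Clang `vector_size` extension rather than taken from emmintrin.h, and `soft_add_epi32` merely stands in for a routine under test such as `_mm_add_epi32`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit vector of four unsigned 32-bit lanes (GCC/Clang extension). */
typedef uint32_t v4u32 __attribute__((vector_size(16)));

/* Stand-in for the routine under test, e.g. a software _mm_add_epi32. */
static v4u32 soft_add_epi32(v4u32 a, v4u32 b)
{
    return a + b;
}

/* Scalar reference: add lane by lane, wrapping modulo 2^32. */
static void ref_add_epi32(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
}

/* Run the routine on every row of a small input table.
   Returns the number of lanes that disagree with the reference. */
static int run_add_checks(void)
{
    static const uint32_t in[][2][4] = {
        { { 0u, 1u, 2u, 3u },                   { 10u, 20u, 30u, 40u } },
        { { 0xFFFFFFFFu, 0x80000000u, 7u, 0u }, { 1u, 0x80000000u, 8u, 0u } },
    };
    int bad = 0;

    for (size_t t = 0; t < sizeof in / sizeof in[0]; t++) {
        v4u32 va, vb, vr;
        uint32_t got[4], want[4];

        assert(sizeof va == sizeof got);   /* lanes must cover the whole vector */
        memcpy(&va, in[t][0], sizeof va);
        memcpy(&vb, in[t][1], sizeof vb);
        vr = soft_add_epi32(va, vb);
        memcpy(got, &vr, sizeof got);
        ref_add_epi32(in[t][0], in[t][1], want);
        for (int i = 0; i < 4; i++)
            bad += (got[i] != want[i]);
    }
    return bad;
}
```

A return value of 0 from `run_add_checks()` means every lane matched. The same pattern (input table, routine under test, scalar reference) could be repeated per function, with wrap-around and sign-boundary values in the table.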
Method to disable code SSE2 generation but still use -msse2
My software implementation of SSE2 now passes all the testsuite programs. In case anybody else ever needs this, it is here: http://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/soft_emmintrin.h I compiled that with a target program and gprof showed all the time in resulting binary in the inlined functions. It ran about 4X slower than the SSE2 hardware version, which is about what I expected. So, so far so good. What I am worried about now is that since it was invoked with "-msse2" the compiler may still be generating SSE2 calls within the inlined functions. Is there a way to definitively disable this but still retain -msse2 on the command line? For instance, here is one of the software version inline functions: /* vector subtract the two doubles in an __m128d */ static __inline __m128d __attribute__((__always_inline__)) _mm_sub_pd (__m128d __A, __m128d __B) { return (__m128d)((__v2df)__A - (__v2df)__B); } In the original gcc emmintrin.h that called a builtin _explicitly_. I also want to avoid having the compiler use the same builtin _implicitly_. If it uses SSE, 3DNOW or MMX implicitly, in this example, that would be fine, it just cannot use any SSE2 hardware. Actually, one thing I was never very clear on, do -msse2 -m3dnow etc. only provide access to the corresponding machine operations through the _mm* (or whatever) definitions in the header file, or does the compiler also figure out vector operations by itself during the optimization phase of compilation? Thank you, David Mathog mat...@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech
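On the narrower question of whether SSE2 code generation is switched on for a given compile: gcc predefines the macro `__SSE2__` when it is enabled and leaves it undefined under -mno-sse2. A minimal sketch of a guard built on that macro follows; `sse2_codegen_enabled` and `require_no_sse2` are hypothetical helper names, not something found in emmintrin.h.

```c
#include <assert.h>

/* Returns 1 when this translation unit was compiled with SSE2 code
   generation enabled (the compiler predefines __SSE2__), otherwise 0. */
static int sse2_codegen_enabled(void)
{
#ifdef __SSE2__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical startup guard for a software-only build: aborts if the
   caller says this is a soft-SSE2 build but SSE2 codegen is still on. */
static void require_no_sse2(int building_soft_sse2)
{
    assert(!(building_soft_sse2 && sse2_codegen_enabled()));
}
```

In a header the same condition can be tested at compile time instead, with `#if defined(SOFT_SSE2) && defined(__SSE2__)` followed by an `#error` line.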
Re: Method to disable code SSE2 generation but still use -msse2
Ian Lance Taylor wrote:
> No. If I understand what you are doing, I don't think you want to use
> -msse2 at all. In fact I think you want -mno-sse2.

Following your suggestion, -mno-sse2 was tried, which generated an error message well beyond my comprehension:

gcc -std=gnu99 -g -pg -pthread -O4 -DSOFT_SSE2 -msse -mno-sse2 -DHAVE_CONFIG_H -I../../easel -I../../easel -I. -I.. -I. -I../../src -o msvfilter.o -c msvfilter.c
msvfilter.c: In function 'p7_MSVFilter':
msvfilter.c:208: error: unable to find a register to spill in class 'GENERAL_REGS'
msvfilter.c:208: error: this is the insn:
(insn:HI 3569 3568 3570 302 ../../easel/emmintrin.h:2334 (set (strict_low_part (subreg:HI (reg:TI 1514) 0))
        (mem:HI (plus:SI (reg/f:SI 20 frame)
                (const_int -30 [0xffe2])) [14 S2 A16])) 40 {*movstricthi_1} (insn_list:REG_DEP_TRUE 3568 (nil))
    (nil))
msvfilter.c:208: confused by earlier errors, bailing out
make: *** [msvfilter.o] Error 1

Line 208 in msvfilter.c is the closing "}" of the p7_MSVFilter function. Line 2334 in emmintrin.h is the return statement in the snippet below:

static __inline __m128i __attribute__((__always_inline__))
_mm_shufflelo_epi16(__m128i __A, int __B){
  __v8hi __tmp = { EMM_UINT2(__A)[__B & 3],
                   EMM_UINT2(__A)[__B>>2 & 3],
                   EMM_UINT2(__A)[__B>>4 & 3],
                   EMM_UINT2(__A)[__B>>6 & 3],
                   EMM_UINT2(__A)[4], EMM_UINT2(__A)[5],
                   EMM_UINT2(__A)[6], EMM_UINT2(__A)[7]};
  return (__m128i) __tmp;
}

where EMM_UINT2 is this:

#define EMM_UINT2(a) ((unsigned short *)&(a))

If -mno-sse2 is changed to -msse2, that compile completes without errors or warnings.

gcc --version is: gcc (GCC) 4.2.3 (4.2.3-6mnb1)

What does that compiler error mean?

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
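For readers unfamiliar with the operation in the snippet, the sketch below spells out what the low-half shuffle computes, using a plain struct of eight 16-bit lanes and no pointer casts. `vec8x16` and `shufflelo16` are hypothetical names for illustration and are not taken from soft_emmintrin.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit value viewed as eight 16-bit lanes. */
typedef struct { uint16_t h[8]; } vec8x16;

/* Low-half shuffle: each of the four low lanes is selected by a 2-bit
   field of imm (bits 1:0 pick lane 0, bits 3:2 lane 1, and so on);
   the four high lanes are copied through unchanged. */
static vec8x16 shufflelo16(vec8x16 a, int imm)
{
    vec8x16 r = a;                       /* high lanes 4..7 pass through */

    assert(imm >= 0 && imm <= 255);      /* the selector is an 8-bit immediate */
    for (int i = 0; i < 4; i++)
        r.h[i] = a.h[(imm >> (2 * i)) & 3];
    return r;
}
```

With a selector of 0x1B (bit fields 3, 2, 1, 0 from lane 0 upward) the four low lanes come out in reverse order, which makes a convenient first test case.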
Re: Method to disable code SSE2 generation but still use -msse2
The last mysterious error message went away when the same code was compiled on a machine with a more recent gcc (4.4.1). Shortly after I hit the next roadblock. Here is foo.c (a modified version of sse2-cmpsd-1.c from the version 4.5.1 testsuite): >8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8 #ifndef CHECK_H #define CHECK_H "sse2-check.h" #endif #ifndef TEST #define TEST sse2_test #endif #include CHECK_H #include static __m128d __attribute__((noinline, unused)) test (__m128d s1, __m128d s2) { printf("test s1.x"); _mm_dump_fd(s1); printf("test s2.x"); _mm_dump_fd(s2); return _mm_add_pd (s1, s2); } static void TEST (void) { union128d u, s1, s2; double e[2]; s1.x = _mm_set_pd (2134.3343,1234.635654); s2.x = _mm_set_pd (41124.234,2344.2354); printf("s10 1 %lf %lf\n",s1.a[0],s1.a[1]); printf("s20 1 %lf %lf\n",s2.a[0],s2.a[1]); printf("s1.x"); _mm_dump_fd(s1.x); printf("s2.x"); _mm_dump_fd(s2.x); u.x = test (s1.x, s2.x); e[0] = s1.a[0] + s2.a[0]; e[1] = s1.a[1] + s2.a[1]; printf("s1.x"); _mm_dump_fd(s1.x); printf("s2.x"); _mm_dump_fd(s2.x); printf("expected e0 e1 %lf %lf\n",e[0],e[1]); printf("result r0 r1 %lf %lf\n",u.a[0],u.a[1]); if (check_union128d (u, e)) abort (); } >8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8 When compiled with -mno-sse2 the run fails. Bizarrely, it seems to be passing data into the test function incorrectly, notice that in test the low double in s2 is the high double in s1, instead of the original low double in s2 from outside the calling function. This erroneous value propagates into my inline code where it is added (correctly, but of course to the wrong final sum since the inputs were wrong). gcc -Wall -msse -mno-sse2 -I. 
-lm -DSOFT_SSE2 -DEMMSOFTDBG -O1 -o foo_wno foo.c ./foo_wno mm_set_pd, in 2134.334300 1234.635654 mm_set_pd, in 41124.234000 2344.235400 s10 1 1234.635654 2134.334300 s20 1 2344.235400 41124.234000 s1.xDEBUG m_d_fd: 1234.635654 2134.334300 s2.xDEBUG m_d_fd: 2344.235400 41124.234000 test s1.xDEBUG m_d_fd: 1234.635654 2134.334300 test s2.xDEBUG m_d_fd: 2134.334300 41124.234000 IN _mm_add_pd __ADEBUG m_d_fd: 1234.635654 2134.334300 __BDEBUG m_d_fd: 2134.334300 41124.234000 s1.xDEBUG m_d_fd: 1234.635654 2134.334300 s2.xDEBUG m_d_fd: 2344.235400 41124.234000 expected e0 e1 3578.871054 43258.568300 result r0 r1 3368.969954 43258.568300 Aborted when -msse2 is enabled however, the parameters are passed appropriately into test (and my inlined function), and the program works. Here the pass to the test function is correct, and that propagates into my inline function correctly too: gcc -Wall -msse -msse2 -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG -O1 -o foo_nono foo.c [r...@newsaf i386]# ./foo_nono mm_set_pd, in 2134.334300 1234.635654 mm_set_pd, in 41124.234000 2344.235400 s10 1 1234.635654 2134.334300 s20 1 2344.235400 41124.234000 s1.xDEBUG m_d_fd: 1234.635654 2134.334300 s2.xDEBUG m_d_fd: 2344.235400 41124.234000 test s1.xDEBUG m_d_fd: 1234.635654 2134.334300 test s2.xDEBUG m_d_fd: 2344.235400 41124.234000 IN _mm_add_pd __ADEBUG m_d_fd: 1234.635654 2134.334300 __BDEBUG m_d_fd: 2344.235400 41124.234000 s1.xDEBUG m_d_fd: 1234.635654 2134.334300 s2.xDEBUG m_d_fd: 2344.235400 41124.234000 expected e0 e1 3578.871054 43258.568300 result r0 r1 3578.871054 43258.568300 Regards, David Mathog mat...@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech
Re: Method to disable code SSE2 generation but still use -msse2
I have found several ways to "fix" the latest issue, but they all boil down to never passing an __m128d value on the call stack. For instance, change

static __m128d __attribute__((noinline, unused)) test (__m128d s1, __m128d s2)

to

static __m128d test (__m128d s1, __m128d s2)

and the program works. Similarly, change the function to

static __m128d __attribute__((noinline)) test (__m128d *s1, __m128d *s2)
{
  return _mm_add_pd (*s1, *s2);
}

and it also works.

Things I tried to force a 16 byte stack alignment that didn't work:

1 -mstackrealign
2 -mpreferred-stack-boundary=4
3 -mincoming-stack-boundary=4
4 2 and 3
5 1 and 2 and 3

I guess the bigger question is why can an __m128d be passed on the call stack reliably when -msse2 is invoked, but not otherwise? If the compiler cannot do this reliably, shouldn't it throw an error or warning?

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
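Whether a given object actually sits on a 16-byte boundary can be checked directly from its address, which separates an alignment problem from a problem with the values themselves. A minimal sketch, assuming a C11 compiler for `_Alignas`; `is_aligned16` and `aligned_offset_demo` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Checks a 16-byte-aligned buffer at a given byte offset. */
static int aligned_offset_demo(unsigned offset)
{
    static _Alignas(16) unsigned char buf[64];

    assert(offset < sizeof buf);
    return is_aligned16(buf + offset);
}
```

Inside a function such as test above, `is_aligned16(&s1)` and `is_aligned16(&s2)` would report on the incoming parameters without any need to read the hex addresses by eye.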
Re: Method to disable code SSE2 generation but still use -msse2
> Things I tried to force a 16 byte stack alignment that didn't work: > > 1 -mstackrealign > 2 -mpreferred-stack-boundary=4 > 3 -mincoming-stack-boundary=4 > 4 2 and 3 > 5 1 and 2 and 3 And this is why they didn't work. Change the test function to static __m128d __attribute__((noinline,aligned (16))) test ( __m128d s1, __m128d s2) { printf("test s1"); _mm_dump_fd(s1); printf("test s2"); _mm_dump_fd(s2); printf("loc s1 %p\n",&s1); printf("loc s2 %p\n",&s2); return _mm_add_pd (s1, s2); } compile and run gcc -Wall -msse -mno-sse2 -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG -O1 -o foo_wno foo.c [r...@newsaf i386]# ./foo_wno mm_set_pd, in 2134.334300 1234.635654 mm_set_pd, in 41124.234000 2344.235400 s10 1 1234.635654 2134.334300 s20 1 2344.235400 41124.234000 s1.xDEBUG m_d_fd: 1234.635654 2134.334300 s2.xDEBUG m_d_fd: 2344.235400 41124.234000 test s1DEBUG m_d_fd: 1234.635654 2134.334300 test s2DEBUG m_d_fd: 2134.334300 41124.234000 loc s1 0x7fff6b6ccb10 <-- loc s2 0x7fff6b6ccb00 <-- s1.xDEBUG m_d_fd: 1234.635654 2134.334300 s2.xDEBUG m_d_fd: 2344.235400 41124.234000 expected e0 e1 3578.871054 43258.568300 result r0 r1 3368.969954 43258.568300 Aborted s1 and s2 within test are already 16 byte aligned, so the extra alignment switches did not help. Somehow this code u.x = test (s1.x, s2.x); is putting the wrong values for s2 onto the call stack. Bizarre. Either I'm missing something or turning off SSE2 is uncovering a compiler bug. Thanks, David Mathog mat...@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech
Re: Method to disable code SSE2 generation but still use -msse2
I renamed the test case gccprob.c and made two binaries and two assembler files:

gcc -Wall -msse -mno-sse2 -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG \
  -O0 -o gccprob_wno gccprob.c
gcc -Wall -msse -mno-sse2 -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG \
  -O0 -S -o gccprob_wno.s gccprob.c
gcc -Wall -msse -msse2 -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG \
  -O0 -S -o gccprob_nono.s gccprob.c
gcc -Wall -msse -msse2 -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG \
  -O0 -o gccprob_nono gccprob.c

The _wno variants have the problem passing __m128d on the stack; the _nono variants do not. I packed up all 5 files and put them here (retrieve-only directory, no directory listings in pickup):

http://saf.bio.caltech.edu/pub/pickup/gccprob.tar.gz

I am not an assembler programmer. If one of you who is could have a look at the two .s files, maybe we can get to the bottom of this.

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
Re: Method to disable code SSE2 generation but still use -msse2
The problem is specific for 64 bit environments, made these: gcc -Wall -msse -mno-sse2 -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG \ -O0 -m32 -S -o gccprob_wno32.s gccprob.c gcc -Wall -msse -mno-sse2 -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG \ -O0 -m32 -o gccprob_wno32 gccprob.c gcc -Wall -msse -msse2 -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG \ -O0 -m32 -o gccprob_nono32 gccprob.c gcc -Wall -msse -msse2 -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG \ -O0 -m32 -S -o gccprob_nono32.s gccprob.c and both binaries work correctly. Added them to the set here: http://saf.bio.caltech.edu/pub/pickup/gccprob.tar.gz Specifics on the environment where the problem is seen: OS: Mandriva Linux release 2010.0 (Official) for x86_64 gcc (GCC) 4.4.1 Dual Dual Core Opteron 280. Arima HDAMAI motherboard. 64 bit targets only, 32 bit is OK. Regards, David Mathog mat...@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech
Re: Method to test all sse2 calls?
What is: __builtin_ia32_vec_ext_v2df ??? It wasn't in the original emmintrin.h, so presumably isn't actually part of SSE2, but it is present in the testsuite, and it is not visible to the compiler when -mno-sse2 is set. See for instance the files sse2-vec-#.c. (Randomly selected) Example: sse2-vec-4.c: res[2] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 2); gcc -Wall -msse -mno-sse2 -I. -m32 -lm -DSOFT_SSE2 -o foo sse2-vec-4.c sse2-vec-4.c: In function 'sse2_test': sse2-vec-4.c:27: warning: implicit declaration of function '__builtin_ia32_vec_ext_v8hi' /root/tmp/ccYAq3IB.o: In function `sse2_test': sse2-vec-4.c:(.text+0x58c): undefined reference to `__builtin_ia32_vec_ext_v8hi' . . . /root/tmp/ccYAq3IB.o:sse2-vec-4.c:(.text+0x613): more undefined references to `__builtin_ia32_vec_ext_v8hi' follow collect2: ld returned 1 exit status Thanks, David Mathog mat...@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech
Re: Method to test all sse2 calls?
Ian Lance Taylor , wrote: > Tests that directly invoke __builtin functions are not appropriate for > your replacement for emmintrin.h. Clearly. However, I do not see why these are in the test routines in the first place. They seem not to be needed. I made the changes below my signature, eliminating all of the vector builtins, and the programs still worked with both -msse2 and -mno-sse2 plus my software SSE2. If anything the test programs are much easier to understand without the builtins. There is also a (big) problem with sse2-vec-2.c (and -2a, which is empty other than an #include sse2-vec-2.c). There are no explicit sse2 operations within this test program. Moreover, the code within the tests does not work. Finally, if one puts a print statement anywhere in the test that is there, compiles it with: gcc -msse -msse2 there will be no warnings, and the run will appear to show a valid test, but in actuality the test will never execute! This shows part of the problem: gcc -Wall -msse -msse2 -o foo sse2-vec-2.c sse-os-support.h:27: warning: 'sse_os_support' defined but not used sse2-check.h:10: warning: 'do_test' defined but not used (also for -m64) There must be some sort of main in there, but no test, it does nothing and returns a valid status. When stuffed with debug statements: for (i = 0; i < 2; i++) masks[i] = i; printf("DEBUG res[0] %llX\n",res[0]); printf("DEBUG res[1] %llX\n",res[1]); printf("DEBUG val1.ll[0] %llX\n",val1.ll[0]); printf("DEBUG val1.ll[1] %llX\n",val1.ll[1]); for (i = 0; i < 2; i++) if (res[i] != val1.ll [masks[i]]){ printf("DEBUG i %d\n",i); printf("DEBUG masks[i] %d\n",masks[i]); printf("DEBUG val1.ll [masks[i]] %llX\n", val1.ll [masks[i]]); abort (); } and compiled with my software SSE2 gcc -Wall -msse -mno-sse2 -I. 
-O0 -m32 -lm -DSOFT_SSE2 -DEMMSOFTDBG -o foo sse2-vec-2.c It emits: DEBUG res[0] 3020100 DEBUG res[1] 7060504 DEBUG val1.ll[0] 706050403020100 DEBUG val1.ll[1] F0E0D0C0B0A0908 DEBUG i 0 DEBUG masks[i] 0 DEBUG val1.ll [masks[i]] 706050403020100 Aborted True enough 3020100 != 706050403020100, but what kind of test is that??? Regards, David Mathog mat...@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech changes to sse2-vec-*.c routines to eliminate all of the __builtin calls: ls -1 sse2-vec*dist | grep -v vec-2 | extract -cols 'diff --context=0 [1,-6] [1,]' | execinput *** sse2-vec-1.c2010-11-24 09:06:46.0 -0800 --- sse2-vec-1.c.dist 2010-11-24 09:06:39.0 -0800 *** *** 27,28 ! res[0] = val1.d[msk0]; ! res[1] = val1.d[msk1]; --- 27,28 ! res[0] = __builtin_ia32_vec_ext_v2df ((__v2df)val1.x, msk0); ! res[1] = __builtin_ia32_vec_ext_v2df ((__v2df)val1.x, msk1); *** sse2-vec-3.c2010-11-24 09:09:13.0 -0800 --- sse2-vec-3.c.dist 2010-11-24 09:07:48.0 -0800 *** *** 27,30 ! res[0] = val1.i[0]; ! res[1] = val1.i[1]; ! res[2] = val1.i[2]; ! res[3] = val1.i[3]; --- 27,30 ! res[0] = __builtin_ia32_vec_ext_v4si ((__v4si)val1.x, 0); ! res[1] = __builtin_ia32_vec_ext_v4si ((__v4si)val1.x, 1); ! res[2] = __builtin_ia32_vec_ext_v4si ((__v4si)val1.x, 2); ! res[3] = __builtin_ia32_vec_ext_v4si ((__v4si)val1.x, 3); *** sse2-vec-4.c2010-11-24 09:10:00.0 -0800 --- sse2-vec-4.c.dist 2010-11-24 09:07:48.0 -0800 *** *** 27,34 ! res[0] = val1.s[0]; ! res[1] = val1.s[1]; ! res[2] = val1.s[2]; ! res[3] = val1.s[3]; ! res[4] = val1.s[4]; ! res[5] = val1.s[5]; ! res[6] = val1.s[6]; ! res[7] = val1.s[7]; --- 27,34 ! res[0] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 0); ! res[1] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 1); ! res[2] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 2); ! res[3] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 3); ! res[4] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 4); ! res[5] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 5); ! 
res[6] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 6); ! res[7] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 7); *** sse2-vec-5.c2010-11-24 09:11:09.0 -0800 --- sse2-vec-5.c.dist 2010-11-24 09:07:48.0 -0800 *** *** 27,42 ! res[0] = val1.c[0]; ! res[1] = val1.c[1]; ! res[2] = val1.c[2]; ! res[3] = val1.c[3]; ! res[4] = val1.c[4]; ! res[5] = val1.c[5]; ! res[6] = val1.c[6]; ! res[7] = val1.c[7]; ! res[8] = val1.c[8]; ! res[9] = val1.c[9]; ! res[10] = val1.c[10]; ! res[11] = val1.c[11]; ! res[12] = val1.c[12]; ! res[13] = val1.c[13]; ! res[14] = val1.c[14]; ! res[15] = val1.c[15]; --- 27,42 ! res[0] = __builtin_ia32_vec_ext_v16qi ((__v16qi)val1.x, 0); ! res[1] = __builtin_ia32_vec_ext_v16qi ((__v16qi)val1.
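The mismatch printed earlier in this message (3020100 against 706050403020100) has the shape one gets when the same 16 bytes are read as 32-bit lanes in one place and as 64-bit lanes in another. The sketch below reproduces that effect with a union; `vec_view`, `make_ramp`, `first_lane32`, and `first_lane64` are hypothetical names, and the hex values in the comments assume a little-endian machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical union giving three views of the same 16 bytes. */
typedef union {
    unsigned char c[16];
    uint32_t      i[4];
    uint64_t      ll[2];
} vec_view;

/* Fill the bytes with 0x00, 0x01, ... 0x0F. */
static vec_view make_ramp(void)
{
    vec_view v;

    for (int k = 0; k < 16; k++)
        v.c[k] = (unsigned char)k;
    return v;
}

/* First 32-bit lane: 0x03020100 on a little-endian machine. */
static uint32_t first_lane32(void)
{
    vec_view v = make_ramp();
    return v.i[0];
}

/* First 64-bit lane: 0x0706050403020100 on a little-endian machine. */
static uint64_t first_lane64(void)
{
    vec_view v = make_ramp();

    assert(sizeof v.ll == sizeof v.c);   /* both views span all 16 bytes */
    return v.ll[0];
}
```

Comparing `first_lane32()` with `first_lane64()` fails for the same reason the modified test aborted: the two reads cover different widths of the same data.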
Re: Method to test all sse2 calls?
Ian Lance Taylor wrote: > Your changes are relying on a gcc extension which was only recently > added, more recently than those tests were added to the testsuite. Only > recently did gcc acquire the ability to use [] to access elements in a > vector. That isn't what my changes did. The array accesses are to the arrays in the union - nothing cutting edge there. The data is accessed through the array specified by .d (or .s etc.) not to name.x[index]. > So I think you may have misinterpreted the __builtin_ia32_vec_ext_v2di > builtin function. That function treats the vector as containing two > 8-byte integers, and pulls out one or the other depending on the second > argument. Your dumps of res[0] and res[1] suggest that you are treating > the vector as four 4-byte integers and pulling out specific ones. Yup, my bad, put in d where it should have been ll. Also fixed the problem I induced in sse2-check.h, where too large a chunk was commented out, that was causing the gcc -Wall -msse2 problem. The changed part in the original source was if ((edx & bit_SSE2) && sse_os_support ()) and is now: #if !defined(SOFT_SSE2) if ((edx & bit_SSE2) && sse_os_support ()) #else if (sse_os_support ()) #endif /*SOFT_SSE2*/ My software SSE2 passes all 165 of the sse2 tests that are complete programs. However, there is a problem in the real world. While the sse2 programs in the testsuite do exercise the _mm* functions, they do so one at a time. I have found that in real code, which makes multiple _mm* calls, if -O0 is not used, the wrong results (may) come out. % gcc -std=gnu99 -g -pg -pthread -O0 -msse -mno-sse2 -DSOFT_SSE2 -m32 -g -pg -DHAVE_CONFIG_H -L../../easel -L.. -L. -I../../easel -I../../easel -I. -I.. -I. 
-I../../src -Dp7MSVFILTER_TESTDRIVE -o msvfilter_utest ./msvfilter.c -Wl,--start-group -lhmmer -lhmmerimpl -Wall -Wl,--end-group -leasel -lm % ./msvfilter_utest (no output, it ran correctly) % gcc -std=gnu99 -g -pg -pthread -O1 -msse -mno-sse2 -DSOFT_SSE2 -m32 -g -pg -DHAVE_CONFIG_H -L../../easel -L.. -L. -I../../easel -I../../easel -I. -I.. -I. -I../../src -Dp7MSVFILTER_TESTDRIVE -o msvfilter_utest ./msvfilter.c -Wl,--start-group -lhmmer -lhmmerimpl -Wall -Wl,--end-group -leasel -lm % ./msvfilter_utest msv filter unit test failed: scores differ (-50.37, -10.86) Going to higher optimization and there are even bigger issues, like not compiling at all (even with gcc 4.4.1): % gcc -std=gnu99 -g -pg -pthread -O2 -msse -mno-sse2 -DSOFT_SSE2 -m32 -g -pg -DHAVE_CONFIG_H -L../../easel -L.. -L. -I../../easel -I../../easel -I. -I.. -I. -I../../src -Dp7MSVFILTER_TESTDRIVE -o msvfilter_utest ./msvfilter.c -Wl,--start-group -lhmmer -lhmmerimpl -Wall -Wl,--end-group -leasel -lm ../../easel/emmintrin.h:2178: warning: dereferencing pointer '({anonymous})' does break strict-aliasing rules ../../easel/emmintrin.h:2178: note: initialized from here . . (same sort of message many many times) . ./msvfilter.c:208: error: unable to find a register to spill in class 'GENERAL_REGS' ./msvfilter.c:208: error: this is the insn: (insn 1944 1943 1945 46 ../../easel/emmintrin.h:2348 (set (strict_low_part (subreg:HI (reg:TI 1239) 0)) (mem:HI (reg/f:SI 96 [ pretmp.1031 ]) [13 S2 A16])) 47 {*movstricthi_1} (nil)) ./msvfilter.c:208: confused by earlier errors, bailing out Would changing the use of inlined functions to defines let the compiler digest it better? For instance: static __inline __m128i __attribute__((__always_inline__)) _mm_andnot_si128 (__m128i __A, __m128i __B) { return (~__A) & __B; } becomes #define _mm_andnot_si128(A,B) (~A & B) That approach will get really messy for the more complicated _mm*. 
In general terms, can somebody give me a hint as to the sorts of things that if found in inlined functions might cause the compiler to optimize to invalid code? Thanks, David Mathog mat...@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech
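One caution on the macro idea: written exactly as `(~A & B)`, the macro changes meaning whenever an argument is itself an expression, because `~` then binds only to the first operand of that expression. A small sketch of the difference, using plain 32-bit integers instead of __m128i to keep it short; `ANDNOT_BAD`, `ANDNOT_OK`, and the two wrapper functions are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define ANDNOT_BAD(A, B) (~A & B)        /* arguments not parenthesized */
#define ANDNOT_OK(A, B)  (~(A) & (B))    /* each argument parenthesized */

/* Expands to (~x | y & b): ~ applies to x alone, and & binds tighter than |. */
static uint32_t andnot_bad(uint32_t x, uint32_t y, uint32_t b)
{
    return ANDNOT_BAD(x | y, b);
}

/* Expands to (~(x | y) & (b)), the intended and-not of (x | y) with b. */
static uint32_t andnot_ok(uint32_t x, uint32_t y, uint32_t b)
{
    /* Sanity check: and-not with an all-zero mask leaves b unchanged. */
    assert(ANDNOT_OK((uint32_t)0, b) == b);
    return ANDNOT_OK(x | y, b);
}
```

With x = 0x0F, y = 0xF0 and b = 0xFF the parenthesized form gives 0, while the unparenthesized form gives 0xFFFFFFF0. An always-inline function does not have this hazard, since its arguments are evaluated before the body runs.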
best data extraction for SSE2 emulation?
In my software SSE2 emulation I am currently using this sort of approach to extract data fields out of __m128i and __m128d vectors:

#define EMM_SINT4(a) ((int *)&(a))

static __inline __m128i __attribute__((__always_inline__))
_mm_slli_epi32 (__m128i __A, int __B)
{
  __v4si __tmp = { EMM_SINT4(__A)[0] << __B, EMM_SINT4(__A)[1] << __B,
                   EMM_SINT4(__A)[2] << __B, EMM_SINT4(__A)[3] << __B};
  return (__m128i)__tmp;
}

This works fine when testing one _mm function at a time, but does not work reliably in real programs unless -O0 is used. I think at least part of the problem is that once the function is inlined the parameter __A is in some cases a register variable, and the pointer method is not valid there. To get around that I'm thinking of introducing an explicit local variable, like this:

static __inline __m128i __attribute__((__always_inline__))
_mm_slli_epi32 (__m128i __A, int __B)
{
  __m128i A = __A;
  __v4si __tmp = { EMM_SINT4(A)[0] << __B, EMM_SINT4(A)[1] << __B,
                   EMM_SINT4(A)[2] << __B, EMM_SINT4(A)[3] << __B};
  return (__m128i)__tmp;
}

I'm not sure that will work all the time either. The only other approach I am aware of would be something like this:

typedef union {
  __m128i vi;
  __m128d vd;
  int s[4];
  unsigned int us[4];
  /* etc. for other types */
} emm_universal;

#define EMM_SINT4(a) (a).s

static __inline __m128i __attribute__((__always_inline__))
_mm_slli_epi32 (__m128i __A, int __B)
{
  emm_universal A;
  A.vi = __A;
  __v4si __tmp = { EMM_SINT4(A)[0] << __B, EMM_SINT4(A)[1] << __B,
                   EMM_SINT4(A)[2] << __B, EMM_SINT4(A)[3] << __B};
  return (__m128i)__tmp;
}

The union approach seems to be just a different way to spin the pointer operations. For gcc in particular, is one approach or the other to be preferred, and why?

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
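A third option, alongside the pointer cast and the union, is to copy the lanes out with memcpy, operate on an ordinary array, and copy the result back; this does not depend on how the aliasing rules apply to the cast. The sketch below is an illustration under assumptions, not the code in soft_emmintrin.h: `v4si_t`, `soft_slli_epi32`, and `lane_of` are hypothetical names, the vector type comes from the GCC/Clang `vector_size` extension, and shift counts outside 0..31 are assumed to yield zero in every lane.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit vector of four 32-bit lanes (GCC/Clang extension). */
typedef int32_t v4si_t __attribute__((vector_size(16)));

/* Shift each 32-bit lane left by count without casting pointers. */
static v4si_t soft_slli_epi32(v4si_t a, int count)
{
    uint32_t lane[4];
    v4si_t r;

    assert(sizeof lane == sizeof a);
    memcpy(lane, &a, sizeof lane);          /* vector -> array */
    for (int i = 0; i < 4; i++) {
        /* Assumption: out-of-range counts give zero; a shift by >= 32
           would be undefined in C, so it is handled explicitly. */
        lane[i] = (count < 0 || count > 31) ? 0u : lane[i] << count;
    }
    memcpy(&r, lane, sizeof r);             /* array -> vector */
    return r;
}

/* Read one lane back out, again through memcpy. */
static int32_t lane_of(v4si_t v, int i)
{
    int32_t lane[4];

    memcpy(lane, &v, sizeof lane);
    return lane[i & 3];
}
```

Working on unsigned lanes also sidesteps the undefined behavior of left-shifting a negative signed int, which the `int *` version can hit.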
vector extension bug?
I tried to track down the bug mentioned previously in testing my software SSE2 when compiled with -m64 and ended up removing all of the CHECK and my own includes without eliminating the bug. The test program works fine with -m32, or with -m64 -msse2, but it fails with -m64 -mno-sse2. Here is the greatly reduced gccprob2.c:

8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<
#include <stdio.h> /* for printf */

typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

typedef union {
  __m128d x;
  double a[2];
} union128d;

#define EMM_FLT8(a) ((double *)&(a))

void test ( __m128d s1, __m128d s2)
{
  printf("test s1 %lf %lf\n",EMM_FLT8(s1)[0],EMM_FLT8(s1)[1]);
  printf("test s2 %lf %lf\n",EMM_FLT8(s2)[0],EMM_FLT8(s2)[1]);
}

int main (void)
{
  __attribute__ ((aligned (16))) union128d s1;
  s1.a[0] = 1.0;
  s1.a[1] = 2.0;
  printf("s1 %lf %lf\n",s1.a[0],s1.a[1]);
  test (s1.x, s1.x);
}
8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<

Test runs:

% gcc -msse -mno-sse2 -m64 -o foo gccprob2.c
% ./foo #first value in s2 is wrong
s1 1.00 2.00
test s1 1.00 2.00
test s2 2.00 2.00

% gcc -msse -msse2 -m64 -o foo gccprob2.c
% ./foo
s1 1.00 2.00
test s1 1.00 2.00
test s2 1.00 2.00

% gcc -msse -mno-sse2 -lm -m32 -o foo gccprob2.c
% ./foo
s1 1.00 2.00
test s1 1.00 2.00
test s2 1.00 2.00

% gcc --version
gcc (GCC) 4.4.1

% cat /etc/release
Mandriva Linux release 2010.0 (Official) for x86_64

% cat /proc/cpuinfo | head -10
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 280
stepping : 2
cpu MHz : 1000.000
cache size : 1024 KB
physical id : 0
siblings : 2

Is there something wrong with this program or is this a compiler bug?

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
Strangeness with SSE(1)
I'm seeing the oddest thing with a function compiled like:

mpicc -std=gnu99 -O1 -g -m32 -pthread -msse -mno-sse2 -DHAVE_CONFIG_H -I../../easel -I../../easel -I. -I.. -I. -I../../src -o fwdback.o -c fwdback.c

using both gcc versions

gcc (GCC) 4.4.1 (on a 64 bit linux)
gcc (GCC) 4.2.3 (4.2.3-6mnb1) (on a 32 bit linux)

on OMPI 1.4.3. The compilers are on Opterons, the worker node where it fails is an Athlon MP. (Shouldn't be any difference with -mno-sse2 off, right?) Basically it comes down to (many lines of code omitted)

register __m128 xEv;
fprintf(stderr,"DEBUG0 xEV %lld\n",xEv);fflush(stderr);
xEv = _mm_setzero_ps();
fprintf(stderr,"DEBUGB xEV %lld\n",xEv);fflush(stderr); /* problem */

throwing an error when run in Valgrind in a particular program at the second printf.

==13053== Conditional jump or move depends on uninitialised value(s)
==13053==    at 0x4BE50BC: vfprintf (in /lib/libc-2.10.1.so)
==13053==    by 0x4BE9411: ??? (in /lib/libc-2.10.1.so)
==13053==    by 0x4BE4492: vfprintf (in /lib/libc-2.10.1.so)
==13053==    by 0x4BEE7CE: fprintf (in /lib/libc-2.10.1.so)
==13053==    by 0x807FC19: forward_engine (fwdback.c:305) <
==13053==    by 0x8080289: p7_ForwardParser (fwdback.c:143)
==13053==    by 0x8075B08: p7_Tau (evalues.c:442)
==13053==    by 0x8076554: p7_Calibrate (evalues.c:109)
==13053==    by 0x8061815: calibrate (p7_builder.c:629)
==13053==    by 0x80618D6: p7_SingleBuilder (p7_builder.c:393)
==13053==    by 0x80570C9: main (jackhmmer.c:1068)

How can xEv possibly be uninitialized in that position? Note the problem initially manifested much further down in the code, here:

_mm_store_ss(&xE, xEv);

As far as Valgrind is concerned it starts right after the _mm_setzero_ps().
After my signature is that function from the start down to the DEBUGB line with all lines present - sorry about the wrap:

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

static int
forward_engine(int do_full, const ESL_DSQ *dsq, int L, const P7_OPROFILE *om, P7_OMX *ox, float *opt_sc)
{
  register __m128 mpv, dpv, ipv; /* previous row values */
  register __m128 sv;            /* temp storage of 1 curr row value in progress */
  register __m128 dcv;           /* delayed storage of D(i,q+1) */
  register __m128 xEv;           /* E state: keeps max for Mk->E as we go */
  register __m128 xBv;           /* B state: splatted vector of B[i-1] for B->Mk calculations */
  __m128 zerov;                  /* splatted 0.0's in a vector */
  float xN, xE, xB, xC, xJ;      /* special states' scores */
  int i;                         /* counter over sequence positions 1..L */
  int q;                         /* counter over quads 0..nq-1 */
  int j;                         /* counter over DD iterations (4 is full serialization) */
  int Q = p7O_NQF(om->M);        /* segment length: # of vectors */
  __m128 *dpc = ox->dpf[0];      /* current row, for use in {MDI}MO(dpp,q) access macro */
  __m128 *dpp;                   /* previous row, for use in {MDI}MO(dpp,q) access macro */
  __m128 *rp;                    /* will point at om->rfv[x] for residue x[i] */
  __m128 *tp;                    /* will point into (and step thru) om->tfv */

  /* Initialization. */
  ox->M = om->M;
  ox->L = L;
  ox->has_own_scales = TRUE;     /* all forward matrices control their own scalefactors */
  zerov = _mm_setzero_ps();
  for (q = 0; q < Q; q++)
    MMO(dpc,q) = IMO(dpc,q) = DMO(dpc,q) = zerov;
  xE = ox->xmx[p7X_E] = 0.;
  xN = ox->xmx[p7X_N] = 1.;
  xJ = ox->xmx[p7X_J] = 0.;
  xB = ox->xmx[p7X_B] = om->xf[p7O_N][p7O_MOVE];
  xC = ox->xmx[p7X_C] = 0.;

  ox->xmx[p7X_SCALE] = 1.0;
  ox->totscale = 0.0;

#if p7_DEBUGGING
  if (ox->debugging) p7_omx_DumpFBRow(ox, TRUE, 0, 9, 5, xE, xN, xJ, xB, xC); /* logify=TRUE, =0, width=8, precision=5*/
#endif

  for (i = 1; i <= L; i++)
    {
      fprintf(stderr,"DEBUGA i %d\n",i);fflush(stderr);
      dpp = dpc;
      dpc = ox->dpf[do_full * i]; /* avoid conditional, use do_full as kronecker delta */
      rp = om->rfv[dsq[i]];
      tp = om->tfv;
      dcv = _mm_setzero_ps();
      xEv = _mm_setzero_ps();
      fprintf(stderr,"DEBUGB xEV %lld\n",xEv);fflush(stderr);
Re: Strangeness with SSE(1)
Can somebody please explain the behavior of the following program to me? cat >test.c < #include #include int main(void){ register __m128 var; fprintf(stdout,"pre %X\n",var); var = _mm_setzero_ps(); fprintf(stdout,"post %X\n",var); fprintf(stdout,"zerof %X\n",0.0f); exit(EXIT_SUCCESS); } EOD gcc -O0 -g -std=gnu99 -msse -mno-sse2 -m32 -o test test.c ./test pre FFC5FC98 post FFC5FC98 zerof 0 gcc --version gcc (GCC) 4.4.1 Now I know that var is 16 bytes, not 4, so the %X isn't appropriate to show all of it, but apparently it doesn't show any of it, since any 4 bytes out of the 16 should have been 0. Plus if this is run in valgrind there are Conditional jump or move depends on uninitialised value(s) at both of the print statements that access var. Adding -Wall test.c:7: warning: format '%X' expects type 'unsigned int', but argument 3 has type '__m128' test.c:9: warning: format '%X' expects type 'unsigned int', but argument 3 has type '__m128' test.c:10: warning: format '%X' expects type 'unsigned int', but argument 3 has type 'double' test.c:7: warning: 'var' is used uninitialized in this function But the last one goes away if the first fprintf is commented out. So gcc passes _something_ for var to fprintf, but what? Thanks, David Mathog mat...@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech
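fprintf has no conversion for a vector type, and a `%X` paired with an argument that is not an unsigned int is undefined behavior in C, so what gets printed need not be any part of var. One way to inspect the value is to copy it into a float array first and print the elements. A minimal sketch, assuming the GCC/Clang `vector_size` extension; `v4sf_t`, `format_v4sf`, and `zero_vector_prints_zero` are hypothetical names, with `v4sf_t` standing in for __m128.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for __m128: four packed floats (GCC/Clang extension). */
typedef float v4sf_t __attribute__((vector_size(16)));

/* Writes the four lanes of v into buf as text.
   Returns the snprintf result (characters needed, excluding the terminator). */
static int format_v4sf(char *buf, size_t n, v4sf_t v)
{
    float lane[4];

    assert(sizeof lane == sizeof v);
    memcpy(lane, &v, sizeof lane);   /* copy out; never pass v through "..." */
    return snprintf(buf, n, "%f %f %f %f",
                    lane[0], lane[1], lane[2], lane[3]);
}

/* Formats an all-zero vector and reports whether the text is as expected. */
static int zero_vector_prints_zero(void)
{
    char text[64];
    v4sf_t z = { 0.0f, 0.0f, 0.0f, 0.0f };

    format_v4sf(text, sizeof text, z);
    return strcmp(text, "0.000000 0.000000 0.000000 0.000000") == 0;
}
```

The same reasoning applies to the `%lld` conversions used with xEv earlier in the thread: the Valgrind report there may be about how the argument is read by fprintf rather than about _mm_setzero_ps().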