Method to test all sse2 calls?

2010-11-19 Thread David Mathog
Hi,

I just finished coding a software implementation of emmintrin.h.  It
removes all of the builtins and uses inlined C functions instead.  This
is to allow SSE2 based code to run, albeit slowly, on machines without
SSE2 support.  I am looking for a program, script, or whatever that can
be used to test all 200 odd _mm_* SSE2 functions to make sure that they
actually work right.  Presumably such a thing must be included in the
gcc distribution, so that a new build can be verified to work properly.
 However I have no clue where it would be or how to use it.  Can
somebody please point me in the right direction?
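
A minimal sketch of what such a per-function check could look like, assuming a scalar reference implementation is available for comparison. The names ref_sub_pd and count_mismatches are hypothetical and are not part of gcc or of any testsuite; the idea is simply to compute each lane the slow way and count the lanes that disagree.

```c
#include <stddef.h>

/* Hypothetical scalar reference for a two-lane double subtraction. */
static void ref_sub_pd(const double a[2], const double b[2], double out[2])
{
  out[0] = a[0] - b[0];
  out[1] = a[1] - b[1];
}

/* Returns the number of lanes in which got differs from expected.
   Zero means the function under test passed for this input. */
static size_t count_mismatches(const double *got, const double *expected,
                               size_t lanes)
{
  size_t bad = 0;
  for (size_t i = 0; i < lanes; i++)
    if (got[i] != expected[i])
      bad++;
  return bad;
}
```

A driver would store the emulated result into a double[2], call ref_sub_pd on the same inputs, and abort if count_mismatches returns nonzero.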

Thank you,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Method to disable code SSE2 generation but still use -msse2

2010-11-22 Thread David Mathog
My software implementation of SSE2 now passes all the testsuite
programs. In case anybody else ever needs this, it is here: 

http://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/soft_emmintrin.h

I compiled a target program with it, and gprof showed all of the time
in the resulting binary being spent in the inlined functions.  It ran about
4X slower than the SSE2 hardware version, which is about what I
expected.  So, so far so good.  What I am worried about now is that
since it was invoked with "-msse2" the compiler may still be generating
SSE2 calls within the inlined functions.  Is there a way to definitively
disable this but still retain -msse2 on the command line?  

For instance, here is one of the software version inline functions:

/*  vector subtract the two doubles in an __m128d  */
static __inline __m128d __attribute__((__always_inline__))
_mm_sub_pd (__m128d __A, __m128d __B)
{
  return (__m128d)((__v2df)__A - (__v2df)__B);
}

In the original gcc emmintrin.h that called a builtin _explicitly_.  I
also want to avoid having the compiler use the same builtin
_implicitly_.  If it uses SSE, 3DNOW or MMX implicitly, in this example,
that would be fine, it just cannot use any SSE2 hardware.

Actually, one thing I was never very clear on, do -msse2 -m3dnow
etc. only provide access to the corresponding machine operations through
the _mm* (or whatever) definitions in the header file, or does the
compiler also figure out vector operations by itself during the
optimization phase of compilation?
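
As a general note on GCC behavior (not something stated in this thread): the -m options enable an instruction set for the code generator as a whole, not only for the intrinsics in the headers. With -msse2 in effect, the auto-vectorizer (-ftree-vectorize, which -O3 enables) may turn a plain loop like the sketch below into packed SSE2 instructions, and with -mfpmath=sse scalar floating-point math is also done in SSE registers. The function name add_arrays is hypothetical.

```c
#include <stddef.h>

/* A plain scalar loop with no intrinsics.  Whether the compiler
   vectorizes it depends on the optimization level and on which
   instruction sets the -m options allow. */
static void add_arrays(double *restrict dst, const double *restrict a,
                       const double *restrict b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = a[i] + b[i];
}
```

One way to see what was generated is to compile with -S and look for packed instructions such as addpd in the assembler output.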

Thank you,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: Method to disable code SSE2 generation but still use -msse2

2010-11-22 Thread David Mathog
Ian Lance Taylor wrote:

> No.  If I understand what you are doing, I don't think you want to use
> -msse2 at all.  In fact I think you want -mno-sse2.

Following your suggestion, -mno-sse2 was tried, which generated an error
message well beyond my comprehension:

gcc -std=gnu99 -g -pg -pthread -O4 -DSOFT_SSE2 -msse -mno-sse2 
-DHAVE_CONFIG_H  -I../../easel -I../../easel -I. -I.. -I. -I../../src -o
msvfilter.o -c msvfilter.c
msvfilter.c: In function 'p7_MSVFilter':
msvfilter.c:208: error: unable to find a register to spill in class
'GENERAL_REGS'
msvfilter.c:208: error: this is the insn:
(insn:HI 3569 3568 3570 302 ../../easel/emmintrin.h:2334 (set
(strict_low_part (subreg:HI (reg:TI 1514) 0))
(mem:HI (plus:SI (reg/f:SI 20 frame)
(const_int -30 [0xffe2])) [14 S2 A16])) 40
{*movstricthi_1} (insn_list:REG_DEP_TRUE 3568 (nil))
(nil))
msvfilter.c:208: confused by earlier errors, bailing out
make: *** [msvfilter.o] Error 1

line 208 in msvfilter.c is the closing "}" on the p7_MSVFilter function.

line 2334 in emmintrin.h is the return statement in the snippet below

static __inline __m128i __attribute__((__always_inline__))
 _mm_shufflelo_epi16(__m128i __A, int __B){
 __v8hi __tmp  = { EMM_UINT2(__A)[__B& 3],
   EMM_UINT2(__A)[__B>>2 & 3],
   EMM_UINT2(__A)[__B>>4 & 3],
   EMM_UINT2(__A)[__B>>6 & 3],
   EMM_UINT2(__A)[4],
   EMM_UINT2(__A)[5],
   EMM_UINT2(__A)[6],
   EMM_UINT2(__A)[7]};
  return (__m128i) __tmp;
}

where EMM_UINT2 is this:

#define EMM_UINT2(a)   ((unsigned short *)&(a))

If -mno-sse2 is changed to -msse2 that compile completes without errors
or warnings.
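
For reference, the lane selection that _mm_shufflelo_epi16 performs can be written on plain arrays, which sidesteps vector registers entirely. This is only a sketch of the semantics, and soft_shufflelo_lanes is a hypothetical name: each 2-bit field of the immediate picks one of the low four 16-bit lanes, and the high four lanes are copied unchanged.

```c
#include <stdint.h>

/* Array-based sketch of the shufflelo lane selection.
   in and out each hold the eight 16-bit lanes of one vector. */
static void soft_shufflelo_lanes(const uint16_t in[8], int imm,
                                 uint16_t out[8])
{
  /* Low four lanes: bits 2i+1..2i of imm select the source lane. */
  for (int i = 0; i < 4; i++)
    out[i] = in[(imm >> (2 * i)) & 3];
  /* High four lanes pass through untouched. */
  for (int i = 4; i < 8; i++)
    out[i] = in[i];
}
```

With an immediate of 0x1B the low four lanes come out in reverse order.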

gcc --version is: gcc (GCC) 4.2.3 (4.2.3-6mnb1)

What does that compiler error mean?

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: Method to disable code SSE2 generation but still use -msse2

2010-11-23 Thread David Mathog
The last mysterious error message went away when the same code was
compiled on a machine with a more recent gcc (4.4.1).  Shortly afterward
I hit the next roadblock.

Here is foo.c (a modified version of sse2-cmpsd-1.c from the version
4.5.1 testsuite):

>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8
#ifndef CHECK_H
#define CHECK_H "sse2-check.h"
#endif

#ifndef TEST
#define TEST sse2_test
#endif

#include CHECK_H

#include 

static __m128d
__attribute__((noinline, unused))
test (__m128d s1, __m128d s2)
{
printf("test s1.x"); _mm_dump_fd(s1);
printf("test s2.x"); _mm_dump_fd(s2);
  return _mm_add_pd (s1, s2); 
}

static void
TEST (void)
{
  union128d u, s1, s2;
  double e[2];
   
  s1.x = _mm_set_pd (2134.3343,1234.635654);
  s2.x = _mm_set_pd (41124.234,2344.2354);
printf("s10 1 %lf %lf\n",s1.a[0],s1.a[1]);
printf("s20 1 %lf %lf\n",s2.a[0],s2.a[1]);
printf("s1.x"); _mm_dump_fd(s1.x);
printf("s2.x"); _mm_dump_fd(s2.x);
  u.x = test (s1.x, s2.x); 
   
  e[0] = s1.a[0] + s2.a[0];
  e[1] = s1.a[1] + s2.a[1];

printf("s1.x"); _mm_dump_fd(s1.x);
printf("s2.x"); _mm_dump_fd(s2.x);
printf("expected e0 e1 %lf %lf\n",e[0],e[1]);
printf("result   r0 r1 %lf %lf\n",u.a[0],u.a[1]);

  if (check_union128d (u, e))
abort ();
}
>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8>8>8<8

When compiled with -mno-sse2 the run fails.  Bizarrely, it seems to be
passing data into the test function incorrectly: notice that inside test
the low double in s2 is the high double in s1, instead of the low double
that s2 holds in the caller.  This erroneous value propagates into my
inline code, where it is added correctly, but of course to the wrong
final sum since the inputs were wrong.

gcc -Wall -msse -mno-sse2  -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG -O1  -o
foo_wno foo.c
./foo_wno
mm_set_pd, in 2134.334300 1234.635654
mm_set_pd, in 41124.234000 2344.235400
s10 1 1234.635654 2134.334300
s20 1 2344.235400 41124.234000
s1.xDEBUG m_d_fd:   1234.635654  2134.334300
s2.xDEBUG m_d_fd:   2344.235400 41124.234000
test s1.xDEBUG m_d_fd:   1234.635654  2134.334300
test s2.xDEBUG m_d_fd:   2134.334300 41124.234000
IN _mm_add_pd
__ADEBUG m_d_fd:   1234.635654  2134.334300
__BDEBUG m_d_fd:   2134.334300 41124.234000
s1.xDEBUG m_d_fd:   1234.635654  2134.334300
s2.xDEBUG m_d_fd:   2344.235400 41124.234000
expected e0 e1 3578.871054 43258.568300
result   r0 r1 3368.969954 43258.568300
Aborted

When -msse2 is enabled, however, the parameters are passed appropriately
into test (and my inlined function), and the program works.  Here
the pass to the test function is correct, and that propagates into my
inline function correctly too:

gcc -Wall -msse -msse2  -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG -O1  -o
foo_nono foo.c
[r...@newsaf i386]# ./foo_nono
mm_set_pd, in 2134.334300 1234.635654
mm_set_pd, in 41124.234000 2344.235400
s10 1 1234.635654 2134.334300
s20 1 2344.235400 41124.234000
s1.xDEBUG m_d_fd:   1234.635654  2134.334300
s2.xDEBUG m_d_fd:   2344.235400 41124.234000
test s1.xDEBUG m_d_fd:   1234.635654  2134.334300
test s2.xDEBUG m_d_fd:   2344.235400 41124.234000
IN _mm_add_pd
__ADEBUG m_d_fd:   1234.635654  2134.334300
__BDEBUG m_d_fd:   2344.235400 41124.234000
s1.xDEBUG m_d_fd:   1234.635654  2134.334300
s2.xDEBUG m_d_fd:   2344.235400 41124.234000
expected e0 e1 3578.871054 43258.568300
result   r0 r1 3578.871054 43258.568300

Regards,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: Method to disable code SSE2 generation but still use -msse2

2010-11-23 Thread David Mathog
I have found several ways to "fix" the latest issue, but they all boil
down to never passing an __m128d value on the call stack.  For instance
change

static __m128d
__attribute__((noinline, unused))
test (__m128d s1, __m128d s2)

to

static __m128d test (__m128d s1, __m128d s2)

and the program works.  Similarly, change the function to

 static __m128d __attribute__((noinline)) test (__m128d *s1, __m128d *s2)
{
  return _mm_add_pd (*s1, *s2); 
}

and it also works.
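
A self-contained sketch of the pointer-passing workaround, assuming GCC's vector_size extension. The names v2df_t and soft_add_pd_ptr are hypothetical stand-ins for __m128d and the emulated _mm_add_pd. Nothing of vector type crosses the call boundary by value.

```c
#include <string.h>

/* GCC/Clang vector extension: two doubles in 16 bytes. */
typedef double v2df_t __attribute__((vector_size(16)));

/* Adds *a and *b lane by lane and stores the result in *dst.
   The lanes are copied out with memcpy, so no pointer of an
   unrelated type is dereferenced. */
static void soft_add_pd_ptr(v2df_t *dst, const v2df_t *a, const v2df_t *b)
{
  double x[2], y[2], r[2];
  memcpy(x, a, sizeof x);
  memcpy(y, b, sizeof y);
  r[0] = x[0] + y[0];
  r[1] = x[1] + y[1];
  memcpy(dst, r, sizeof r);
}
```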

Things I tried to force a 16 byte stack alignment that didn't work:

1  -mstackrealign
2  -mpreferred-stack-boundary=4
3  -mincoming-stack-boundary=4
4  2 and 3
5  1 and 2 and 3

I guess the bigger question is why can an __m128d be passed on the call
stack reliably when -msse2 is invoked, but not otherwise?  If the
compiler cannot do this reliably shouldn't it throw an error or warning?

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: Method to disable code SSE2 generation but still use -msse2

2010-11-23 Thread David Mathog
> Things I tried to force a 16 byte stack alignment that didn't work:
> 
> 1  -mstackrealign
> 2  -mpreferred-stack-boundary=4
> 3  -mincoming-stack-boundary=4
> 4  2 and 3
> 5  1 and 2 and 3

And this is why they didn't work.  Change the test function to

 static __m128d __attribute__((noinline,aligned (16))) test ( __m128d
s1, __m128d s2)
{
printf("test s1"); _mm_dump_fd(s1);
printf("test s2"); _mm_dump_fd(s2);
printf("loc s1 %p\n",&s1);
printf("loc s2 %p\n",&s2);
  return _mm_add_pd (s1, s2); 
}

compile and run

 gcc -Wall -msse -mno-sse2  -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG  -O1  -o
foo_wno foo.c
[r...@newsaf i386]# ./foo_wno
mm_set_pd, in 2134.334300 1234.635654
mm_set_pd, in 41124.234000 2344.235400
s10 1 1234.635654 2134.334300
s20 1 2344.235400 41124.234000
s1.xDEBUG m_d_fd:   1234.635654  2134.334300
s2.xDEBUG m_d_fd:   2344.235400 41124.234000
test s1DEBUG m_d_fd:   1234.635654  2134.334300
test s2DEBUG m_d_fd:   2134.334300 41124.234000
loc s1 0x7fff6b6ccb10   <--
loc s2 0x7fff6b6ccb00   <--
s1.xDEBUG m_d_fd:   1234.635654  2134.334300
s2.xDEBUG m_d_fd:   2344.235400 41124.234000
expected e0 e1 3578.871054 43258.568300
result   r0 r1 3368.969954 43258.568300
Aborted

s1 and s2 within test are already 16 byte aligned, so the extra
alignment switches did not help.  Somehow this code

  u.x = test (s1.x, s2.x);

is putting the wrong values for s2 onto the call stack.
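
Both addresses printed above end in hex 0, which is what 16-byte alignment looks like. A small helper makes that check explicit; is_aligned16 is a hypothetical name used only for illustration.

```c
#include <stdint.h>

/* Returns 1 if p is a multiple of 16, 0 otherwise. */
static int is_aligned16(const void *p)
{
  return ((uintptr_t)p & 15u) == 0;
}
```

Calling it on &s1 and &s2 inside test would turn the visual inspection of the %p output into a yes/no answer.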

Bizarre.  Either I'm missing something or turning off SSE2 is uncovering
a compiler bug.

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: Method to disable code SSE2 generation but still use -msse2

2010-11-23 Thread David Mathog
I renamed the test case gccprob.c and made two binaries and two
assembler files:

gcc -Wall -msse -mno-sse2  -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG \
   -O0  -o gccprob_wno gccprob.c
gcc -Wall -msse -mno-sse2  -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG  \
   -O0 -S  -o gccprob_wno.s gccprob.c
gcc -Wall -msse -msse2  -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG \
   -O0 -S  -o gccprob_nono.s gccprob.c
gcc -Wall -msse -msse2  -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG \
   -O0  -o gccprob_nono gccprob.c

The _wno variants have the problem passing __m128d on the stack,
the _nono variants do not.

I packed up all 5 files and put them here (the pickup directory is
retrieve-only, with no directory listings):

http://saf.bio.caltech.edu/pub/pickup/gccprob.tar.gz

I am not an assembler programmer.  If one of you who is could have a
look at the two .s files maybe we can get to the bottom of this.

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: Method to disable code SSE2 generation but still use -msse2

2010-11-23 Thread David Mathog
The problem is specific to 64 bit environments.  I made these:

gcc -Wall -msse -mno-sse2  -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG \
   -O0 -m32 -S  -o gccprob_wno32.s gccprob.c
gcc -Wall -msse -mno-sse2  -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG  \
   -O0 -m32  -o gccprob_wno32 gccprob.c
gcc -Wall -msse -msse2  -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG  \
   -O0 -m32  -o gccprob_nono32 gccprob.c
gcc -Wall -msse -msse2  -I. -lm -DSOFT_SSE2 -DEMMSOFTDBG  \
   -O0 -m32 -S  -o gccprob_nono32.s gccprob.c

and both binaries work correctly.  Added them to the set here:

http://saf.bio.caltech.edu/pub/pickup/gccprob.tar.gz

Specifics on the environment where the problem is seen:

OS:  Mandriva Linux release 2010.0 (Official) for x86_64
gcc (GCC) 4.4.1
Dual Dual Core Opteron 280. 
Arima HDAMAI motherboard.
64 bit targets only, 32 bit is OK.

Regards,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: Method to test all sse2 calls?

2010-11-23 Thread David Mathog
What is:

  __builtin_ia32_vec_ext_v2df

???  It wasn't in the original emmintrin.h, so presumably isn't actually
part of SSE2, but it is present in the testsuite, and it is not visible
to the compiler when -mno-sse2 is set. See for instance the files
sse2-vec-#.c.  (Randomly selected) Example:

sse2-vec-4.c:  res[2] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 2);

 gcc -Wall -msse -mno-sse2 -I. -m32 -lm -DSOFT_SSE2 -o foo sse2-vec-4.c
sse2-vec-4.c: In function 'sse2_test':
sse2-vec-4.c:27: warning: implicit declaration of function
'__builtin_ia32_vec_ext_v8hi'
/root/tmp/ccYAq3IB.o: In function `sse2_test':
sse2-vec-4.c:(.text+0x58c): undefined reference to
`__builtin_ia32_vec_ext_v8hi'
.
.
.
/root/tmp/ccYAq3IB.o:sse2-vec-4.c:(.text+0x613): more undefined
references to `__builtin_ia32_vec_ext_v8hi' follow
collect2: ld returned 1 exit status

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: Method to test all sse2 calls?

2010-11-24 Thread David Mathog
Ian Lance Taylor wrote:

> Tests that directly invoke __builtin functions are not appropriate for
> your replacement for emmintrin.h.

Clearly.  However, I do not see why these are in the test routines in
the first place.  They seem not to be needed.  I made the changes below
my signature, eliminating all of the vector builtins, and the programs
still worked with both -msse2 and -mno-sse2 plus my software SSE2.  If
anything the test programs are much easier to understand without the
builtins.

There is also a (big) problem with sse2-vec-2.c (and -2a, which is empty
other than an #include sse2-vec-2.c).  There are no explicit sse2
operations within this test program.  Moreover, the code within the
tests does not work.  Finally, if one puts a print statement anywhere in
the test that is there, compiles it with:

 gcc -msse -msse2

there will be no warnings, and the run will appear to show a valid test,
but in actuality the test will never execute! This shows part of the
problem:

gcc -Wall -msse -msse2 -o foo sse2-vec-2.c
sse-os-support.h:27: warning: 'sse_os_support' defined but not used
sse2-check.h:10: warning: 'do_test' defined but not used

(The same happens with -m64.)  There must be some sort of main in there,
but no test: it does nothing and returns a valid status.

When stuffed with debug statements:

  for (i = 0; i < 2; i++)
masks[i] = i;

printf("DEBUG res[0] %llX\n",res[0]);
printf("DEBUG res[1] %llX\n",res[1]);
printf("DEBUG val1.ll[0] %llX\n",val1.ll[0]);
printf("DEBUG val1.ll[1] %llX\n",val1.ll[1]);
  for (i = 0; i < 2; i++)
if (res[i] != val1.ll [masks[i]]){
printf("DEBUG i %d\n",i);
printf("DEBUG masks[i] %d\n",masks[i]);
printf("DEBUG val1.ll [masks[i]] %llX\n", val1.ll [masks[i]]);
  abort ();
}

and compiled with my software SSE2 

gcc -Wall -msse -mno-sse2 -I. -O0 -m32 -lm -DSOFT_SSE2 -DEMMSOFTDBG -o
foo   sse2-vec-2.c

It emits:

DEBUG res[0] 3020100
DEBUG res[1] 7060504
DEBUG val1.ll[0] 706050403020100
DEBUG val1.ll[1] F0E0D0C0B0A0908
DEBUG i 0
DEBUG masks[i] 0
DEBUG val1.ll [masks[i]] 706050403020100
Aborted

True enough 3020100 != 706050403020100, but what kind of test
is that???

Regards,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

changes to sse2-vec-*.c routines to eliminate all of the __builtin
calls:

ls -1 sse2-vec*dist | grep -v vec-2 | extract -cols 'diff --context=0
[1,-6] [1,]' | execinput
*** sse2-vec-1.c        2010-11-24 09:06:46.0 -0800
--- sse2-vec-1.c.dist   2010-11-24 09:06:39.0 -0800
***************
*** 27,28 ****
!   res[0] = val1.d[msk0];
!   res[1] = val1.d[msk1];
--- 27,28 ----
!   res[0] = __builtin_ia32_vec_ext_v2df ((__v2df)val1.x, msk0);
!   res[1] = __builtin_ia32_vec_ext_v2df ((__v2df)val1.x, msk1);
*** sse2-vec-3.c        2010-11-24 09:09:13.0 -0800
--- sse2-vec-3.c.dist   2010-11-24 09:07:48.0 -0800
***************
*** 27,30 ****
!   res[0] = val1.i[0];
!   res[1] = val1.i[1];
!   res[2] = val1.i[2];
!   res[3] = val1.i[3];
--- 27,30 ----
!   res[0] = __builtin_ia32_vec_ext_v4si ((__v4si)val1.x, 0);
!   res[1] = __builtin_ia32_vec_ext_v4si ((__v4si)val1.x, 1);
!   res[2] = __builtin_ia32_vec_ext_v4si ((__v4si)val1.x, 2);
!   res[3] = __builtin_ia32_vec_ext_v4si ((__v4si)val1.x, 3);
*** sse2-vec-4.c        2010-11-24 09:10:00.0 -0800
--- sse2-vec-4.c.dist   2010-11-24 09:07:48.0 -0800
***************
*** 27,34 ****
!   res[0] = val1.s[0];
!   res[1] = val1.s[1];
!   res[2] = val1.s[2];
!   res[3] = val1.s[3];
!   res[4] = val1.s[4];
!   res[5] = val1.s[5];
!   res[6] = val1.s[6];
!   res[7] = val1.s[7];
--- 27,34 ----
!   res[0] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 0);
!   res[1] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 1);
!   res[2] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 2);
!   res[3] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 3);
!   res[4] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 4);
!   res[5] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 5);
!   res[6] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 6);
!   res[7] = __builtin_ia32_vec_ext_v8hi ((__v8hi)val1.x, 7);
*** sse2-vec-5.c        2010-11-24 09:11:09.0 -0800
--- sse2-vec-5.c.dist   2010-11-24 09:07:48.0 -0800
***************
*** 27,42 ****
!   res[0] = val1.c[0];
!   res[1] = val1.c[1];
!   res[2] = val1.c[2];
!   res[3] = val1.c[3];
!   res[4] = val1.c[4];
!   res[5] = val1.c[5];
!   res[6] = val1.c[6];
!   res[7] = val1.c[7];
!   res[8] = val1.c[8];
!   res[9] = val1.c[9];
!   res[10] = val1.c[10];
!   res[11] = val1.c[11];
!   res[12] = val1.c[12];
!   res[13] = val1.c[13];
!   res[14] = val1.c[14];
!   res[15] = val1.c[15];
--- 27,42 ----
!   res[0] = __builtin_ia32_vec_ext_v16qi ((__v16qi)val1.x, 0);
!   res[1] = __builtin_ia32_vec_ext_v16qi ((__v16qi)val1.

Re: Method to test all sse2 calls?

2010-11-24 Thread David Mathog
Ian Lance Taylor wrote:

> Your changes are relying on a gcc extension which was only recently
> added, more recently than those tests were added to the testsuite.  Only
> recently did gcc acquire the ability to use [] to access elements in a
> vector. 

That isn't what my changes did. The array accesses are to the arrays in
the union - nothing cutting edge there.  The data is accessed through
the array specified by .d (or .s etc.), not through name.x[index].


> So I think you may have misinterpreted the __builtin_ia32_vec_ext_v2di
> builtin function.  That function treats the vector as containing two
> 8-byte integers, and pulls out one or the other depending on the second
> argument.  Your dumps of res[0] and res[1] suggest that you are treating
> the vector as four 4-byte integers and pulling out specific ones.

Yup, my bad, put in d where it should have been ll.  Also fixed the
problem I induced in sse2-check.h, where too large a chunk was commented
out, that was causing the gcc -Wall -msse2 problem.  The changed part in
the original source was

  if ((edx & bit_SSE2) && sse_os_support ())

and is now:

#if !defined(SOFT_SSE2)
  if ((edx & bit_SSE2) && sse_os_support ())
#else
  if (sse_os_support ())
#endif /*SOFT_SSE2*/

My software SSE2 passes all 165 of the sse2 tests that are complete
programs.
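
The mix-up admitted above is easy to reproduce: the same 16 bytes look different depending on the lane width used to read them. This is a sketch assuming a little-endian target, with lanes128 and fill_bytes as hypothetical names standing in for the testsuite's union types and setup code. The resulting values match the debug output shown in the previous message.

```c
#include <stdint.h>

/* One 16-byte block viewed at four different lane widths. */
typedef union {
  uint8_t  c[16];
  uint16_t s[8];
  uint32_t i[4];
  uint64_t ll[2];
} lanes128;

/* Fills the block with the byte values 0x00 .. 0x0F. */
static void fill_bytes(lanes128 *u)
{
  for (int k = 0; k < 16; k++)
    u->c[k] = (uint8_t)k;
}
```

After fill_bytes, i[0] reads as 0x03020100 while ll[0] reads as 0x0706050403020100, so a comparison between the two widths can never succeed.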

However, there is a problem in the real world.  While the sse2 programs
in the testsuite do exercise the _mm* functions, they do so one at a
time.  I have found that in real code, which makes multiple _mm* calls,
if -O0 is not used, the wrong results (may) come out.  

% gcc -std=gnu99 -g -pg -pthread -O0 -msse -mno-sse2 -DSOFT_SSE2 -m32 -g
-pg -DHAVE_CONFIG_H -L../../easel -L.. -L. -I../../easel -I../../easel
-I. -I.. -I. -I../../src -Dp7MSVFILTER_TESTDRIVE -o msvfilter_utest
./msvfilter.c -Wl,--start-group -lhmmer -lhmmerimpl -Wall
-Wl,--end-group -leasel -lm
% ./msvfilter_utest
(no output, it ran correctly)

% gcc -std=gnu99 -g -pg -pthread -O1 -msse -mno-sse2 -DSOFT_SSE2 -m32 -g
-pg -DHAVE_CONFIG_H -L../../easel -L.. -L. -I../../easel -I../../easel
-I. -I.. -I. -I../../src -Dp7MSVFILTER_TESTDRIVE -o msvfilter_utest
./msvfilter.c -Wl,--start-group -lhmmer -lhmmerimpl -Wall
-Wl,--end-group -leasel -lm
% ./msvfilter_utest
msv filter unit test failed: scores differ (-50.37, -10.86)

Going to higher optimization, there are even bigger issues, like not
compiling at all (even with gcc 4.4.1):

% gcc -std=gnu99 -g -pg -pthread -O2 -msse -mno-sse2 -DSOFT_SSE2 -m32 -g
-pg -DHAVE_CONFIG_H -L../../easel -L.. -L. -I../../easel -I../../easel
-I. -I.. -I. -I../../src -Dp7MSVFILTER_TESTDRIVE -o msvfilter_utest
./msvfilter.c -Wl,--start-group -lhmmer -lhmmerimpl -Wall
-Wl,--end-group -leasel -lm
../../easel/emmintrin.h:2178: warning: dereferencing pointer
'({anonymous})' does break strict-aliasing rules
../../easel/emmintrin.h:2178: note: initialized from here
.
.  (same sort of message many many times)
.
./msvfilter.c:208: error: unable to find a register to spill in class
'GENERAL_REGS'
./msvfilter.c:208: error: this is the insn:
(insn 1944 1943 1945 46 ../../easel/emmintrin.h:2348 (set
(strict_low_part (subreg:HI (reg:TI 1239) 0))
(mem:HI (reg/f:SI 96 [ pretmp.1031 ]) [13 S2 A16])) 47
{*movstricthi_1} (nil))
./msvfilter.c:208: confused by earlier errors, bailing out

Would changing the use of inlined functions to defines let the compiler
digest it better?  For instance:

static __inline __m128i __attribute__((__always_inline__))
_mm_andnot_si128 (__m128i __A, __m128i __B)
{
  return (~__A) & __B;
}

becomes

#define _mm_andnot_si128(A,B)  (~(A) & (B))

That approach will get really messy for the more complicated _mm*.

In general terms, can somebody give me a hint as to the sorts of things
that if found in inlined functions might cause the compiler to optimize
to invalid code?
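
The strict-aliasing warnings in the -O2 output above suggest one likely suspect: reading vector storage through a pointer of an unrelated type, which allows the optimizer to assume the two accesses do not refer to the same object. A commonly used alternative, sketched here under the assumption of a little-endian target, copies the bytes out with memcpy. lane_u16 is a hypothetical helper, not part of soft_emmintrin.h.

```c
#include <stdint.h>
#include <string.h>

/* Returns 16-bit lane number `lane` (0..7) of the 16-byte block at vec.
   memcpy is used instead of casting vec to unsigned short *, so the
   access does not depend on the declared type of the object. */
static uint16_t lane_u16(const void *vec, int lane)
{
  uint16_t value;
  memcpy(&value, (const unsigned char *)vec + 2 * lane, sizeof value);
  return value;
}
```

GCC's -fno-strict-aliasing option is another way to test whether aliasing is the cause: if the failures disappear with it, the pointer casts are the likely culprit.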


Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


best data extraction for SSE2 emulation?

2010-11-29 Thread David Mathog
In my software SSE2 emulation I am currently using this sort of approach
to extract data fields out of __m128i and __m128d vectors:

#define EMM_SINT4(a)   ((int *)&(a))

static __inline __m128i __attribute__((__always_inline__))
_mm_slli_epi32 (__m128i __A, int __B)
{
  __v4si __tmp = {
EMM_SINT4(__A)[0] << __B,
EMM_SINT4(__A)[1] << __B,
EMM_SINT4(__A)[2] << __B,
EMM_SINT4(__A)[3] << __B};
  return (__m128i)__tmp;
}

This works fine when testing one _mm function at a time, but does not
work reliably in real programs unless -O0 is used.  I think at least
part of the problem is that once the function is inlined the parameter
__A is in some cases a register variable, and the pointer method is not
valid there.  To get around that I'm thinking of introducing an explicit
local variable, like this:

static __inline __m128i __attribute__((__always_inline__))
_mm_slli_epi32 (__m128i __A, int __B)
{
__m128i A = __A;
  __v4si __tmp = {
EMM_SINT4(A)[0] << __B,
EMM_SINT4(A)[1] << __B,
EMM_SINT4(A)[2] << __B,
EMM_SINT4(A)[3] << __B};
  return (__m128i)__tmp;
}

I'm not sure that will work all the time either.  The only other
approach I am aware of would be something like this:

typedef union {
  __m128i   vi;
  __m128d   vd;
  int   s[4];
  unsigned int  us[4];
  /* etc. for other types */
} emm_universal ;

#define EMM_SINT4(a)   (a).s

static __inline __m128i __attribute__((__always_inline__))
_mm_slli_epi32 (__m128i __A, int __B)
{
emm_universal A;
  A.vi = __A;
 __v4si __tmp = {
EMM_SINT4(A)[0] << __B,
EMM_SINT4(A)[1] << __B,
EMM_SINT4(A)[2] << __B,
EMM_SINT4(A)[3] << __B};
  return (__m128i)__tmp;
}

The union approach seems to be just a different way to spin the
pointer operations.  For gcc in particular, is one approach or the other
to be preferred and why?
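
GCC documents reading a union member other than the one last written as permitted type punning, even with -fstrict-aliasing, so the union form is generally considered the safer of the two under GCC. The sketch below is one way to write it. The names soft_vec and soft_slli_epi32 are hypothetical; the lanes are shifted as unsigned values to avoid signed-overflow problems, and counts outside 0..31 produce zero, which is how the hardware instruction treats counts above 31.

```c
#include <stdint.h>

/* GCC/Clang vector extension: four 32-bit ints in 16 bytes. */
typedef int v4si_t __attribute__((vector_size(16)));

/* The vector and its lanes share storage; lanes are read via the union. */
typedef union {
  v4si_t   v;
  uint32_t u[4];
  int32_t  s[4];
} soft_vec;

/* Shifts each 32-bit lane left by count, filling with zeros. */
static v4si_t soft_slli_epi32(v4si_t a, int count)
{
  soft_vec in, out;
  in.v = a;
  for (int k = 0; k < 4; k++)
    out.u[k] = (count >= 0 && count < 32) ? in.u[k] << count : 0u;
  return out.v;
}
```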

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


vector extension bug?

2010-11-29 Thread David Mathog
I tried to track down the bug mentioned previously in testing my
software SSE2 when compiled with -m64 and ended up removing all 
of the CHECK and my own includes without eliminating the bug.  The test
program works fine with -m32, or with -m64 -msse2, but it fails with
-m64 -mno-sse2.  Here is the greatly reduced gccprob2.c:

8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<
#include <stdio.h> /* for printf */
typedef double __m128d __attribute__ ((__vector_size__ (16),
__may_alias__));
typedef union
{
  __m128d x;
  double a[2];
} union128d;
#define EMM_FLT8(a)((double *)&(a))

void test ( __m128d s1, __m128d s2)
{
printf("test s1 %lf %lf\n",EMM_FLT8(s1)[0],EMM_FLT8(s1)[1]);
printf("test s2 %lf %lf\n",EMM_FLT8(s2)[0],EMM_FLT8(s2)[1]);
}

int main (void)
{
__attribute__ ((aligned (16)))  union128d s1;
  s1.a[0] = 1.0;
  s1.a[1] = 2.0;
printf("s1  %lf %lf\n",s1.a[0],s1.a[1]);
  test (s1.x, s1.x);
}
8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<8<

Test runs:

% gcc -msse -mno-sse2 -m64   -o foo gccprob2.c
% ./foo  #first value in s2 is wrong
s1  1.00 2.00
test s1 1.00 2.00
test s2 2.00 2.00
% gcc -msse -msse2 -m64   -o foo gccprob2.c
% ./foo
s1  1.00 2.00
test s1 1.00 2.00
test s2 1.00 2.00
% gcc -msse -mno-sse2 -lm -m32   -o foo gccprob2.c
% ./foo
s1  1.00 2.00
test s1 1.00 2.00
test s2 1.00 2.00
% gcc --version
gcc (GCC) 4.4.1
% cat /etc/release
Mandriva Linux release 2010.0 (Official) for x86_64
% cat /proc/cpuinfo | head -10 
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 15
model   : 33
model name  : Dual Core AMD Opteron(tm) Processor 280
stepping: 2
cpu MHz : 1000.000
cache size  : 1024 KB
physical id : 0
siblings: 2

Is there something wrong with this program or is this a compiler bug?

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Strangeness with SSE(1)

2011-01-20 Thread David Mathog
I'm seeing the oddest thing with a function compiled like:

mpicc -std=gnu99 -O1 -g -m32 -pthread -msse -mno-sse2  -DHAVE_CONFIG_H 
-I../../easel -I../../easel -I. -I.. -I. -I../../src -o fwdback.o -c
fwdback.c

using both gcc versions

gcc (GCC) 4.4.1  (on a 64 bit linux)
gcc (GCC) 4.2.3 (4.2.3-6mnb1) (on a 32 bit linux)
on OMPI 1.4.3. 

The compilers are on Opterons, the worker node where it fails is
an Athlon MP.  (There shouldn't be any difference with SSE2 turned off by -mno-sse2, right?)

Basically it comes down to (many lines of code omitted)

  register __m128 xEv;
fprintf(stderr,"DEBUG0 xEV %lld\n",xEv);fflush(stderr);
  xEv   = _mm_setzero_ps();
fprintf(stderr,"DEBUGB xEV %lld\n",xEv);fflush(stderr); /* problem */

throwing an error when run in Valgrind in a particular program at the
second printf.

==13053== Conditional jump or move depends on uninitialised value(s)
==13053==at 0x4BE50BC: vfprintf (in /lib/libc-2.10.1.so)
==13053==by 0x4BE9411: ??? (in /lib/libc-2.10.1.so)
==13053==by 0x4BE4492: vfprintf (in /lib/libc-2.10.1.so)
==13053==by 0x4BEE7CE: fprintf (in /lib/libc-2.10.1.so)
==13053==by 0x807FC19: forward_engine (fwdback.c:305) <
==13053==by 0x8080289: p7_ForwardParser (fwdback.c:143)
==13053==by 0x8075B08: p7_Tau (evalues.c:442)
==13053==by 0x8076554: p7_Calibrate (evalues.c:109)
==13053==by 0x8061815: calibrate (p7_builder.c:629)
==13053==by 0x80618D6: p7_SingleBuilder (p7_builder.c:393)
==13053==by 0x80570C9: main (jackhmmer.c:1068)

How can xEv possibly be uninitialized in that position?  Note the
problem initially manifested much further down in the code here

  _mm_store_ss(&xE, xEv);

As far as Valgrind is concerned it starts right after the _mm_setzero_ps().

After my signature is that function from the start down to the DEBUGB
line with all lines present - sorry about the wrap:

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

-

static int
forward_engine(int do_full, const ESL_DSQ *dsq, int L, const P7_OPROFILE *om, P7_OMX *ox, float *opt_sc)
{
  register __m128 mpv, dpv, ipv;   /* previous row values */
  register __m128 sv;              /* temp storage of 1 curr row value in progress */
  register __m128 dcv;             /* delayed storage of D(i,q+1) */
  register __m128 xEv;             /* E state: keeps max for Mk->E as we go */
  register __m128 xBv;             /* B state: splatted vector of B[i-1] for B->Mk calculations */
  __m128   zerov;                  /* splatted 0.0's in a vector */
  float    xN, xE, xB, xC, xJ;     /* special states' scores */
  int i;                           /* counter over sequence positions 1..L */
  int q;                           /* counter over quads 0..nq-1 */
  int j;                           /* counter over DD iterations (4 is full serialization) */
  int Q       = p7O_NQF(om->M);    /* segment length: # of vectors */
  __m128 *dpc = ox->dpf[0];        /* current row, for use in {MDI}MO(dpp,q) access macro */
  __m128 *dpp;                     /* previous row, for use in {MDI}MO(dpp,q) access macro */
  __m128 *rp;                      /* will point at om->rfv[x] for residue x[i] */
  __m128 *tp;                      /* will point into (and step thru) om->tfv */

  /* Initialization. */
  ox->M  = om->M;
  ox->L  = L;
  ox->has_own_scales = TRUE;  /* all forward matrices control their own scalefactors */
  zerov  = _mm_setzero_ps();
  for (q = 0; q < Q; q++)
MMO(dpc,q) = IMO(dpc,q) = DMO(dpc,q) = zerov;
  xE= ox->xmx[p7X_E] = 0.;
  xN= ox->xmx[p7X_N] = 1.;
  xJ= ox->xmx[p7X_J] = 0.;
  xB= ox->xmx[p7X_B] = om->xf[p7O_N][p7O_MOVE];
  xC= ox->xmx[p7X_C] = 0.;

  ox->xmx[p7X_SCALE] = 1.0;
  ox->totscale   = 0.0;

#if p7_DEBUGGING
  if (ox->debugging) p7_omx_DumpFBRow(ox, TRUE, 0, 9, 5, xE, xN, xJ,
xB, xC);/* logify=TRUE, =0, width=8, precision=5*/
#endif

  for (i = 1; i <= L; i++)
{
fprintf(stderr,"DEBUGA i %d\n",i);fflush(stderr);
  dpp   = dpc;  
  dpc   = ox->dpf[do_full * i]; /* avoid conditional, use do_full as kronecker delta */
  rp= om->rfv[dsq[i]];
  tp= om->tfv;
  dcv   = _mm_setzero_ps();
  xEv   = _mm_setzero_ps();
fprintf(stderr,"DEBUGB xEV %lld\n",xEv);fflush(stderr);



Re: Strangeness with SSE(1)

2011-01-21 Thread David Mathog
Can somebody please explain the behavior of the following program
to me?

cat >test.c <<EOD
#include 
#include 

int main(void){
  register __m128 var;
  fprintf(stdout,"pre   %X\n",var);
  var   = _mm_setzero_ps();
  fprintf(stdout,"post  %X\n",var);
  fprintf(stdout,"zerof %X\n",0.0f);
  exit(EXIT_SUCCESS);
}
EOD
gcc -O0 -g -std=gnu99 -msse -mno-sse2 -m32 -o test test.c
./test
pre   FFC5FC98
post  FFC5FC98
zerof 0


gcc --version
gcc (GCC) 4.4.1

Now I know that var is 16 bytes, not 4, so the %X isn't appropriate to
show all of it, but apparently it doesn't show any of it, since any 4
bytes out of the 16 should have been 0.

Plus if this is run in valgrind there are

  Conditional jump or move depends on uninitialised value(s)

at both of the print statements that access var.

Adding -Wall
test.c:7: warning: format '%X' expects type 'unsigned int', but argument
3 has type '__m128'
test.c:9: warning: format '%X' expects type 'unsigned int', but argument
3 has type '__m128'
test.c:10: warning: format '%X' expects type 'unsigned int', but
argument 3 has type 'double'
test.c:7: warning: 'var' is used uninitialized in this function

But the last one goes away if the first fprintf is commented out.  So
gcc passes _something_ for var to fprintf, but what?
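
Passing a 16-byte vector where %X expects an unsigned int is undefined behavior, which is what the -Wall warnings above point at, so the printed value need not correspond to any part of var. A sketch of a safer way to inspect the contents, assuming GCC's vector_size extension and IEEE 754 floats: copy the lanes out and print each one. format_v4sf_bits and v4sf_t are hypothetical names.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* GCC/Clang vector extension: four floats in 16 bytes. */
typedef float v4sf_t __attribute__((vector_size(16)));

/* Writes the four 32-bit lanes of *v into buf as hex words.
   Returns what snprintf returns: the length of the formatted text. */
static int format_v4sf_bits(char *buf, size_t n, const v4sf_t *v)
{
  uint32_t bits[4];
  memcpy(bits, v, sizeof bits);
  return snprintf(buf, n, "%08X %08X %08X %08X",
                  (unsigned)bits[0], (unsigned)bits[1],
                  (unsigned)bits[2], (unsigned)bits[3]);
}
```

A vector of zeros then prints as four words of 00000000, which is the result the test program was presumably hoping to see after _mm_setzero_ps().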

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech