On the x86_64, does one have to zero a vector register before filling it completely ?

2009-11-28 Thread Toon Moene

L.S.,

Due to the discussion on register allocation, I went back to a hobby of 
mine: Studying the assembly output of the compiler.


For this Fortran subroutine (note: unless otherwise told to the Fortran 
front end, reals are 32 bit floating point numbers):


  subroutine sum(a, b, c, n)
  integer i, n
  real a(n), b(n), c(n)
  do i = 1, n
     c(i) = a(i) + b(i)
  enddo
  end

with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:

        xorps   %xmm2, %xmm2

.L6:
        movaps  %xmm2, %xmm0
        movaps  %xmm2, %xmm1
        movlps  (%r9,%rax), %xmm0
        movlps  (%r8,%rax), %xmm1
        movhps  8(%r9,%rax), %xmm0
        movhps  8(%r8,%rax), %xmm1
        incl    %ecx
        addps   %xmm1, %xmm0
        movaps  %xmm0, 0(%rbp,%rax)
        addq    $16, %rax
        cmpl    %ebx, %ecx
        jb      .L6

I'm not a master of x86_64 assembly, but this strongly looks like 
%xmm{0,1} have to be zero'd (%xmm2 is set to zero by xor'ing it with 
itself), before they are completely filled with the mov{l,h}ps 
instructions ?


Am I missing something ?

[ BTW, the induction variable %ecx could have been eliminated,
  because %rax also counts upwards (but 16 at a time instead of 1) ]

Thanks for any insight,

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html


Re: On the x86_64, does one have to zero a vector register before filling it completely ?

2009-11-28 Thread H.J. Lu
On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene  wrote:
> L.S.,
>
> Due to the discussion on register allocation, I went back to a hobby of
> mine: Studying the assembly output of the compiler.
>
> For this Fortran subroutine (note: unless otherwise told to the Fortran
> front end, reals are 32 bit floating point numbers):
>
>      subroutine sum(a, b, c, n)
>      integer i, n
>      real a(n), b(n), c(n)
>      do i = 1, n
>         c(i) = a(i) + b(i)
>      enddo
>      end
>
> with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
>
>        xorps   %xmm2, %xmm2
>        
> .L6:
>        movaps  %xmm2, %xmm0
>        movaps  %xmm2, %xmm1
>        movlps  (%r9,%rax), %xmm0
>        movlps  (%r8,%rax), %xmm1
>        movhps  8(%r9,%rax), %xmm0
>        movhps  8(%r8,%rax), %xmm1
>        incl    %ecx
>        addps   %xmm1, %xmm0
>        movaps  %xmm0, 0(%rbp,%rax)
>        addq    $16, %rax
>        cmpl    %ebx, %ecx
>        jb      .L6
>
> I'm not a master of x86_64 assembly, but this strongly looks like %xmm{0,1}
> have to be zero'd (%xmm2 is set to zero by xor'ing it with itself), before
> they are completely filled with the mov{l,h}ps instructions ?
>

I think it is used to avoid partial SSE register stall.
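
As a rough illustration of the two load strategies under discussion, here is a sketch in C using SSE intrinsics. The function names (load_split, load_whole, sum4) are made up for this example, and the mapping from intrinsics to the instructions in the listing is only approximate, since the compiler is free to pick other encodings. Assuming the partial-register reading is right, the zeroing is there to break a dependency on the old register contents, not because the result would otherwise be wrong.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; x86/x86_64 only */

/* Split load: start from a zeroed register, then fill the low and high
   halves separately (roughly xorps/movaps + movlps + movhps).  All 128
   bits end up overwritten, so the zero is never visible in the result. */
static __m128 load_split(const float *p)
{
    __m128 v = _mm_setzero_ps();
    v = _mm_loadl_pi(v, (const __m64 *)p);        /* elements 0..1 */
    v = _mm_loadh_pi(v, (const __m64 *)(p + 2));  /* elements 2..3 */
    return v;
}

/* Single unaligned load (roughly movups). */
static __m128 load_whole(const float *p)
{
    return _mm_loadu_ps(p);
}

/* c[i] = a[i] + b[i], four floats at a time; n must be a multiple of 4.
   use_split selects which load strategy is exercised. */
void sum4(const float *a, const float *b, float *c, size_t n, int use_split)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 x = use_split ? load_split(a + i) : load_whole(a + i);
        __m128 y = use_split ? load_split(b + i) : load_whole(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(x, y));
    }
}
```

Both strategies produce the same values; they differ only in how the 128 bits get into the register.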


-- 
H.J.


Re: On the x86_64, does one have to zero a vector register before filling it completely ?

2009-11-28 Thread Toon Moene

H.J. Lu wrote:

On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene  wrote:

L.S.,

Due to the discussion on register allocation, I went back to a hobby of
mine: Studying the assembly output of the compiler.

For this Fortran subroutine (note: unless otherwise told to the Fortran
front end, reals are 32 bit floating point numbers):

 subroutine sum(a, b, c, n)
 integer i, n
 real a(n), b(n), c(n)
 do i = 1, n
    c(i) = a(i) + b(i)
 enddo
 end

with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:

   xorps   %xmm2, %xmm2
   
.L6:
   movaps  %xmm2, %xmm0
   movaps  %xmm2, %xmm1
   movlps  (%r9,%rax), %xmm0
   movlps  (%r8,%rax), %xmm1
   movhps  8(%r9,%rax), %xmm0
   movhps  8(%r8,%rax), %xmm1
   incl    %ecx
   addps   %xmm1, %xmm0
   movaps  %xmm0, 0(%rbp,%rax)
   addq    $16, %rax
   cmpl    %ebx, %ecx
   jb      .L6

I'm not a master of x86_64 assembly, but this strongly looks like %xmm{0,1}
have to be zero'd (%xmm2 is set to zero by xor'ing it with itself), before
they are completely filled with the mov{l,h}ps instructions ?



I think it is used to avoid partial SSE register stall.


You mean there's no movaps (%r9,%rax), %xmm0 (and mutatis mutandis for 
%xmm1) instruction (to copy 4*32 bits to the register) ?




i370 port - music/sp - possible generic gcc problem

2009-11-28 Thread Paul Edwards

I think I have found a bug in gcc, that still exists in gcc 4.4

I found the problem on 3.2.3 though.

While MVS and VM have basically been working fine, when I did
the port to MUSIC/SP I started getting strange compilation failures.

Initializing the stack to NULs made the problem go away, but I
persisted, and instead initialized the stack to x'01' to try to get
consistent failures.

Even with that, and the malloced memory initialized to consistent
garbage, I still didn't get the desired consistency in failures.
But it was consistent enough that I could just run it 6 times and
see if there were any failures.


Anyway, I tracked down the particular malloc() which gave changed
behaviour depending on whether the malloc() did a memory initialization
to NULs or not.

It's this one (showing 3.2.3):

C:\devel\gcc\gcc>cvs diff -l -c15
cvs diff: Diffing .
Index: ggc-page.c
===================================================================
RCS file: c:\cvsroot/gcc/gcc/ggc-page.c,v
retrieving revision 1.1.1.1
diff -c -1 -5 -r1.1.1.1 ggc-page.c
*** ggc-page.c  15 Feb 2006 10:22:25 -  1.1.1.1
--- ggc-page.c  28 Nov 2009 14:13:41 -
***************
*** 640,670 ****
  #ifdef USING_MALLOC_PAGE_GROUPS
    else
      {
        /* Allocate a large block of memory and serve out the aligned
           pages therein.  This results in much less memory wastage
           than the traditional implementation of valloc.  */

        char *allocation, *a, *enda;
        size_t alloc_size, head_slop, tail_slop;
        int multiple_pages = (entry_size == G.pagesize);

        if (multiple_pages)
          alloc_size = GGC_QUIRE_SIZE * G.pagesize;
        else
          alloc_size = entry_size + G.pagesize - 1;
!       allocation = xmalloc (alloc_size);

        page = (char *) (((size_t) allocation + G.pagesize - 1) & -G.pagesize);
        head_slop = page - allocation;
        if (multiple_pages)
          tail_slop = ((size_t) allocation + alloc_size) & (G.pagesize - 1);
        else
          tail_slop = alloc_size - entry_size - head_slop;
        enda = allocation + alloc_size - tail_slop;

        /* We allocated N pages, which are likely not aligned, leaving
           us with N-1 usable pages.  We plan to place the page_group
           structure somewhere in the slop.  */
        if (head_slop >= sizeof (page_group))
          group = (page_group *)page - 1;
        else
--- 640,670 ----
  #ifdef USING_MALLOC_PAGE_GROUPS
    else
      {
        /* Allocate a large block of memory and serve out the aligned
           pages therein.  This results in much less memory wastage
           than the traditional implementation of valloc.  */

        char *allocation, *a, *enda;
        size_t alloc_size, head_slop, tail_slop;
        int multiple_pages = (entry_size == G.pagesize);

        if (multiple_pages)
          alloc_size = GGC_QUIRE_SIZE * G.pagesize;
        else
          alloc_size = entry_size + G.pagesize - 1;
!       allocation = xcalloc (1, alloc_size);

        page = (char *) (((size_t) allocation + G.pagesize - 1) & -G.pagesize);
        head_slop = page - allocation;
        if (multiple_pages)
          tail_slop = ((size_t) allocation + alloc_size) & (G.pagesize - 1);
        else
          tail_slop = alloc_size - entry_size - head_slop;
        enda = allocation + alloc_size - tail_slop;

        /* We allocated N pages, which are likely not aligned, leaving
           us with N-1 usable pages.  We plan to place the page_group
           structure somewhere in the slop.  */
        if (head_slop >= sizeof (page_group))
          group = (page_group *)page - 1;
        else
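
The alignment arithmetic in this hunk can be tried in isolation. The following is a small sketch in C of the multiple_pages branch only; the struct slop type and the carve_pages function are hypothetical names for this example, and addresses are passed as plain integers so that no real memory is touched.

```c
#include <stddef.h>
#include <stdint.h>

struct slop {
    uintptr_t page;      /* first page-aligned address in the block */
    uintptr_t enda;      /* end of the last whole page */
    size_t head_slop;    /* bytes wasted before page */
    size_t tail_slop;    /* bytes wasted after enda */
};

/* Mirrors the multiple_pages branch: allocation is the raw block
   address, alloc_size its length, pagesize a power of two. */
struct slop carve_pages(uintptr_t allocation, size_t alloc_size,
                        size_t pagesize)
{
    struct slop s;
    /* Round up to the next page boundary; for an unsigned power of two,
       -pagesize is the same mask as ~(pagesize - 1). */
    s.page = (allocation + pagesize - 1) & -(uintptr_t)pagesize;
    s.head_slop = s.page - allocation;
    s.tail_slop = (allocation + alloc_size) & (pagesize - 1);
    s.enda = allocation + alloc_size - s.tail_slop;
    return s;
}
```

With a block of four 4096-byte pages starting at the unaligned address 0x1003, this yields three whole pages between page and enda, which matches the "N pages allocated, N-1 usable" remark in the comment.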


I suspect that it has stayed hidden for so long because most people
probably have this mmap_anon:

/* Define if mmap can get us zeroed pages using MAP_ANON(YMOUS). */
#undef HAVE_MMAP_ANON


#ifdef HAVE_MMAP_ANON
# undef HAVE_MMAP_DEV_ZERO

# include <sys/mman.h>
# ifndef MAP_FAILED
#  define MAP_FAILED -1
# endif
# if !defined (MAP_ANONYMOUS) && defined (MAP_ANON)
#  define MAP_ANONYMOUS MAP_ANON
# endif
# define USING_MMAP

#endif

#ifndef USING_MMAP
#define USING_MALLOC_PAGE_GROUPS
#endif


Seems pretty clear that zeroed pages are required.

Looking at the code above and below this block, it uses xcalloc
instead.

It will take a couple of days to confirm that this is the last
presumed bug affecting the MUSIC/SP port.

In the meantime, can someone confirm that this code is wrong,
and that xcalloc is definitely required?

I had a look at the gcc4 code, and it is calling XNEWVEC which
is also using xmalloc instead of xcalloc, so I presume it is still
a problem today, under the right circumstances.

It took about a week to track this one down.  :-)

One problem I have had for years, even on the MVS port, is that
I need to use -Os to get the exact same register selection on the
PC as the mainframe.  -O0 and -O2 get slightly different register
allocations, although all versions of the code are correct.  If I'm
lucky, this fix may make that problem go away, but as I said,
it'll take a couple of days before the results are

Re: On the x86_64, does one have to zero a vector register before filling it completely ?

2009-11-28 Thread Tim Prince

Toon Moene wrote:

H.J. Lu wrote:

On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene  wrote:

L.S.,

Due to the discussion on register allocation, I went back to a hobby of
mine: Studying the assembly output of the compiler.

For this Fortran subroutine (note: unless otherwise told to the Fortran
front end, reals are 32 bit floating point numbers):

 subroutine sum(a, b, c, n)
 integer i, n
 real a(n), b(n), c(n)
 do i = 1, n
    c(i) = a(i) + b(i)
 enddo
 end

with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:

   xorps   %xmm2, %xmm2
   
.L6:
   movaps  %xmm2, %xmm0
   movaps  %xmm2, %xmm1
   movlps  (%r9,%rax), %xmm0
   movlps  (%r8,%rax), %xmm1
   movhps  8(%r9,%rax), %xmm0
   movhps  8(%r8,%rax), %xmm1
   incl    %ecx
   addps   %xmm1, %xmm0
   movaps  %xmm0, 0(%rbp,%rax)
   addq    $16, %rax
   cmpl    %ebx, %ecx
   jb      .L6

I'm not a master of x86_64 assembly, but this strongly looks like %xmm{0,1}
have to be zero'd (%xmm2 is set to zero by xor'ing it with itself), before
they are completely filled with the mov{l,h}ps instructions ?



I think it is used to avoid partial SSE register stall.


You mean there's no movaps (%r9,%rax), %xmm0 (and mutatis mutandis for 
%xmm1) instruction (to copy 4*32 bits to the register) ?



If you want those, you must request them with -mtune=barcelona.


Re: i370 port - music/sp - possible generic gcc problem

2009-11-28 Thread Richard Guenther
On Sat, Nov 28, 2009 at 4:13 PM, Paul Edwards  wrote:
> I think I have found a bug in gcc, that still exists in gcc 4.4
>
> I found the problem on 3.2.3 though.
>
> While MVS and VM have basically been working fine, when I did
> the port to MUSIC/SP I started getting strange compilation failures.
>
> Initializing the stack to NULs made the problem go away, but I
> persisted, and instead initialized the stack to x'01' to try to get
> consistent failures.
>
> Even with that, and the malloced memory initialized to consistent
> garbage, I still didn't get the desired consistency in failures.
> But it was consistent enough that I could just run it 6 times and
> see if there were any failures.
>
>
> Anyway, I tracked down the particular malloc() which gave changed
> behaviour depending on whether the malloc() did a memory initialization
> to NULs or not.

Well, GC hands out non-zeroed memory - the callers are responsible
for initializing it.  So the fix below is not a fix but papering over an
issue elsewhere.

Richard.

> It's this one (showing 3.2.3):
>
> C:\devel\gcc\gcc>cvs diff -l -c15
> cvs diff: Diffing .
> Index: ggc-page.c
> ===================================================================
> RCS file: c:\cvsroot/gcc/gcc/ggc-page.c,v
> retrieving revision 1.1.1.1
> diff -c -1 -5 -r1.1.1.1 ggc-page.c
> *** ggc-page.c  15 Feb 2006 10:22:25 -      1.1.1.1
> --- ggc-page.c  28 Nov 2009 14:13:41 -
> ***************
> *** 640,670 ****
>  #ifdef USING_MALLOC_PAGE_GROUPS
>   else
>     {
>       /* Allocate a large block of memory and serve out the aligned
>        pages therein.  This results in much less memory wastage
>        than the traditional implementation of valloc.  */
>
>       char *allocation, *a, *enda;
>       size_t alloc_size, head_slop, tail_slop;
>       int multiple_pages = (entry_size == G.pagesize);
>
>       if (multiple_pages)
>       alloc_size = GGC_QUIRE_SIZE * G.pagesize;
>       else
>       alloc_size = entry_size + G.pagesize - 1;
> !       allocation = xmalloc (alloc_size);
>
>       page = (char *) (((size_t) allocation + G.pagesize - 1) &
> -G.pagesize);
>       head_slop = page - allocation;
>       if (multiple_pages)
>       tail_slop = ((size_t) allocation + alloc_size) & (G.pagesize - 1);
>       else
>       tail_slop = alloc_size - entry_size - head_slop;
>       enda = allocation + alloc_size - tail_slop;
>
>       /* We allocated N pages, which are likely not aligned, leaving
>        us with N-1 usable pages.  We plan to place the page_group
>        structure somewhere in the slop.  */
>       if (head_slop >= sizeof (page_group))
>       group = (page_group *)page - 1;
>       else
> --- 640,670 ----
>  #ifdef USING_MALLOC_PAGE_GROUPS
>   else
>     {
>       /* Allocate a large block of memory and serve out the aligned
>        pages therein.  This results in much less memory wastage
>        than the traditional implementation of valloc.  */
>
>       char *allocation, *a, *enda;
>       size_t alloc_size, head_slop, tail_slop;
>       int multiple_pages = (entry_size == G.pagesize);
>
>       if (multiple_pages)
>       alloc_size = GGC_QUIRE_SIZE * G.pagesize;
>       else
>       alloc_size = entry_size + G.pagesize - 1;
> !       allocation = xcalloc (1, alloc_size);
>
>       page = (char *) (((size_t) allocation + G.pagesize - 1) &
> -G.pagesize);
>       head_slop = page - allocation;
>       if (multiple_pages)
>       tail_slop = ((size_t) allocation + alloc_size) & (G.pagesize - 1);
>       else
>       tail_slop = alloc_size - entry_size - head_slop;
>       enda = allocation + alloc_size - tail_slop;
>
>       /* We allocated N pages, which are likely not aligned, leaving
>        us with N-1 usable pages.  We plan to place the page_group
>        structure somewhere in the slop.  */
>       if (head_slop >= sizeof (page_group))
>       group = (page_group *)page - 1;
>       else
>
>
> I suspect that it has stayed hidden for so long because most people
> probably have this mmap_anon:
>
> /* Define if mmap can get us zeroed pages using MAP_ANON(YMOUS). */
> #undef HAVE_MMAP_ANON
>
>
> #ifdef HAVE_MMAP_ANON
> # undef HAVE_MMAP_DEV_ZERO
>
> # include <sys/mman.h>
> # ifndef MAP_FAILED
> #  define MAP_FAILED -1
> # endif
> # if !defined (MAP_ANONYMOUS) && defined (MAP_ANON)
> #  define MAP_ANONYMOUS MAP_ANON
> # endif
> # define USING_MMAP
>
> #endif
>
> #ifndef USING_MMAP
> #define USING_MALLOC_PAGE_GROUPS
> #endif
>
>
> Seems pretty clear that zeroed pages are required.
>
> Looking at the code above and below this block, it uses xcalloc
> instead.
>
> It will take a couple of days to confirm that this is the last
> presumed bug affecting the MUSIC/SP port.
>
> In the meantime, can someone confirm that this code is wrong,
> and that xcalloc is definitely required?
>
> I had a look at the gcc4 code, and it is calling XNEWVEC which
> is also using xmalloc instead of xcalloc, so I presume it is still
> a problem today, under the right circumstances.
>
> It took

Re: On the x86_64, does one have to zero a vector register before filling it completely ?

2009-11-28 Thread Richard Guenther
On Sat, Nov 28, 2009 at 4:26 PM, Tim Prince  wrote:
> Toon Moene wrote:
>>
>> H.J. Lu wrote:
>>>
>>> On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene  wrote:

 L.S.,

 Due to the discussion on register allocation, I went back to a hobby of
 mine: Studying the assembly output of the compiler.

 For this Fortran subroutine (note: unless otherwise told to the Fortran
 front end, reals are 32 bit floating point numbers):

     subroutine sum(a, b, c, n)
     integer i, n
     real a(n), b(n), c(n)
     do i = 1, n
        c(i) = a(i) + b(i)
     enddo
     end

 with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:

       xorps   %xmm2, %xmm2
       
 .L6:
       movaps  %xmm2, %xmm0
       movaps  %xmm2, %xmm1
       movlps  (%r9,%rax), %xmm0
       movlps  (%r8,%rax), %xmm1
       movhps  8(%r9,%rax), %xmm0
       movhps  8(%r8,%rax), %xmm1
       incl    %ecx
       addps   %xmm1, %xmm0
       movaps  %xmm0, 0(%rbp,%rax)
       addq    $16, %rax
       cmpl    %ebx, %ecx
       jb      .L6

 I'm not a master of x86_64 assembly, but this strongly looks like %xmm{0,1}
 have to be zero'd (%xmm2 is set to zero by xor'ing it with itself), before
 they are completely filled with the mov{l,h}ps instructions ?

>>>
>>> I think it is used to avoid partial SSE register stall.
>>>
>>>
>> You mean there's no movaps (%r9,%rax), %xmm0 (and mutatis mutandis for
>> %xmm1) instruction (to copy 4*32 bits to the register) ?
>>
> If you want those, you must request them with -mtune=barcelona.

Which would then get you movups (%r9,%rax), %xmm0 (unaligned move).
generic tuning prefers the split moves, AMD Fam10 and above handle
unaligned moves just fine.

Richard.


Re: On the x86_64, does one have to zero a vector register before filling it completely ?

2009-11-28 Thread Tim Prince

Richard Guenther wrote:

On Sat, Nov 28, 2009 at 4:26 PM, Tim Prince  wrote:

Toon Moene wrote:

H.J. Lu wrote:

On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene  wrote:

L.S.,

Due to the discussion on register allocation, I went back to a hobby of
mine: Studying the assembly output of the compiler.

For this Fortran subroutine (note: unless otherwise told to the Fortran
front end, reals are 32 bit floating point numbers):

subroutine sum(a, b, c, n)
integer i, n
real a(n), b(n), c(n)
do i = 1, n
   c(i) = a(i) + b(i)
enddo
end

with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:

  xorps   %xmm2, %xmm2
  
.L6:
  movaps  %xmm2, %xmm0
  movaps  %xmm2, %xmm1
  movlps  (%r9,%rax), %xmm0
  movlps  (%r8,%rax), %xmm1
  movhps  8(%r9,%rax), %xmm0
  movhps  8(%r8,%rax), %xmm1
  incl    %ecx
  addps   %xmm1, %xmm0
  movaps  %xmm0, 0(%rbp,%rax)
  addq    $16, %rax
  cmpl    %ebx, %ecx
  jb      .L6

I'm not a master of x86_64 assembly, but this strongly looks like %xmm{0,1}
have to be zero'd (%xmm2 is set to zero by xor'ing it with itself), before
they are completely filled with the mov{l,h}ps instructions ?


I think it is used to avoid partial SSE register stall.



You mean there's no movaps (%r9,%rax), %xmm0 (and mutatis mutandis for
%xmm1) instruction (to copy 4*32 bits to the register) ?


If you want those, you must request them with -mtune=barcelona.


Which would then get you movups (%r9,%rax), %xmm0 (unaligned move).
generic tuning prefers the split moves, AMD Fam10 and above handle
unaligned moves just fine.

Correct, the movaps would have been used if alignment were recognized.
The newer CPUs achieve full performance with movups.
Do you consider Core i7/Nehalem as included in "AMD Fam10 and above?"
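
Whether movaps is usable comes down to the address being a multiple of 16 bytes. A minimal sketch of that check in C, together with the usual peeling arithmetic; the names is_aligned16 and peel_count are hypothetical, and this is not meant to describe how GCC's vectorizer is written internally.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p sits on a 16-byte boundary, as movaps requires. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Number of leading floats to handle one at a time so that the rest of
   the array starts on a 16-byte boundary.  Assumes p is at least
   4-byte aligned, as a float array normally is. */
size_t peel_count(const float *p)
{
    size_t misalign = (uintptr_t)p & 15u;   /* bytes past the boundary */
    return misalign ? (16u - misalign) / sizeof(float) : 0;
}
```

When the compiler cannot prove alignment at compile time, it has to fall back on unaligned or split loads, or emit a run-time check along these lines.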


Re: i370 port - music/sp - possible generic gcc problem

2009-11-28 Thread Paul Edwards

>> Anyway, I tracked down the particular malloc() which gave changed
>> behaviour depending on whether the malloc() did a memory initialization
>> to NULs or not.

> Well, GC hands out non-zeroed memory - the callers are responsible
> for initializing it.  So the fix below is not a fix but papering over an
> issue elsewhere.


Hi Richard.

If GC does that, then how come there is all this effort to do
mmap testing to see if it has the facility to zero memory, and
why is the surrounding code (in GCC 4.4's alloc_page())
calling XCNEWVEC instead of XNEWVEC?

Thanks.  Paul.



Re: On the x86_64, does one have to zero a vector register before filling it completely ?

2009-11-28 Thread Richard Guenther
On Sat, Nov 28, 2009 at 5:31 PM, Tim Prince  wrote:
> Richard Guenther wrote:
>>
>> On Sat, Nov 28, 2009 at 4:26 PM, Tim Prince  wrote:
>>>
>>> Toon Moene wrote:

 H.J. Lu wrote:
>
> On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene  wrote:
>>
>> L.S.,
>>
>> Due to the discussion on register allocation, I went back to a hobby
>> of
>> mine: Studying the assembly output of the compiler.
>>
>> For this Fortran subroutine (note: unless otherwise told to the
>> Fortran
>> front end, reals are 32 bit floating point numbers):
>>
>>    subroutine sum(a, b, c, n)
>>    integer i, n
>>    real a(n), b(n), c(n)
>>    do i = 1, n
>>       c(i) = a(i) + b(i)
>>    enddo
>>    end
>>
>> with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
>>
>>      xorps   %xmm2, %xmm2
>>      
>> .L6:
>>      movaps  %xmm2, %xmm0
>>      movaps  %xmm2, %xmm1
>>      movlps  (%r9,%rax), %xmm0
>>      movlps  (%r8,%rax), %xmm1
>>      movhps  8(%r9,%rax), %xmm0
>>      movhps  8(%r8,%rax), %xmm1
>>      incl    %ecx
>>      addps   %xmm1, %xmm0
>>      movaps  %xmm0, 0(%rbp,%rax)
>>      addq    $16, %rax
>>      cmpl    %ebx, %ecx
>>      jb      .L6
>>
>> I'm not a master of x86_64 assembly, but this strongly looks like %xmm{0,1}
>> have to be zero'd (%xmm2 is set to zero by xor'ing it with itself), before
>> they are completely filled with the mov{l,h}ps instructions ?
>>
> I think it is used to avoid partial SSE register stall.
>
>
 You mean there's no movaps (%r9,%rax), %xmm0 (and mutatis mutandis for
 %xmm1) instruction (to copy 4*32 bits to the register) ?

>>> If you want those, you must request them with -mtune=barcelona.
>>
>> Which would then get you movups (%r9,%rax), %xmm0 (unaligned move).
>> generic tuning prefers the split moves, AMD Fam10 and above handle
>> unaligned moves just fine.
>
> Correct, the movaps would have been used if alignment were recognized.
> The newer CPUs achieve full performance with movups.
> Do you consider Core i7/Nehalem as included in "AMD Fam10 and above?"

I'd have to consult the optimization manual of those, but HJ may know
off-head.

Richard.


Re: i370 port - music/sp - possible generic gcc problem

2009-11-28 Thread Richard Guenther
On Sat, Nov 28, 2009 at 5:35 PM, Paul Edwards  wrote:
>>> Anyway, I tracked down the particular malloc() which gave changed
>>> behaviour depending on whether the malloc() did a memory initialization
>>> to NULs or not.
>
>> Well, GC hands out non-zeroed memory - the callers are responsible
>> for initializing it.  So the fix below is not a fix but papering over an
>> issue elsewhere.
>
> Hi Richard.
>
> If GC does that, then how come there is all this effort to do
> mmap testing to see if it has the facility to zero memory, and

I can't see what you are referring to.

> why is the surrounding code (in GCC 4.4's alloc_page())
> calling XCNEWVEC instead of XNEWVEC?

That's the page table entries, not the data itself.

There wouldn't be the need for ggc_alloc_cleared if ggc_alloc
would already zero pages.
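
The division of labour described here can be sketched in a few lines of C. The functions demo_alloc and demo_alloc_cleared are hypothetical stand-ins, not the ggc API; the 0x01 fill is an assumption borrowed from the x'01' poisoning trick mentioned earlier in the thread, used so that a read of uninitialized memory misbehaves consistently instead of by luck.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for an allocator that makes no promise about the contents
   of the memory it returns.  The 0x01 poison makes that explicit. */
void *demo_alloc(size_t size)
{
    void *p = malloc(size);
    if (p)
        memset(p, 0x01, size);
    return p;
}

/* Cleared variant: the zeroing is done by the wrapper, on request,
   which is why a separate entry point exists at all. */
void *demo_alloc_cleared(size_t size)
{
    void *p = demo_alloc(size);
    if (p)
        memset(p, 0, size);
    return p;
}
```

A caller that uses the plain variant and then reads a field it never wrote will see 0x01 bytes under this scheme, which is usually easier to spot than an accidental zero.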

Richard.

> Thanks.  Paul.
>
>


Re: On the x86_64, does one have to zero a vector register before filling it completely ?

2009-11-28 Thread Toon Moene

Tim Prince wrote:

> If you want those, you must request them with -mtune=barcelona.

OK, so it is an alignment issue (with -mtune=barcelona):

.L6:
        movups  0(%rbp,%rax), %xmm0
        movups  (%rbx,%rax), %xmm1
        incl    %ecx
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%r8,%rax)
        addq    $16, %rax
        cmpl    %r10d, %ecx
        jb      .L6

Thanks,



Re: On the x86_64, does one have to zero a vector register before filling it completely ?

2009-11-28 Thread Toon Moene

Toon Moene wrote:


Tim Prince wrote:

 > If you want those, you must request them with -mtune=barcelona.

OK, so it is an alignment issue (with -mtune=barcelona):

.L6:
        movups  0(%rbp,%rax), %xmm0
        movups  (%rbx,%rax), %xmm1
        incl    %ecx
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%r8,%rax)
        addq    $16, %rax
        cmpl    %r10d, %ecx
        jb      .L6


Once this problem is solved (well, determined how it could be solved), 
we go on to the next, the extraneous induction variable %ecx.


There are two ways to deal with it:

1. Eliminate it with respect to the other induction variable that
   counts in the same direction (upwards, with steps 16) and remember
   that induction variable's (%rax) limit.

or:

2. Count %ecx down from %r10d to zero (which eliminates %r10d as a loop
   carried register).
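
The two options can be written out in C to make the difference concrete. This is only a sketch of the loop shapes, with scalar code standing in for the vector instructions; add4, add_two_ivs, add_one_iv and add_countdown are names invented for the example, and nvec is the number of 4-float groups.

```c
#include <stddef.h>

/* Stand-in for one vector add of four floats. */
static void add4(const float *a, const float *b, float *c)
{
    for (int k = 0; k < 4; k++)
        c[k] = a[k] + b[k];
}

/* Shape of the emitted loop: an offset (like %rax) and a separate
   up-counting trip counter (like %ecx) compared against a limit. */
void add_two_ivs(const float *a, const float *b, float *c, unsigned nvec)
{
    size_t off = 0;
    for (unsigned cnt = 0; cnt < nvec; cnt++) {
        add4(a + off, b + off, c + off);
        off += 4;
    }
}

/* Option 1: drop the counter and compare the offset with its own limit. */
void add_one_iv(const float *a, const float *b, float *c, unsigned nvec)
{
    size_t limit = (size_t)nvec * 4;
    for (size_t off = 0; off < limit; off += 4)
        add4(a + off, b + off, c + off);
}

/* Option 2: count down to zero, so no limit has to stay live in a
   register across iterations. */
void add_countdown(const float *a, const float *b, float *c, unsigned nvec)
{
    size_t off = 0;
    for (unsigned left = nvec; left != 0; left--) {
        add4(a + off, b + off, c + off);
        off += 4;
    }
}
```

All three compute the same result; what changes is how many values the loop has to keep live.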

g77 avoided this by coding counted do loops with a separate loop counter 
counting down to zero - not so with gfortran (quoting):


/* Translate the simple DO construct.  This is where the loop variable
   has integer type and step +-1.  We can't use this in the general case
   because integer overflow and floating point errors could give
   incorrect results.
   We translate a do loop from:

   DO dovar = from, to, step
      body
   END DO

   to:

   [Evaluate loop bounds and step]
   dovar = from;
   if ((step > 0) ? (dovar <= to) : (dovar >= to))
    {
      for (;;)
        {
          body;
   cycle_label:
          cond = (dovar == to);
          dovar += step;
          if (cond) goto end_label;
        }
      }
   end_label:

   This helps the optimizers by avoiding the extra induction variable
   used in the general case.  */
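
For readers who want to step through it, the translation in the quoted comment can be written as ordinary C. This is a sketch only: simple_do and its callback signature are invented for the example, step is assumed to be +1 or -1, and the final increment would overflow in C if to were INT_MAX (or INT_MIN when counting down), which the sketch does not guard against.

```c
/* Runs body(dovar, ctx) for DO dovar = from, to, step with step +1 or
   -1, following the shape of the quoted comment.  body may be a null
   pointer.  Returns the number of iterations executed. */
long simple_do(int from, int to, int step,
               void (*body)(int, void *), void *ctx)
{
    long trips = 0;
    int dovar = from;

    if ((step > 0) ? (dovar <= to) : (dovar >= to)) {
        for (;;) {
            if (body)
                body(dovar, ctx);
            trips++;
            int cond = (dovar == to);   /* exit test uses equality */
            dovar += step;              /* increment happens regardless */
            if (cond)
                break;
        }
    }
    return trips;
}
```

The loop variable itself is the only induction variable here; there is no separate trip counter to keep in step with it.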

So either we teach the Fortran front end this trick, or we teach the 
loop optimization the trick of flipping the sense of a(n otherwise 
unused) induction variable.




Re: On the x86_64, does one have to zero a vector register before filling it completely ?

2009-11-28 Thread Tim Prince

Toon Moene wrote:

Toon Moene wrote:


Tim Prince wrote:

 > If you want those, you must request them with -mtune=barcelona.

OK, so it is an alignment issue (with -mtune=barcelona):

.L6:
        movups  0(%rbp,%rax), %xmm0
        movups  (%rbx,%rax), %xmm1
        incl    %ecx
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%r8,%rax)
        addq    $16, %rax
        cmpl    %r10d, %ecx
        jb      .L6


Once this problem is solved (well, determined how it could be solved), 
we go on to the next, the extraneous induction variable %ecx.


There are two ways to deal with it:

1. Eliminate it with respect to the other induction variable that
   counts in the same direction (upwards, with steps 16) and remember
   that induction variable's (%rax) limit.

or:

2. Count %ecx down from %r10d to zero (which eliminates %r10d as a loop
   carried register).

g77 avoided this by coding counted do loops with a separate loop counter 
counting down to zero - not so with gfortran (quoting):


/* Translate the simple DO construct.  This is where the loop variable
   has integer type and step +-1.  We can't use this in the general case
   because integer overflow and floating point errors could give
   incorrect results.
   We translate a do loop from:

   DO dovar = from, to, step
      body
   END DO

   to:

   [Evaluate loop bounds and step]
   dovar = from;
   if ((step > 0) ? (dovar <= to) : (dovar >= to))
    {
      for (;;)
        {
          body;
   cycle_label:
          cond = (dovar == to);
          dovar += step;
          if (cond) goto end_label;
        }
      }
   end_label:

   This helps the optimizers by avoiding the extra induction variable
   used in the general case.  */

So either we teach the Fortran front end this trick, or we teach the 
loop optimization the trick of flipping the sense of a(n otherwise 
unused) induction variable.


This would have paid off more frequently in i386 mode, where there is a 
possibility of integer register pressure in loops small enough for such 
an optimization to succeed.
This seems to be among the types of optimizations envisioned for 
run-time binary interpretation systems.


avr: optimizing assignment to a bit field

2009-11-28 Thread Shaun Jackman
When assigning a bool to a single bit of a bitfield located in the
bit-addressable region of memory, better code is produced by
        if (flag)
                bitfield.bit = true;
        else
                bitfield.bit = false;
than
        bitfield.bit = flag;

I've included a short test and the assembler output by both forms.
Should I file a bug suggesting a possible improvement here?

Cheers,
Shaun

#include <stdbool.h>
#include <stdint.h>

struct byte { uint8_t x0:1; uint8_t x1:1; uint8_t x2:1; uint8_t x3:1;
        uint8_t x4:1; uint8_t x5:1; uint8_t x6:1; uint8_t x7:1; };

volatile struct byte *const porte = (void*)0x23;

void set_flag_good(bool flag)
{
        if (flag)
                porte->x6 = true;
        else
                porte->x6 = false;
}

void set_flag_bad(bool flag)
{
        porte->x6 = flag;
}


00000000 <set_flag_good>:
   0:   88 23           and     r24, r24
   2:   01 f4           brne    .+0             ; 0x4
                        2: R_AVR_7_PCREL        .text+0x8
   4:   1e 98           cbi     0x03, 6 ; 3
   6:   08 95           ret
   8:   1e 9a           sbi     0x03, 6 ; 3
   a:   08 95           ret

0000000c <set_flag_bad>:
   c:   81 70           andi    r24, 0x01       ; 1
   e:   82 95           swap    r24
  10:   88 0f           add     r24, r24
  12:   88 0f           add     r24, r24
  14:   80 7c           andi    r24, 0xC0       ; 192
  16:   93 b1           in      r25, 0x03       ; 3
  18:   9f 7b           andi    r25, 0xBF       ; 191
  1a:   98 2b           or      r25, r24
  1c:   93 b9           out     0x03, r25       ; 3
  1e:   08 95           ret
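
To see what the longer sequence is actually doing, here is a step-by-step model of both listings in portable C, operating on a plain byte instead of the I/O register. The function names are invented for the example, and the comments map each statement to the instruction it is assumed to model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Swap the two nibbles of a byte, as the AVR swap instruction does. */
static uint8_t swap_nibbles(uint8_t v)
{
    return (uint8_t)((v << 4) | (v >> 4));
}

/* Model of the longer sequence: build the bit in a scratch register,
   then read-modify-write the whole port byte. */
uint8_t set_bit6_rmw(uint8_t port, bool flag)
{
    uint8_t r24 = flag;
    r24 &= 0x01;                    /* andi r24, 0x01 */
    r24 = swap_nibbles(r24);        /* swap r24: bit 0 moves to bit 4 */
    r24 = (uint8_t)(r24 + r24);     /* add r24, r24: now bit 5 */
    r24 = (uint8_t)(r24 + r24);     /* add r24, r24: now bit 6 */
    r24 &= 0xC0;                    /* andi r24, 0xC0 */
    uint8_t r25 = port;             /* in r25, 0x03 */
    r25 &= 0xBF;                    /* andi r25, 0xBF: clear bit 6 */
    r25 |= r24;                     /* or r25, r24 */
    return r25;                     /* out 0x03, r25 */
}

/* Model of the short sequence: a single-bit set or clear. */
uint8_t set_bit6_branch(uint8_t port, bool flag)
{
    return flag ? (uint8_t)(port | 0x40)    /* sbi 0x03, 6 */
                : (uint8_t)(port & 0xBF);   /* cbi 0x03, 6 */
}
```

The two models agree for every port value, so the difference between the listings is one of code size and of touching the whole byte rather than a single bit, not of the final value written.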


Re: avr: optimizing assignment to a bit field

2009-11-28 Thread Richard Guenther
On Sat, Nov 28, 2009 at 11:43 PM, Shaun Jackman  wrote:
> When assigning a bool to a single bit of a bitfield located in the
> bit-addressable region of memory, better code is produced by
>        if (flag)
>                bitfield.bit = true;
>        else
>                bitfield.bit = false;
> than
>        bitfield.bit = flag;
>
> I've included a short test and the assembler output by both forms.
> Should I file a bug suggesting a possible improvement here?

Yes, a bugreport is useful - but there might be a bug with this issue
already.

Richard.

> Cheers,
> Shaun
>
> #include <stdbool.h>
> #include <stdint.h>
>
> struct byte { uint8_t x0:1; uint8_t x1:1; uint8_t x2:1; uint8_t x3:1;
>        uint8_t x4:1; uint8_t x5:1; uint8_t x6:1; uint8_t x7:1; };
>
> volatile struct byte *const porte = (void*)0x23;
>
> void set_flag_good(bool flag)
> {
>        if (flag)
>                porte->x6 = true;
>        else
>                porte->x6 = false;
> }
>
> void set_flag_bad(bool flag)
> {
>        porte->x6 = flag;
> }
>
>
> 00000000 <set_flag_good>:
>   0:   88 23           and     r24, r24
>   2:   01 f4           brne    .+0             ; 0x4 
>                        2: R_AVR_7_PCREL        .text+0x8
>   4:   1e 98           cbi     0x03, 6 ; 3
>   6:   08 95           ret
>   8:   1e 9a           sbi     0x03, 6 ; 3
>   a:   08 95           ret
>
> 0000000c <set_flag_bad>:
>   c:   81 70           andi    r24, 0x01       ; 1
>   e:   82 95           swap    r24
>  10:   88 0f           add     r24, r24
>  12:   88 0f           add     r24, r24
>  14:   80 7c           andi    r24, 0xC0       ; 192
>  16:   93 b1           in      r25, 0x03       ; 3
>  18:   9f 7b           andi    r25, 0xBF       ; 191
>  1a:   98 2b           or      r25, r24
>  1c:   93 b9           out     0x03, r25       ; 3
>  1e:   08 95           ret
>


Re: i370 port - music/sp - possible generic gcc problem

2009-11-28 Thread Paul Edwards

>> If GC does that, then how come there is all this effort to do
>> mmap testing to see if it has the facility to zero memory, and

> I can't see what you are referring to.

I obviously misinterpreted that then.

>> why is the surrounding code (in GCC 4.4's alloc_page())
>> calling XCNEWVEC instead of XNEWVEC?

> That's the page table entries, not the data itself.
>
> There wouldn't be the need for ggc_alloc_cleared if ggc_alloc
> would already zero pages.


Ok, based on this, I traced it back further:

rtx
gen_rtx_fmt_e0 (code, mode, arg0)
RTX_CODE code;
enum machine_mode mode;
rtx arg0;
{
 rtx rt;
 rt = ggc_alloc_rtx (2);
 memset (rt, 0, sizeof (struct rtx_def) - sizeof (rtunion));



rtx
gen_rtx_MEM (mode, addr)
     enum machine_mode mode;
     rtx addr;
{
  rtx rt = gen_rtx_raw_MEM (mode, addr);

  /* This field is not cleared by the mere allocation of the rtx, so
     we clear it here.  */
  MEM_ATTRS (rt) = 0;

  return rt;
}
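
To make the point of the two excerpts above concrete, here is a minimal sketch of the same pattern in a toy model: the allocator hands back memory that is not zeroed, the constructor clears only the header and fills the first operand slot, and any further slot keeps whatever was there until someone clears it explicitly. Everything named toy_* and TOY_POISON is hypothetical and is not GCC's real rtx definition; the poison fill merely stands in for "random previous contents".

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins only; these are NOT GCC's rtunion / rtx_def. */
union toy_slot { void *ptr; long num; };

struct toy_node {
    int code;
    int mode;
    union toy_slot fld[];          /* operand slots follow the header */
};

#define TOY_POISON 0xA5            /* assumed stand-in for stale memory */

/* Allocate a node with nslots operand slots and fill it with poison,
   modelling an allocator that does not zero what it returns. */
static struct toy_node *toy_alloc(size_t nslots)
{
    size_t size = offsetof(struct toy_node, fld)
                  + nslots * sizeof(union toy_slot);
    struct toy_node *n = malloc(size);
    if (n != NULL)
        memset(n, TOY_POISON, size);
    return n;
}

/* Same shape as the quoted constructor: clear the header only,
   then set slot 0.  Slot 1 is left exactly as the allocator gave it. */
static struct toy_node *toy_gen(int code, int mode, void *arg0)
{
    struct toy_node *n = toy_alloc(2);
    if (n == NULL)
        return NULL;
    memset(n, 0, offsetof(struct toy_node, fld));
    n->code = code;
    n->mode = mode;
    n->fld[0].ptr = arg0;
    return n;
}

/* Returns 1 if every byte of slot i still carries the poison value. */
static int slot_is_poison(const struct toy_node *n, size_t i)
{
    const unsigned char *p = (const unsigned char *)&n->fld[i];
    size_t k;
    for (k = 0; k < sizeof(union toy_slot); k++)
        if (p[k] != TOY_POISON)
            return 0;
    return 1;
}
```

In this model, slot_is_poison (n, 1) stays true after toy_gen () until the caller stores into fld[1], which is the role the explicit MEM_ATTRS (rt) = 0 plays in the excerpt.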



void
init_caller_save ()
{
...
  for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
    for (mode = 0 ; mode < MAX_MACHINE_MODE; mode++)
      if (HARD_REGNO_MODE_OK (i, mode))
        {
          rtx mem = gen_rtx_MEM (mode, address);


And indeed, this code (caller_save) shows up in one of the symptoms:
where it is meant to be saving a register prior to a function call,
it instead generates a completely stupid instruction, and that stupid
instruction happens to be the first instruction listed in i370.md.

Right at the moment the symptom is:

/ID SAVE-JOB-123456 @BLD000    
/SYS REGION=,XREGION=64M
/FILE SYSPRINT PRT OSRECFM(F) OSLRECL(256)
/FILE SYSIN N(MATH.C) SHR
/FILE O N(MATH.S) NEW RECFM(V) SP(1000)
/FILE SYSINCL PDS(*.H)
/FILE INCLUDE PDS(*.H)
/PARM -Os -S -DUSE_MEMMGR -DMUSIC -o dd:o -
/LOAD XMON
: In function `acos':
:137: Internal compiler error in ?, at :724
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://gcc.gnu.org/bugs.html> for instructions.

when compiling something from my C runtime library.  But I basically get
different behaviour depending on the value I pre-fill uninitialized
memory with.

Any ideas on what to look for now?  I can probably isolate which bit
of the malloc()ed memory is being referenced before initialization.
But finding out who is doing the reference could be tricky.
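
One way to narrow down which part of the malloc()ed memory is read before it is written, sketched under assumptions: wrap the allocator so that every fresh block is filled with a known byte, then scan a block later for bytes that still carry that value. The names fill_byte, fill_malloc and first_untouched are hypothetical, and the scan can report a false positive when a byte was legitimately written with the same value as the fill.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical knob: the byte every fresh allocation is filled with.
   Changing it between runs shows whether behaviour depends on it. */
static unsigned char fill_byte = 0xA5;

/* malloc() replacement that pre-fills the block with fill_byte. */
static void *fill_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        memset(p, fill_byte, size);
    return p;
}

/* Returns the offset of the first byte that still equals fill_byte,
   or -1 if no byte in the block does. */
static long first_untouched(const void *p, size_t size)
{
    const unsigned char *b = p;
    size_t i;
    for (i = 0; i < size; i++)
        if (b[i] == fill_byte)
            return (long)i;
    return -1;
}
```

Calling first_untouched () on a structure just before it is consumed gives an offset that can be matched against the structure layout, which may help identify the field nobody initialized.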

I might be able to trace it from a different approach, getting more
information about that internal error, now that I know where some
of the involved memory is.  I'll get that  converted into a
PC filename first.

BFN.  Paul.



Re: avr: optimizing assignment to a bit field

2009-11-28 Thread Shaun Jackman
2009/11/28 Richard Guenther:
> On Sat, Nov 28, 2009 at 11:43 PM, Shaun Jackman wrote:
>> When assigning a bool to a single bit of a bitfield located in the
>> bit-addressable region of memory, better code is produced by
>>        if (flag)
>>                bitfield.bit = true;
>>        else
>>                bitfield.bit = false;
>> than
>>        bitfield.bit = flag;
>>
>> I've included a short test and the assembler output by both forms.
>> Should I file a bug suggesting a possible improvement here?
>
> Yes, a bugreport is useful - but there might be a bug with this issue
> already.

Reported here:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42210
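
For anyone who wants to try the two forms side by side, here is a minimal host-side sketch. It only shows that both forms store the same value into the bit; the difference in generated code discussed above is specific to the AVR target and its bit-addressable I/O region, and cannot be observed this way. The helper names set_x6_branch and set_x6_assign are made up for this sketch, and uint8_t bit-fields rely on compiler support (GCC accepts them).

```c
#include <stdbool.h>
#include <stdint.h>

/* Eight one-bit flags, as in the quoted test case. */
struct byte {
    uint8_t x0:1; uint8_t x1:1; uint8_t x2:1; uint8_t x3:1;
    uint8_t x4:1; uint8_t x5:1; uint8_t x6:1; uint8_t x7:1;
};

/* Branching form: each arm stores a compile-time constant, which a
   target with single-bit set/clear instructions can map directly. */
static inline void set_x6_branch(volatile struct byte *p, bool flag)
{
    if (flag)
        p->x6 = true;
    else
        p->x6 = false;
}

/* Direct form: stores a run-time value into the bit-field. */
static inline void set_x6_assign(volatile struct byte *p, bool flag)
{
    p->x6 = flag;
}
```

On the AVR, the pointer would refer to a fixed I/O address as in the original test; here any struct byte object will do.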

Cheers,
Shaun


Re: i370 port - music/sp - possible generic gcc problem

2009-11-28 Thread Paul Edwards

> : In function `acos':
> :137: Internal compiler error in ?, at :724
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See <URL:http://gcc.gnu.org/bugs.html> for instructions.
>
> I might be able to trace it from a different approach, getting more
> information about that internal error, now that I know where some
> of the involved memory is.  I'll get that  converted into a
> PC filename first.


Six hours of compilation later, I was unsuccessful in getting the proper
filename displayed.  As far as I can tell, it's aborting but not displaying
any output, i.e. it is randomly displaying the above message.

However, not to worry, since there's only one line 724 with an
abort() in it, and it's this bit of code:

static int
insert_save (chain, before_p, regno, to_save, save_mode)
     struct insn_chain *chain;
     int before_p;
     int regno;
     HARD_REG_SET *to_save;
     enum machine_mode *save_mode;
{
  int i;
  unsigned int k;
  rtx pat = NULL_RTX;
  int code;
  unsigned int numregs = 0;
  struct insn_chain *new;
  rtx mem;

  /* A common failure mode if register status is not correct in the RTL
     is for this routine to be called with a REGNO we didn't expect to
     save.  That will cause us to write an insn with a (nil) SET_DEST
     or SET_SRC.  Instead of doing so and causing a crash later, check
     for this common case and abort here instead.  This will remove one
     step in debugging such problems.  */

  if (regno_save_mem[regno][1] == 0)
    abort ();


which is in the same file as the init_caller_save() function that
allocated the memory in the first place.

One fortunate thing is that this source file is under 1000 lines
long so hopefully amenable to debugging.
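
Since the file name did not come out in the error message, one possible way to sidestep that while debugging is a check that carries a hand-written tag instead of relying on the reported source name. This is only a sketch: describe_failure and CHECK_OR_ABORT are hypothetical names, it assumes a library that provides snprintf (C99), and whether the empty file name really comes from how the build reports source names is an assumption not confirmed in this thread.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: formats "tag:line: check failed: expr" into buf
   and returns what snprintf returns. */
static int describe_failure(char *buf, size_t size, const char *tag,
                            int line, const char *expr)
{
    return snprintf(buf, size, "%s:%d: check failed: %s", tag, line, expr);
}

/* Hypothetical macro: tag is a string literal chosen by hand, so the
   message stays useful even if the source file name is not available. */
#define CHECK_OR_ABORT(tag, cond)                                        \
    do {                                                                 \
        if (!(cond)) {                                                   \
            char msg_[256];                                              \
            describe_failure(msg_, sizeof msg_, (tag), __LINE__, #cond); \
            fprintf(stderr, "%s\n", msg_);                               \
            abort();                                                     \
        }                                                                \
    } while (0)
```

In the function above this might be used as CHECK_OR_ABORT ("insert_save", regno_save_mem[regno][1] != 0); in place of the bare abort ().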

BFN.  Paul.