On the x86_64, does one have to zero a vector register before filling it completely ?
L.S.,

Due to the discussion on register allocation, I went back to a hobby of
mine: Studying the assembly output of the compiler.

For this Fortran subroutine (note: unless otherwise told to the Fortran
front end, reals are 32 bit floating point numbers):

      subroutine sum(a, b, c, n)
      integer i, n
      real a(n), b(n), c(n)
      do i = 1, n
         c(i) = a(i) + b(i)
      enddo
      end

with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:

        xorps   %xmm2, %xmm2

.L6:
        movaps  %xmm2, %xmm0
        movaps  %xmm2, %xmm1
        movlps  (%r9,%rax), %xmm0
        movlps  (%r8,%rax), %xmm1
        movhps  8(%r9,%rax), %xmm0
        movhps  8(%r8,%rax), %xmm1
        incl    %ecx
        addps   %xmm1, %xmm0
        movaps  %xmm0, 0(%rbp,%rax)
        addq    $16, %rax
        cmpl    %ebx, %ecx
        jb      .L6

I'm not a master of x86_64 assembly, but this strongly looks like
%xmm{0,1} have to be zero'd (%xmm2 is set to zero by xor'ing it with
itself), before they are completely filled with the mov{l,h}ps
instructions ?

Am I missing something ?

[ BTW, the induction variable %ecx could have been eliminated, because
  %rax also counts upwards (but 16 at a time instead of 1) ]

Thanks for any insight,

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene wrote:
> L.S.,
>
> Due to the discussion on register allocation, I went back to a hobby of
> mine: Studying the assembly output of the compiler.
>
> For this Fortran subroutine (note: unless otherwise told to the Fortran
> front end, reals are 32 bit floating point numbers):
>
>       subroutine sum(a, b, c, n)
>       integer i, n
>       real a(n), b(n), c(n)
>       do i = 1, n
>          c(i) = a(i) + b(i)
>       enddo
>       end
>
> with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
>
>         xorps   %xmm2, %xmm2
>
> .L6:
>         movaps  %xmm2, %xmm0
>         movaps  %xmm2, %xmm1
>         movlps  (%r9,%rax), %xmm0
>         movlps  (%r8,%rax), %xmm1
>         movhps  8(%r9,%rax), %xmm0
>         movhps  8(%r8,%rax), %xmm1
>         incl    %ecx
>         addps   %xmm1, %xmm0
>         movaps  %xmm0, 0(%rbp,%rax)
>         addq    $16, %rax
>         cmpl    %ebx, %ecx
>         jb      .L6
>
> I'm not a master of x86_64 assembly, but this strongly looks like %xmm{0,1}
> have to be zero'd (%xmm2 is set to zero by xor'ing it with itself), before
> they are completely filled with the mov{l,h}ps instructions ?
>

I think it is used to avoid partial SSE register stall.

--
H.J.
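A minimal sketch of the two load strategies discussed in this thread, written with SSE intrinsics instead of assembly. The function names (load_split, load_whole, add4_split, add4_whole) are made up for illustration. The working assumption, following H.J.'s remark, is that the zeroing gives the half-loads a fresh register to merge into and does not change the result; that is why both variants are expected to agree.

```c
#include <xmmintrin.h>  /* SSE: _mm_setzero_ps, _mm_loadl_pi, _mm_loadh_pi, ... */

/* Split load, as in the generic-tuned loop: start from a zeroed register
   (the xorps/movaps pair), then fill the low half (movlps) and the high
   half (movhps) separately.  Neither half-load needs 16-byte alignment. */
static __m128 load_split(const float *p)
{
    __m128 v = _mm_setzero_ps();
    v = _mm_loadl_pi(v, (const __m64 *)p);        /* low 64 bits:  p[0], p[1] */
    v = _mm_loadh_pi(v, (const __m64 *)(p + 2));  /* high 64 bits: p[2], p[3] */
    return v;
}

/* Single unaligned 128-bit load (movups), as seen with -mtune=barcelona. */
static __m128 load_whole(const float *p)
{
    return _mm_loadu_ps(p);
}

/* c[0..3] = a[0..3] + b[0..3], using split loads. */
void add4_split(const float *a, const float *b, float *c)
{
    _mm_storeu_ps(c, _mm_add_ps(load_split(a), load_split(b)));
}

/* c[0..3] = a[0..3] + b[0..3], using whole unaligned loads. */
void add4_whole(const float *a, const float *b, float *c)
{
    _mm_storeu_ps(c, _mm_add_ps(load_whole(a), load_whole(b)));
}
```

Both functions compute the same four sums; which one is faster depends on the CPU being tuned for, which is the subject of the rest of the thread.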
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
H.J. Lu wrote:

> On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene wrote:
>
>> L.S.,
>>
>> Due to the discussion on register allocation, I went back to a hobby of
>> mine: Studying the assembly output of the compiler.
>>
>> For this Fortran subroutine (note: unless otherwise told to the Fortran
>> front end, reals are 32 bit floating point numbers):
>>
>>       subroutine sum(a, b, c, n)
>>       integer i, n
>>       real a(n), b(n), c(n)
>>       do i = 1, n
>>          c(i) = a(i) + b(i)
>>       enddo
>>       end
>>
>> with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
>>
>>         xorps   %xmm2, %xmm2
>>
>> .L6:
>>         movaps  %xmm2, %xmm0
>>         movaps  %xmm2, %xmm1
>>         movlps  (%r9,%rax), %xmm0
>>         movlps  (%r8,%rax), %xmm1
>>         movhps  8(%r9,%rax), %xmm0
>>         movhps  8(%r8,%rax), %xmm1
>>         incl    %ecx
>>         addps   %xmm1, %xmm0
>>         movaps  %xmm0, 0(%rbp,%rax)
>>         addq    $16, %rax
>>         cmpl    %ebx, %ecx
>>         jb      .L6
>>
>> I'm not a master of x86_64 assembly, but this strongly looks like
>> %xmm{0,1} have to be zero'd (%xmm2 is set to zero by xor'ing it with
>> itself), before they are completely filled with the mov{l,h}ps
>> instructions ?
>
> I think it is used to avoid partial SSE register stall.

You mean there's no

        movaps  (%r9,%rax), %xmm0

(and mutatis mutandis for %xmm1) instruction (to copy 4*32 bits to the
register) ?

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html
i370 port - music/sp - possible generic gcc problem
I think I have found a bug in gcc, that still exists in gcc 4.4

I found the problem on 3.2.3 though.

While MVS and VM have basically been working fine, when I did
the port to MUSIC/SP I started getting strange compilation failures.

Initializing the stack to NULs made the problem go away, but I
persisted, and instead initialized the stack to x'01' to try to get
consistent failures.

Even with that, and the malloced memory initialized to consistent
garbage, I still didn't get the desired consistency in failures.
But it was consistent enough that I could just run it 6 times and
see if there were any failures.

Anyway, I tracked down the particular malloc() which gave changed
behaviour depending on whether the malloc() did a memory initialization
to NULs or not.

It's this one (showing 3.2.3):

C:\devel\gcc\gcc>cvs diff -l -c15
cvs diff: Diffing .
Index: ggc-page.c
===
RCS file: c:\cvsroot/gcc/gcc/ggc-page.c,v
retrieving revision 1.1.1.1
diff -c -1 -5 -r1.1.1.1 ggc-page.c
*** ggc-page.c  15 Feb 2006 10:22:25 -  1.1.1.1
--- ggc-page.c  28 Nov 2009 14:13:41 -
***
*** 640,670
  #ifdef USING_MALLOC_PAGE_GROUPS
    else
      {
        /* Allocate a large block of memory and serve out the aligned
           pages therein.  This results in much less memory wastage
           than the traditional implementation of valloc.  */

        char *allocation, *a, *enda;
        size_t alloc_size, head_slop, tail_slop;
        int multiple_pages = (entry_size == G.pagesize);

        if (multiple_pages)
          alloc_size = GGC_QUIRE_SIZE * G.pagesize;
        else
          alloc_size = entry_size + G.pagesize - 1;
!       allocation = xmalloc (alloc_size);

        page = (char *) (((size_t) allocation + G.pagesize - 1) & -G.pagesize);
        head_slop = page - allocation;
        if (multiple_pages)
          tail_slop = ((size_t) allocation + alloc_size) & (G.pagesize - 1);
        else
          tail_slop = alloc_size - entry_size - head_slop;
        enda = allocation + alloc_size - tail_slop;

        /* We allocated N pages, which are likely not aligned, leaving
           us with N-1 usable pages.  We plan to place the page_group
           structure somewhere in the slop.  */
        if (head_slop >= sizeof (page_group))
          group = (page_group *)page - 1;
        else
--- 640,670
  #ifdef USING_MALLOC_PAGE_GROUPS
    else
      {
        /* Allocate a large block of memory and serve out the aligned
           pages therein.  This results in much less memory wastage
           than the traditional implementation of valloc.  */

        char *allocation, *a, *enda;
        size_t alloc_size, head_slop, tail_slop;
        int multiple_pages = (entry_size == G.pagesize);

        if (multiple_pages)
          alloc_size = GGC_QUIRE_SIZE * G.pagesize;
        else
          alloc_size = entry_size + G.pagesize - 1;
!       allocation = xcalloc (1, alloc_size);

        page = (char *) (((size_t) allocation + G.pagesize - 1) & -G.pagesize);
        head_slop = page - allocation;
        if (multiple_pages)
          tail_slop = ((size_t) allocation + alloc_size) & (G.pagesize - 1);
        else
          tail_slop = alloc_size - entry_size - head_slop;
        enda = allocation + alloc_size - tail_slop;

        /* We allocated N pages, which are likely not aligned, leaving
           us with N-1 usable pages.  We plan to place the page_group
           structure somewhere in the slop.  */
        if (head_slop >= sizeof (page_group))
          group = (page_group *)page - 1;
        else

I suspect that it has stayed hidden for so long because most people
probably have this mmap_anon:

/* Define if mmap can get us zeroed pages using MAP_ANON(YMOUS). */
#undef HAVE_MMAP_ANON

#ifdef HAVE_MMAP_ANON
# undef HAVE_MMAP_DEV_ZERO

# include <sys/mman.h>
# ifndef MAP_FAILED
# define MAP_FAILED -1
# endif
# if !defined (MAP_ANONYMOUS) && defined (MAP_ANON)
# define MAP_ANONYMOUS MAP_ANON
# endif
# define USING_MMAP

#endif

#ifndef USING_MMAP
#define USING_MALLOC_PAGE_GROUPS
#endif

Seems pretty clear that zeroed pages are required.

Looking at the code above and below this block, it uses xcalloc
instead.

It will take a couple of days to confirm that this is the last
presumed bug affecting the MUSIC/SP port.

In the meantime, can someone confirm that this code is wrong,
and that xcalloc is definitely required?

I had a look at the gcc4 code, and it is calling XNEWVEC which
is also using xmalloc instead of xcalloc, so I presume it is still
a problem today, under the right circumstances.

It took about a week to track this one down. :-)

One problem I have had for years, even on the MVS port, is that
I need to use -Os to get the exact same register selection on the
PC as the mainframe.  -O0 and -O2 get slightly different register
allocations, although all versions of the code are correct.

If I'm lucky, this fix may make that problem go away, but as I
said, it'll take a couple of days before the results are
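The page-alignment arithmetic in the ggc-page.c hunk quoted in this message can be tried in isolation. The sketch below is only an illustration: the three helper names are hypothetical, addresses are passed as plain integers instead of pointers, and pagesize is assumed to be a power of two (the masking trick depends on that).

```c
#include <stddef.h>
#include <stdint.h>

/* Round addr up to the next multiple of pagesize.  Mirrors
   (allocation + G.pagesize - 1) & -G.pagesize from the quoted hunk.
   Assumes pagesize is a power of two. */
uintptr_t round_up_to_page(uintptr_t addr, uintptr_t pagesize)
{
    return (addr + pagesize - 1) & -pagesize;
}

/* Bytes between the start of the raw allocation and the first aligned page. */
size_t head_slop(uintptr_t allocation, uintptr_t pagesize)
{
    return (size_t)(round_up_to_page(allocation, pagesize) - allocation);
}

/* Bytes left over after the last whole page (the multiple_pages case). */
size_t tail_slop(uintptr_t allocation, size_t alloc_size, uintptr_t pagesize)
{
    return (size_t)((allocation + alloc_size) & (pagesize - 1));
}
```

For example, with a 4096-byte page and a block starting at 0x1010, the first usable page starts at 0x2000 and 0xff0 bytes of head slop remain, which is where the quoted comment plans to put the page_group structure when it fits.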
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
Toon Moene wrote:

> H.J. Lu wrote:
>
>> On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene wrote:
>>
>>> L.S.,
>>>
>>> Due to the discussion on register allocation, I went back to a hobby of
>>> mine: Studying the assembly output of the compiler.
>>>
>>> For this Fortran subroutine (note: unless otherwise told to the Fortran
>>> front end, reals are 32 bit floating point numbers):
>>>
>>>       subroutine sum(a, b, c, n)
>>>       integer i, n
>>>       real a(n), b(n), c(n)
>>>       do i = 1, n
>>>          c(i) = a(i) + b(i)
>>>       enddo
>>>       end
>>>
>>> with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
>>>
>>>         xorps   %xmm2, %xmm2
>>>
>>> .L6:
>>>         movaps  %xmm2, %xmm0
>>>         movaps  %xmm2, %xmm1
>>>         movlps  (%r9,%rax), %xmm0
>>>         movlps  (%r8,%rax), %xmm1
>>>         movhps  8(%r9,%rax), %xmm0
>>>         movhps  8(%r8,%rax), %xmm1
>>>         incl    %ecx
>>>         addps   %xmm1, %xmm0
>>>         movaps  %xmm0, 0(%rbp,%rax)
>>>         addq    $16, %rax
>>>         cmpl    %ebx, %ecx
>>>         jb      .L6
>>>
>>> I'm not a master of x86_64 assembly, but this strongly looks like
>>> %xmm{0,1} have to be zero'd (%xmm2 is set to zero by xor'ing it with
>>> itself), before they are completely filled with the mov{l,h}ps
>>> instructions ?
>>
>> I think it is used to avoid partial SSE register stall.
>
> You mean there's no
>
>         movaps  (%r9,%rax), %xmm0
>
> (and mutatis mutandis for %xmm1) instruction (to copy 4*32 bits to the
> register) ?

If you want those, you must request them with -mtune=barcelona.
Re: i370 port - music/sp - possible generic gcc problem
On Sat, Nov 28, 2009 at 4:13 PM, Paul Edwards wrote:
> I think I have found a bug in gcc, that still exists in gcc 4.4
>
> I found the problem on 3.2.3 though.
>
> While MVS and VM have basically been working fine, when I did
> the port to MUSIC/SP I started getting strange compilation failures.
>
> Initializing the stack to NULs made the problem go away, but I
> persisted, and instead initialized the stack to x'01' to try to get
> consistent failures.
>
> Even with that, and the malloced memory initialized to consistent
> garbage, I still didn't get the desired consistency in failures.
> But it was consistent enough that I could just run it 6 times and
> see if there were any failures.
>
>
> Anyway, I tracked down the particular malloc() which gave changed
> behaviour depending on whether the malloc() did a memory initialization
> to NULs or not.

Well, GC hands out non-zeroed memory - the callers are responsible
for initializing it.  So the fix below is not a fix but papering over an
issue elsewhere.

Richard.

> It's this one (showing 3.2.3):
>
> C:\devel\gcc\gcc>cvs diff -l -c15
> cvs diff: Diffing .
> Index: ggc-page.c
> ===
> RCS file: c:\cvsroot/gcc/gcc/ggc-page.c,v
> retrieving revision 1.1.1.1
> diff -c -1 -5 -r1.1.1.1 ggc-page.c
> *** ggc-page.c  15 Feb 2006 10:22:25 -  1.1.1.1
> --- ggc-page.c  28 Nov 2009 14:13:41 -
> ***
> *** 640,670
>   #ifdef USING_MALLOC_PAGE_GROUPS
>     else
>       {
>         /* Allocate a large block of memory and serve out the aligned
>            pages therein.  This results in much less memory wastage
>            than the traditional implementation of valloc.  */
>
>         char *allocation, *a, *enda;
>         size_t alloc_size, head_slop, tail_slop;
>         int multiple_pages = (entry_size == G.pagesize);
>
>         if (multiple_pages)
>           alloc_size = GGC_QUIRE_SIZE * G.pagesize;
>         else
>           alloc_size = entry_size + G.pagesize - 1;
> !       allocation = xmalloc (alloc_size);
>
>         page = (char *) (((size_t) allocation + G.pagesize - 1) &
> -G.pagesize);
>         head_slop = page - allocation;
>         if (multiple_pages)
>           tail_slop = ((size_t) allocation + alloc_size) & (G.pagesize - 1);
>         else
>           tail_slop = alloc_size - entry_size - head_slop;
>         enda = allocation + alloc_size - tail_slop;
>
>         /* We allocated N pages, which are likely not aligned, leaving
>            us with N-1 usable pages.  We plan to place the page_group
>            structure somewhere in the slop.  */
>         if (head_slop >= sizeof (page_group))
>           group = (page_group *)page - 1;
>         else
> --- 640,670
>   #ifdef USING_MALLOC_PAGE_GROUPS
>     else
>       {
>         /* Allocate a large block of memory and serve out the aligned
>            pages therein.  This results in much less memory wastage
>            than the traditional implementation of valloc.  */
>
>         char *allocation, *a, *enda;
>         size_t alloc_size, head_slop, tail_slop;
>         int multiple_pages = (entry_size == G.pagesize);
>
>         if (multiple_pages)
>           alloc_size = GGC_QUIRE_SIZE * G.pagesize;
>         else
>           alloc_size = entry_size + G.pagesize - 1;
> !       allocation = xcalloc (1, alloc_size);
>
>         page = (char *) (((size_t) allocation + G.pagesize - 1) &
> -G.pagesize);
>         head_slop = page - allocation;
>         if (multiple_pages)
>           tail_slop = ((size_t) allocation + alloc_size) & (G.pagesize - 1);
>         else
>           tail_slop = alloc_size - entry_size - head_slop;
>         enda = allocation + alloc_size - tail_slop;
>
>         /* We allocated N pages, which are likely not aligned, leaving
>            us with N-1 usable pages.  We plan to place the page_group
>            structure somewhere in the slop.  */
>         if (head_slop >= sizeof (page_group))
>           group = (page_group *)page - 1;
>         else
>
>
> I suspect that it has stayed hidden for so long because most people
> probably have this mmap_anon:
>
> /* Define if mmap can get us zeroed pages using MAP_ANON(YMOUS). */
> #undef HAVE_MMAP_ANON
>
>
> #ifdef HAVE_MMAP_ANON
> # undef HAVE_MMAP_DEV_ZERO
>
> # include <sys/mman.h>
> # ifndef MAP_FAILED
> # define MAP_FAILED -1
> # endif
> # if !defined (MAP_ANONYMOUS) && defined (MAP_ANON)
> # define MAP_ANONYMOUS MAP_ANON
> # endif
> # define USING_MMAP
>
> #endif
>
> #ifndef USING_MMAP
> #define USING_MALLOC_PAGE_GROUPS
> #endif
>
>
> Seems pretty clear that zeroed pages are required.
>
> Looking at the code above and below this block, it uses xcalloc
> instead.
>
> It will take a couple of days to confirm that this is the last
> presumed bug affecting the MUSIC/SP port.
>
> In the meantime, can someone confirm that this code is wrong,
> and that xcalloc is definitely required?
>
> I had a look at the gcc4 code, and it is calling XNEWVEC which
> is also using xmalloc instead of xcalloc, so I presume it is still
> a problem today, under the right circumstances.
>
> It took
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
On Sat, Nov 28, 2009 at 4:26 PM, Tim Prince wrote:
> Toon Moene wrote:
>>
>> H.J. Lu wrote:
>>>
>>> On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene wrote:
>>>>
>>>> L.S.,
>>>>
>>>> Due to the discussion on register allocation, I went back to a hobby of
>>>> mine: Studying the assembly output of the compiler.
>>>>
>>>> For this Fortran subroutine (note: unless otherwise told to the Fortran
>>>> front end, reals are 32 bit floating point numbers):
>>>>
>>>>       subroutine sum(a, b, c, n)
>>>>       integer i, n
>>>>       real a(n), b(n), c(n)
>>>>       do i = 1, n
>>>>          c(i) = a(i) + b(i)
>>>>       enddo
>>>>       end
>>>>
>>>> with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
>>>>
>>>>         xorps   %xmm2, %xmm2
>>>>
>>>> .L6:
>>>>         movaps  %xmm2, %xmm0
>>>>         movaps  %xmm2, %xmm1
>>>>         movlps  (%r9,%rax), %xmm0
>>>>         movlps  (%r8,%rax), %xmm1
>>>>         movhps  8(%r9,%rax), %xmm0
>>>>         movhps  8(%r8,%rax), %xmm1
>>>>         incl    %ecx
>>>>         addps   %xmm1, %xmm0
>>>>         movaps  %xmm0, 0(%rbp,%rax)
>>>>         addq    $16, %rax
>>>>         cmpl    %ebx, %ecx
>>>>         jb      .L6
>>>>
>>>> I'm not a master of x86_64 assembly, but this strongly looks like
>>>> %xmm{0,1} have to be zero'd (%xmm2 is set to zero by xor'ing it with
>>>> itself), before they are completely filled with the mov{l,h}ps
>>>> instructions ?
>>>
>>> I think it is used to avoid partial SSE register stall.
>>>
>>
>> You mean there's no
>>
>>         movaps  (%r9,%rax), %xmm0
>>
>> (and mutatis mutandis for %xmm1) instruction (to copy 4*32 bits to the
>> register) ?
>>
> If you want those, you must request them with -mtune=barcelona.

Which would then get you movups (%r9,%rax), %xmm0 (unaligned move).
generic tuning prefers the split moves, AMD Fam10 and above handle
unaligned moves just fine.

Richard.
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
Richard Guenther wrote:

> On Sat, Nov 28, 2009 at 4:26 PM, Tim Prince wrote:
>
>> Toon Moene wrote:
>>
>>> H.J. Lu wrote:
>>>
>>>> On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene wrote:
>>>>
>>>>> L.S.,
>>>>>
>>>>> Due to the discussion on register allocation, I went back to a hobby
>>>>> of mine: Studying the assembly output of the compiler.
>>>>>
>>>>> For this Fortran subroutine (note: unless otherwise told to the
>>>>> Fortran front end, reals are 32 bit floating point numbers):
>>>>>
>>>>>       subroutine sum(a, b, c, n)
>>>>>       integer i, n
>>>>>       real a(n), b(n), c(n)
>>>>>       do i = 1, n
>>>>>          c(i) = a(i) + b(i)
>>>>>       enddo
>>>>>       end
>>>>>
>>>>> with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
>>>>>
>>>>>         xorps   %xmm2, %xmm2
>>>>>
>>>>> .L6:
>>>>>         movaps  %xmm2, %xmm0
>>>>>         movaps  %xmm2, %xmm1
>>>>>         movlps  (%r9,%rax), %xmm0
>>>>>         movlps  (%r8,%rax), %xmm1
>>>>>         movhps  8(%r9,%rax), %xmm0
>>>>>         movhps  8(%r8,%rax), %xmm1
>>>>>         incl    %ecx
>>>>>         addps   %xmm1, %xmm0
>>>>>         movaps  %xmm0, 0(%rbp,%rax)
>>>>>         addq    $16, %rax
>>>>>         cmpl    %ebx, %ecx
>>>>>         jb      .L6
>>>>>
>>>>> I'm not a master of x86_64 assembly, but this strongly looks like
>>>>> %xmm{0,1} have to be zero'd (%xmm2 is set to zero by xor'ing it with
>>>>> itself), before they are completely filled with the mov{l,h}ps
>>>>> instructions ?
>>>>
>>>> I think it is used to avoid partial SSE register stall.
>>>
>>> You mean there's no
>>>
>>>         movaps  (%r9,%rax), %xmm0
>>>
>>> (and mutatis mutandis for %xmm1) instruction (to copy 4*32 bits to the
>>> register) ?
>>
>> If you want those, you must request them with -mtune=barcelona.
>
> Which would then get you movups (%r9,%rax), %xmm0 (unaligned move).
> generic tuning prefers the split moves, AMD Fam10 and above handle
> unaligned moves just fine.

Correct, the movaps would have been used if alignment were recognized.
The newer CPUs achieve full performance with movups.
Do you consider Core i7/Nehalem as included in "AMD Fam10 and above?"
Re: i370 port - music/sp - possible generic gcc problem
>> Anyway, I tracked down the particular malloc() which gave changed
>> behaviour depending on whether the malloc() did a memory initialization
>> to NULs or not.
>
> Well, GC hands out non-zeroed memory - the callers are responsible
> for initializing it.  So the fix below is not a fix but papering over an
> issue elsewhere.

Hi Richard.

If GC does that, then how come there is all this effort to do
mmap testing to see if it has the facility to zero memory, and
why is the surrounding code (in GCC 4.4's alloc_page())
calling XCNEWVEC instead of XNEWVEC?

Thanks.  Paul.
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
On Sat, Nov 28, 2009 at 5:31 PM, Tim Prince wrote:
> Richard Guenther wrote:
>>
>> On Sat, Nov 28, 2009 at 4:26 PM, Tim Prince wrote:
>>>
>>> Toon Moene wrote:
>>>>
>>>> H.J. Lu wrote:
>>>>>
>>>>> On Sat, Nov 28, 2009 at 3:21 AM, Toon Moene wrote:
>>>>>>
>>>>>> L.S.,
>>>>>>
>>>>>> Due to the discussion on register allocation, I went back to a hobby
>>>>>> of mine: Studying the assembly output of the compiler.
>>>>>>
>>>>>> For this Fortran subroutine (note: unless otherwise told to the
>>>>>> Fortran front end, reals are 32 bit floating point numbers):
>>>>>>
>>>>>>       subroutine sum(a, b, c, n)
>>>>>>       integer i, n
>>>>>>       real a(n), b(n), c(n)
>>>>>>       do i = 1, n
>>>>>>          c(i) = a(i) + b(i)
>>>>>>       enddo
>>>>>>       end
>>>>>>
>>>>>> with -O3 -S (GCC: (GNU) 4.5.0 20091123), I get this (vectorized) loop:
>>>>>>
>>>>>>         xorps   %xmm2, %xmm2
>>>>>>
>>>>>> .L6:
>>>>>>         movaps  %xmm2, %xmm0
>>>>>>         movaps  %xmm2, %xmm1
>>>>>>         movlps  (%r9,%rax), %xmm0
>>>>>>         movlps  (%r8,%rax), %xmm1
>>>>>>         movhps  8(%r9,%rax), %xmm0
>>>>>>         movhps  8(%r8,%rax), %xmm1
>>>>>>         incl    %ecx
>>>>>>         addps   %xmm1, %xmm0
>>>>>>         movaps  %xmm0, 0(%rbp,%rax)
>>>>>>         addq    $16, %rax
>>>>>>         cmpl    %ebx, %ecx
>>>>>>         jb      .L6
>>>>>>
>>>>>> I'm not a master of x86_64 assembly, but this strongly looks like
>>>>>> %xmm{0,1} have to be zero'd (%xmm2 is set to zero by xor'ing it with
>>>>>> itself), before they are completely filled with the mov{l,h}ps
>>>>>> instructions ?
>>>>>>
>>>>> I think it is used to avoid partial SSE register stall.
>>>>>
>>>> You mean there's no
>>>>
>>>>         movaps  (%r9,%rax), %xmm0
>>>>
>>>> (and mutatis mutandis for %xmm1) instruction (to copy 4*32 bits to the
>>>> register) ?
>>>>
>>> If you want those, you must request them with -mtune=barcelona.
>>
>> Which would then get you movups (%r9,%rax), %xmm0 (unaligned move).
>> generic tuning prefers the split moves, AMD Fam10 and above handle
>> unaligned moves just fine.
>
> Correct, the movaps would have been used if alignment were recognized.
> The newer CPUs achieve full performance with movups.
> Do you consider Core i7/Nehalem as included in "AMD Fam10 and above?"

I'd have to consult the optimization manual of those, but HJ may know
off-head.

Richard.
Re: i370 port - music/sp - possible generic gcc problem
On Sat, Nov 28, 2009 at 5:35 PM, Paul Edwards wrote:
>>> Anyway, I tracked down the particular malloc() which gave changed
>>> behaviour depending on whether the malloc() did a memory initialization
>>> to NULs or not.
>
>> Well, GC hands out non-zeroed memory - the callers are responsible
>> for initializing it.  So the fix below is not a fix but papering over an
>> issue elsewhere.
>
> Hi Richard.
>
> If GC does that, then how come there is all this effort to do
> mmap testing to see if it has the facility to zero memory, and

I can't see what you are referring to.

> why is the surrounding code (in GCC 4.4's alloc_page())
> calling XCNEWVEC instead of XNEWVEC?

That's the page table entries, not the data itself.

There wouldn't be the need for ggc_alloc_cleared if ggc_alloc
would already zero pages.

Richard.

> Thanks.  Paul.
>
>
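A minimal sketch of the debugging trick described earlier in this thread (filling freshly allocated memory with a known non-zero byte), combined with the point above that clearing is the caller's job. The names poison_malloc and poison_malloc_cleared are hypothetical and are not GCC functions; the parallel with ggc_alloc and ggc_alloc_cleared is an assumption about intent, not a copy of GCC's code.

```c
#include <stdlib.h>
#include <string.h>

#define POISON_BYTE 0x01  /* the x'01' fill value mentioned in the thread */

/* Allocate size bytes and fill them with a recognizable non-zero pattern,
   so that code relying on "accidentally zero" memory fails consistently.
   Returns NULL if malloc fails. */
void *poison_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        memset(p, POISON_BYTE, size);
    return p;
}

/* Entry point for callers that really need zeroed memory: they ask for it
   explicitly instead of assuming the plain allocator provides it. */
void *poison_malloc_cleared(size_t size)
{
    void *p = poison_malloc(size);
    if (p != NULL)
        memset(p, 0, size);
    return p;
}
```

With an allocator like this, a caller that forgets to initialize a field sees 0x01 bytes instead of zeros, which tends to turn an intermittent failure into a repeatable one.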
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
Tim Prince wrote:

> If you want those, you must request them with -mtune=barcelona.

OK, so it is an alignment issue (with -mtune=barcelona):

.L6:
        movups  0(%rbp,%rax), %xmm0
        movups  (%rbx,%rax), %xmm1
        incl    %ecx
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%r8,%rax)
        addq    $16, %rax
        cmpl    %r10d, %ecx
        jb      .L6

Thanks,

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html
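For readers following the alignment point: movaps with a memory operand requires the address to be a multiple of 16 bytes, while movups does not. A small, hypothetical helper (is_aligned16 is not from GCC or any library) shows the property in question:

```c
#include <stdint.h>

/* Returns nonzero if p sits on a 16-byte boundary, which is what an
   aligned 128-bit load or store (movaps) needs for its memory operand. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

The compiler can only pick the aligned form when it can prove this property at compile time; for the dummy arguments a and b of the Fortran subroutine it apparently cannot, hence the split or unaligned loads.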
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
Toon Moene wrote:

> Tim Prince wrote:
>
>> If you want those, you must request them with -mtune=barcelona.
>
> OK, so it is an alignment issue (with -mtune=barcelona):
>
> .L6:
>         movups  0(%rbp,%rax), %xmm0
>         movups  (%rbx,%rax), %xmm1
>         incl    %ecx
>         addps   %xmm1, %xmm0
>         movaps  %xmm0, (%r8,%rax)
>         addq    $16, %rax
>         cmpl    %r10d, %ecx
>         jb      .L6

Once this problem is solved (well, determined how it could be solved),
we go on to the next, the extraneous induction variable %ecx.

There are two ways to deal with it:

1. Eliminate it with respect to the other induction variable that
   counts in the same direction (upwards, with steps 16) and remember
   that induction variable's (%rax) limit.

or:

2. Count %ecx down from %r10d to zero (which eliminates %r10d as a
   loop carried register).

g77 avoided this by coding counted do loops with a separate loop counter
counting down to zero - not so with gfortran (quoting):

/* Translate the simple DO construct.  This is where the loop variable has
   integer type and step +-1.  We can't use this in the general case
   because integer overflow and floating point errors could give incorrect
   results.
   We translate a do loop from:

   DO dovar = from, to, step
      body
   END DO

   to:

   [Evaluate loop bounds and step]
   dovar = from;
   if ((step > 0) ? (dovar <= to) : (dovar => to))
    {
      for (;;)
        {
          body;
   cycle_label:
          cond = (dovar == to);
          dovar += step;
          if (cond) goto end_label;
        }
      }
   end_label:

   This helps the optimizers by avoiding the extra induction variable
   used in the general case.  */

So either we teach the Fortran front end this trick, or we teach the
loop optimization the trick of flipping the sense of a(n otherwise
unused) induction variable

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html
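The two loop shapes contrasted in this message can be sketched in C. This is an illustration only: the function names are hypothetical, `from` is fixed at 1 and `step` at +1, and the first function is one reading of the g77 description (a separate trip counter counting down to zero), not g77's actual output.

```c
/* g77-style shape as described above: the loop is controlled by a
   separate trip counter that counts down to zero, while the index
   advances independently. */
void sum_trip_count(const float *a, const float *b, float *c, int n)
{
    int i = 0;       /* array index, counts up */
    int count;       /* trip counter, counts down to zero */
    for (count = n; count > 0; count--) {
        c[i] = a[i] + b[i];
        i++;
    }
}

/* gfortran "simple DO" shape from the quoted comment, specialized to
   from = 1 and step = +1: the exit test compares dovar with the bound. */
void sum_simple_do(const float *a, const float *b, float *c, int n)
{
    int dovar = 1;
    int to = n;
    if (dovar <= to) {
        for (;;) {
            int cond;
            c[dovar - 1] = a[dovar - 1] + b[dovar - 1];  /* body */
            cond = (dovar == to);
            dovar += 1;
            if (cond)
                break;                                   /* goto end_label */
        }
    }
}
```

Both functions compute the same result; they differ only in which variable the exit test depends on, and that difference is what the loop optimizer later has to work with.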
Re: On the x86_64, does one have to zero a vector register before filling it completely ?
Toon Moene wrote:

> Toon Moene wrote:
>
>> Tim Prince wrote:
>>
>>> If you want those, you must request them with -mtune=barcelona.
>>
>> OK, so it is an alignment issue (with -mtune=barcelona):
>>
>> .L6:
>>         movups  0(%rbp,%rax), %xmm0
>>         movups  (%rbx,%rax), %xmm1
>>         incl    %ecx
>>         addps   %xmm1, %xmm0
>>         movaps  %xmm0, (%r8,%rax)
>>         addq    $16, %rax
>>         cmpl    %r10d, %ecx
>>         jb      .L6
>
> Once this problem is solved (well, determined how it could be solved),
> we go on to the next, the extraneous induction variable %ecx.
>
> There are two ways to deal with it:
>
> 1. Eliminate it with respect to the other induction variable that
>    counts in the same direction (upwards, with steps 16) and remember
>    that induction variable's (%rax) limit.
>
> or:
>
> 2. Count %ecx down from %r10d to zero (which eliminates %r10d as a
>    loop carried register).
>
> g77 avoided this by coding counted do loops with a separate loop counter
> counting down to zero - not so with gfortran (quoting):
>
> /* Translate the simple DO construct.  This is where the loop variable has
>    integer type and step +-1.  We can't use this in the general case
>    because integer overflow and floating point errors could give incorrect
>    results.
>    We translate a do loop from:
>
>    DO dovar = from, to, step
>       body
>    END DO
>
>    to:
>
>    [Evaluate loop bounds and step]
>    dovar = from;
>    if ((step > 0) ? (dovar <= to) : (dovar => to))
>     {
>       for (;;)
>         {
>           body;
>    cycle_label:
>           cond = (dovar == to);
>           dovar += step;
>           if (cond) goto end_label;
>         }
>       }
>    end_label:
>
>    This helps the optimizers by avoiding the extra induction variable
>    used in the general case.  */
>
> So either we teach the Fortran front end this trick, or we teach the
> loop optimization the trick of flipping the sense of a(n otherwise
> unused) induction variable

This would have paid off more frequently in i386 mode, where there is a
possibility of integer register pressure in loops small enough for such
an optimization to succeed.
This seems to be among the types of optimizations envisioned for
run-time binary interpretation systems.
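The two options listed in the quoted message can be sketched at the C level. The names below are hypothetical, and the index advances by 4 floats per iteration to stand in for the 16-byte step of %rax; this is a picture of the transformation, not of what GCC's loop optimizer actually does.

```c
#include <stddef.h>

/* Adds one group of four floats; stands in for the vector body. */
static void add4(const float *a, const float *b, float *c)
{
    for (int j = 0; j < 4; j++)
        c[j] = a[j] + b[j];
}

/* As generated: two induction variables, idx (like %rax) and k (like %ecx). */
void sum_two_ivs(const float *a, const float *b, float *c, unsigned nvec)
{
    size_t idx = 0;
    unsigned k = 0;
    while (k < nvec) {
        add4(a + idx, b + idx, c + idx);
        idx += 4;
        k++;
    }
}

/* Option 1: keep only idx and compare it against a precomputed limit. */
void sum_one_iv(const float *a, const float *b, float *c, unsigned nvec)
{
    size_t limit = (size_t)nvec * 4;
    for (size_t idx = 0; idx < limit; idx += 4)
        add4(a + idx, b + idx, c + idx);
}

/* Option 2: count the iteration counter down to zero, so the bound
   no longer has to stay live inside the loop. */
void sum_count_to_zero(const float *a, const float *b, float *c, unsigned nvec)
{
    size_t idx = 0;
    for (unsigned k = nvec; k != 0; k--) {
        add4(a + idx, b + idx, c + idx);
        idx += 4;
    }
}
```

All three loops perform the same additions in the same order; they differ only in how many values must be kept in registers across iterations.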
avr: optimizing assignment to a bit field
When assigning a bool to a single bit of a bitfield located in the
bit-addressable region of memory, better code is produced by

        if (flag)
                bitfield.bit = true;
        else
                bitfield.bit = false;

than

        bitfield.bit = flag;

I've included a short test and the assembler output by both forms.
Should I file a bug suggesting a possible improvement here?

Cheers,
Shaun

#include <stdbool.h>
#include <stdint.h>

struct byte { uint8_t x0:1; uint8_t x1:1; uint8_t x2:1; uint8_t x3:1;
        uint8_t x4:1; uint8_t x5:1; uint8_t x6:1; uint8_t x7:1; };

volatile struct byte *const porte = (void*)0x23;

void set_flag_good(bool flag)
{
        if (flag)
                porte->x6 = true;
        else
                porte->x6 = false;
}

void set_flag_bad(bool flag)
{
        porte->x6 = flag;
}

00000000 <set_flag_good>:
   0:   88 23           and     r24, r24
   2:   01 f4           brne    .+0             ; 0x4
                        2: R_AVR_7_PCREL        .text+0x8
   4:   1e 98           cbi     0x03, 6         ; 3
   6:   08 95           ret
   8:   1e 9a           sbi     0x03, 6         ; 3
   a:   08 95           ret

0000000c <set_flag_bad>:
   c:   81 70           andi    r24, 0x01       ; 1
   e:   82 95           swap    r24
  10:   88 0f           add     r24, r24
  12:   88 0f           add     r24, r24
  14:   80 7c           andi    r24, 0xC0       ; 192
  16:   93 b1           in      r25, 0x03       ; 3
  18:   9f 7b           andi    r25, 0xBF       ; 191
  1a:   98 2b           or      r25, r24
  1c:   93 b9           out     0x03, r25       ; 3
  1e:   08 95           ret
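A host-side sketch, assuming an ordinary struct in RAM stands in for the memory-mapped port at 0x23 (which only exists on the AVR). It shows that the two source forms have the same effect on the bit, so the difference reported above is purely one of code quality. The function names and the pointer parameter are illustrative additions, not part of the original test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same layout as the struct in the report: eight one-bit fields. */
struct byte { uint8_t x0:1; uint8_t x1:1; uint8_t x2:1; uint8_t x3:1;
              uint8_t x4:1; uint8_t x5:1; uint8_t x6:1; uint8_t x7:1; };

/* Branching form: each arm stores a constant, which on AVR can map
   directly to a single sbi or cbi instruction. */
void set_x6_branch(volatile struct byte *port, bool flag)
{
    if (flag)
        port->x6 = true;
    else
        port->x6 = false;
}

/* Direct form: stores a run-time value into the bit, which the compiler
   in the report turned into a read-modify-write sequence. */
void set_x6_assign(volatile struct byte *port, bool flag)
{
    port->x6 = flag;
}
```

On the real hardware the read-modify-write sequence of the direct form is also not atomic with respect to interrupts touching other bits of the same port, which is one more reason the sbi/cbi form is attractive there.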
Re: avr: optimizing assignment to a bit field
On Sat, Nov 28, 2009 at 11:43 PM, Shaun Jackman wrote:
> When assigning a bool to a single bit of a bitfield located in the
> bit-addressable region of memory, better code is produced by
>        if (flag)
>                bitfield.bit = true;
>        else
>                bitfield.bit = false;
> than
>        bitfield.bit = flag;
>
> I've included a short test and the assembler output by both forms.
> Should I file a bug suggesting a possible improvement here?

Yes, a bugreport is useful - but there might be a bug with this issue
already.

Richard.

> Cheers,
> Shaun
>
> #include <stdbool.h>
> #include <stdint.h>
>
> struct byte { uint8_t x0:1; uint8_t x1:1; uint8_t x2:1; uint8_t x3:1;
>        uint8_t x4:1; uint8_t x5:1; uint8_t x6:1; uint8_t x7:1; };
>
> volatile struct byte *const porte = (void*)0x23;
>
> void set_flag_good(bool flag)
> {
>        if (flag)
>                porte->x6 = true;
>        else
>                porte->x6 = false;
> }
>
> void set_flag_bad(bool flag)
> {
>        porte->x6 = flag;
> }
>
>
> 00000000 <set_flag_good>:
>   0:   88 23           and     r24, r24
>   2:   01 f4           brne    .+0             ; 0x4
>                        2: R_AVR_7_PCREL        .text+0x8
>   4:   1e 98           cbi     0x03, 6         ; 3
>   6:   08 95           ret
>   8:   1e 9a           sbi     0x03, 6         ; 3
>   a:   08 95           ret
>
> 0000000c <set_flag_bad>:
>   c:   81 70           andi    r24, 0x01       ; 1
>   e:   82 95           swap    r24
>  10:   88 0f           add     r24, r24
>  12:   88 0f           add     r24, r24
>  14:   80 7c           andi    r24, 0xC0       ; 192
>  16:   93 b1           in      r25, 0x03       ; 3
>  18:   9f 7b           andi    r25, 0xBF       ; 191
>  1a:   98 2b           or      r25, r24
>  1c:   93 b9           out     0x03, r25       ; 3
>  1e:   08 95           ret
>
Re: i370 port - music/sp - possible generic gcc problem
>> If GC does that, then how come there is all this effort to do
>> mmap testing to see if it has the facility to zero memory, and
>
> I can't see what you are referring to.

I obviously misinterpreted that then.

>> why is the surrounding code (in GCC 4.4's alloc_page())
>> calling XCNEWVEC instead of XNEWVEC?
>
> That's the page table entries, not the data itself.
>
> There wouldn't be the need for ggc_alloc_cleared if ggc_alloc
> would already zero pages.

Ok, based on this, I traced it back further:

rtx
gen_rtx_fmt_e0 (code, mode, arg0)
     RTX_CODE code;
     enum machine_mode mode;
     rtx arg0;
{
  rtx rt;
  rt = ggc_alloc_rtx (2);
  memset (rt, 0, sizeof (struct rtx_def) - sizeof (rtunion));

rtx
gen_rtx_MEM (mode, addr)
     enum machine_mode mode;
     rtx addr;
{
  rtx rt = gen_rtx_raw_MEM (mode, addr);

  /* This field is not cleared by the mere allocation of the rtx, so
     we clear it here.  */
  MEM_ATTRS (rt) = 0;

  return rt;
}

void
init_caller_save ()
{
...
  for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
    for (mode = 0 ; mode < MAX_MACHINE_MODE; mode++)
      if (HARD_REGNO_MODE_OK (i, mode))
        {
          rtx mem = gen_rtx_MEM (mode, address);

And indeed, this code - caller_save, shows up as one of the symptoms.
One of the symptoms is that where it is meant to be saving a register
prior to a function call, it instead generates a completely stupid
instruction - and that stupid instruction happens to be the first
instruction listed in i370.md.

Right at the moment the symptom is:

/ID SAVE-JOB-123456 @BLD000
/SYS REGION=,XREGION=64M
/FILE SYSPRINT PRT OSRECFM(F) OSLRECL(256)
/FILE SYSIN N(MATH.C) SHR
/FILE O N(MATH.S) NEW RECFM(V) SP(1000)
/FILE SYSINCL PDS(*.H)
/FILE INCLUDE PDS(*.H)
/PARM -Os -S -DUSE_MEMMGR -DMUSIC -o dd:o -
/LOAD XMON

: In function `acos':
:137: Internal compiler error in ?, at :724
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://gcc.gnu.org/bugs.html> for instructions.

compiling something from my C runtime library.

But I basically get different behaviour depending on what I initialize
uninitialized memory to.

Any ideas on what to look for now?  I can probably isolate which bit
of the malloc()ed memory is being referenced before initialization.
But finding out who is doing the reference could be tricky.

I might be able to trace it from a different approach, getting
more information about that internal error, now that I know
where some of the involved memory is.  I'll get that converted
into a PC filename first.

BFN.  Paul.
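A hypothetical miniature of the pattern in the excerpts quoted in this message: the allocator clears only the fixed header of a node, and whoever builds the node must set every operand slot itself. None of the names below (struct node, alloc_node, make_mem, CODE_MEM) come from GCC; the 0x01 fill simulates the garbage that non-zeroed memory may contain.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { CODE_MEM = 1 };              /* illustrative node code */

union operand { void *ptr; int i; };

struct node {
    int code;
    int mode;
    union operand fld[];            /* operand slots follow the header */
};

/* Allocate a node with nops operand slots.  Only the header is cleared,
   in the spirit of memset (rt, 0, sizeof (struct rtx_def) - sizeof (rtunion));
   the operand slots keep whatever the memory held (here: 0x01 bytes). */
struct node *alloc_node(int code, int mode, size_t nops)
{
    size_t size = offsetof(struct node, fld) + nops * sizeof(union operand);
    struct node *n = malloc(size);
    if (n == NULL)
        return NULL;
    memset(n, 0x01, size);                      /* simulate stale contents */
    memset(n, 0, offsetof(struct node, fld));   /* clear the header only */
    n->code = code;
    n->mode = mode;
    return n;
}

/* Build a MEM-like node: slot 0 is the address, slot 1 an attribute
   pointer that must be cleared explicitly, like MEM_ATTRS (rt) = 0. */
struct node *make_mem(int mode, void *addr)
{
    struct node *n = alloc_node(CODE_MEM, mode, 2);
    if (n != NULL) {
        n->fld[0].ptr = addr;
        n->fld[1].ptr = NULL;
    }
    return n;
}
```

Any constructor that skips the explicit clear leaves a slot holding stale bytes, which fits the symptom of behaviour changing with the fill value of uninitialized memory; finding which constructor does so is the open question in this message.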
Re: avr: optimizing assignment to a bit field
2009/11/28 Richard Guenther :
> On Sat, Nov 28, 2009 at 11:43 PM, Shaun Jackman wrote:
>> When assigning a bool to a single bit of a bitfield located in the
>> bit-addressable region of memory, better code is produced by
>>        if (flag)
>>                bitfield.bit = true;
>>        else
>>                bitfield.bit = false;
>> than
>>        bitfield.bit = flag;
>>
>> I've included a short test and the assembler output by both forms.
>> Should I file a bug suggesting a possible improvement here?
>
> Yes, a bugreport is useful - but there might be a bug with this issue
> already.

Reported here:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42210

Cheers,
Shaun
Re: i370 port - music/sp - possible generic gcc problem
> : In function `acos':
> :137: Internal compiler error in ?, at :724
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See <URL:http://gcc.gnu.org/bugs.html> for instructions.

> I might be able to trace it from a different approach, getting
> more information about that internal error, now that I know
> where some of the involved memory is.  I'll get that converted
> into a PC filename first.

6 hours of compilation later, I was unsuccessful in getting the
proper filename displayed.  As far as I can tell, it's aborting
but not displaying any output.  ie randomly displaying the
above message.

However, not to worry, since there's only one line 724 with
an abort() in it, and it's this bit of code:

static int
insert_save (chain, before_p, regno, to_save, save_mode)
     struct insn_chain *chain;
     int before_p;
     int regno;
     HARD_REG_SET *to_save;
     enum machine_mode *save_mode;
{
  int i;
  unsigned int k;
  rtx pat = NULL_RTX;
  int code;
  unsigned int numregs = 0;
  struct insn_chain *new;
  rtx mem;

  /* A common failure mode if register status is not correct in the RTL
     is for this routine to be called with a REGNO we didn't expect to
     save.  That will cause us to write an insn with a (nil) SET_DEST
     or SET_SRC.  Instead of doing so and causing a crash later, check
     for this common case and abort here instead.  This will remove one
     step in debugging such problems.  */

  if (regno_save_mem[regno][1] == 0)
    abort ();

which is in the same file as the init_caller_save() function
that allocated the memory in the first place.

One fortunate thing is that this source file is under 1000 lines
long so hopefully amenable to debugging.

BFN.  Paul.