Re: compiling very large functions.

2006-11-05 Thread Geert Bosch


On Nov 5, 2006, at 08:46, Kenneth Zadeck wrote:

The thing is that even as memories get larger, something has to give.
There are and will always be programs that are too large for the most
aggressive techniques, and my proposal is simply a way to gracefully
shed the most expensive techniques as programs get very large.

The alternative is to just shelve these bugs and tell the
submitter not to use optimization on them.  I do not claim to know
what the right approach is.


For compiling very large programs, lower optimization levels
(-O1 -Os) already tend to be competitive with -O2. More often
than not, -O3 does not improve results or even results in
slower code. Performance becomes dominated by how much code
can fit in L2, TLB or RAM.

Ideally, we would always compile code that is executed infrequently
using -Os to minimize memory footprint, and always compile code
in loops with many iterations using high optimization levels.

Kenneth's proposal to trigger different optimization strategies
in each function based on certain statistics seems an excellent
step to allow compilations to be more balanced. This not
only helps in reducing compile time (mostly through reduced
memory usage), but may also improve generated code.

For the rare program with huge performance-critical functions,
we can either add user-adjustable parameters, new optimization
levels, or use profile-feedback to find out about such
programs.

For most programs though, more balanced optimization allows
the compiler to aggressively optimize code that matters, while
minimizing code size and compile time for large swaths of
code that are uninteresting.

  -Geert


Re: Threading the compiler

2006-11-10 Thread Geert Bosch

Most people aren't waiting for compilation of single files.
If they do, it is because a single compilation unit requires
parsing/compilation of too many unchanging files, in which case
the primary concern is avoiding redoing useless compilation.

The common case is that people just don't use the -j feature
of make because
  1) they don't know about it
  2) their IDE doesn't know about it
  3) they got burned by bad Makefiles
  4) it's just too much typing

Making single compilations more complex through threading
seems wrong. Right now, in each compilation, we invoke the
compiler driver (gcc), which invokes the front end and
then the assembler. All these processes need to be
initialized, need to communicate, clean up etc.
While one might argue to use "gcc -pipe" for more parallelism,
I'd guess we win more by writing object files directly to disk
like virtually every other compiler on the planet.

Just compiling
  int main() { puts ("Hello, world!"); return 0; }
takes 342 system calls on my Linux box, most of them
related to creating processes, repeated dynamic linking,
and other initialization stuff, and reading and writing
temporary files for communication.

For every instruction processed, we call printf
to produce nicely formatted output with decimal operands
which later gets parsed again into binary format.
Ideally, we'd just do one read of the source and
one write of the object. Then we'd have far below
100 system calls for the entire compilation.

Most of my compilations (on Linux, at least) use close
to 100% of CPU. Adding more overhead for threading and
communication/synchronization can only hurt.

  -Geert


Re: Threading the compiler

2006-11-13 Thread Geert Bosch

On Nov 11, 2006, at 03:21, Mike Stump wrote:
The cost of my assembler is around 1.0% (ppc) to 1.4% (x86)
overhead as measured with -pipe -O2 on expr.c.  If it was
converted, what type of speedup would you expect?


Given that CPU usage is at 100% now for most jobs, such as
bootstrapping GCC, there is not much room for any improvement
through threading. Even in the best case, large parts of the
compilation will still be serial.
In the non-optimizing case, which is so important for the
compile-debug-edit cycle, almost no parallelism will be possible.

With LTO, all the heavy lifting will be done at link
time, with the initial compilation stripped down to
the essentials. Writing out the intermediate representation
to .o files directly instead of going through assembly
may make more of a difference then.

Without invoking the assembler, the number of minor page faults
gets reduced by about 10% on Linux.
Costs associated with system calls and page faults tend
not to scale very well with higher numbers of processors and
parallel tasks. So, I think the current approach of using
very coarse-grained parallelism is most efficient.
Also, on systems such as Windows and (cough) OpenVMS,
spawning processes and performing I/O tend to be more
heavyweight.

The main place where threading may make sense, especially
with LTO, is the linker. This is a longer-lived task, and
is the last step of compilation, where no other parallel
processes are active. Moreover, linking tends to be I/O
intensive, so a number of threads will likely be blocked
for I/O.

  -Geert


Re: Threading the compiler

2006-11-13 Thread Geert Bosch


On Nov 13, 2006, at 21:27, Dave Korn wrote:
  To be fair, Mike was talking about multi-core SMP, not threading
on a single CPU, so given that CPU usage is at 100% now for most jobs,
there is an Nx100% speedup to gain from using 1 thread on each of N
cores.


I'm mostly building GCC on multiprocessor machines and typically
run make -j (number_of_processors + 1).



The main place where threading may make sense, especially
with LTO, is the linker. This is a longer-lived task, and
is the last step of compilation, where no other parallel
processes are active. Moreover, linking tends to be I/O
intensive, so a number of threads will likely be blocked
for I/O.


  I'm not really sure how this would play with SMP (as opposed to
threading). I don't see why you think threading could be particularly
useful in the linker?  It's the pipeline of compiler optimisation
passes that looks like an obvious candidate for threading to me.


This would be when we do link-time optimizations. These optimizations
will be on much larger datasets and occur when no other work remains
to be done in parallel.

  -Geert


Re: Threading the compiler

2006-11-14 Thread Geert Bosch


On Nov 14, 2006, at 12:49, Bill Wendling wrote:

I'll mention a case where compilation was wickedly slow even
when using -j#. At The MathWorks, the system could take >45 minutes
to compile. (This was partially due to the fact that the files were
located on an NFS mounted drive. But also because C++ compilation
is way slower than C compilation.) Experiments with distributed
make showed promise.


When using -j#, with enough parallel processes and assuming
sufficient memory, you should be able to reach the point where you're
either limited by the NFS server or by CPU availability. No amount of
threading in the compiler will remove either bottleneck.

  -Geert


Re: MPFR precision when FLT_RADIX != 2

2006-12-04 Thread Geert Bosch


On Dec 3, 2006, at 12:44, Kaveh R. GHAZI wrote:


In case i370 support is revived or a format not using base==2 is
introduced, I could proactively fix the MPFR precision setting for any
base that is a power of 2 by multiplying the target float precision by
log2(base).  In the i370 case I would multiply by log2(16) which is 4.
When base==2, then the log2(2) is 1 so the multiplication simplifies
to the current existing behavior.


That would not be correct, as the actual precision in bits of
a base==16 floating-point number depends on the magnitude of the number.
The gaps between adjacent hexadecimal floating-point numbers can be
2, 4 or 8 times that of binary floats with a mantissa of the same
number of bits (including any implicit leading 1's).

Example: In a floating-point format with 24 binary digits, the even
integers 0x1000002 through 0x100000e would be representable, while on
a system with 6 hexadecimal digits there would be a gap between
0x1000000 and 0x1000010.

So, while the sum 16777216.0 + 2.0 does not depend on rounding
direction in IEEE single precision math, it would depend on
rounding direction for the IBM 370's single precision type.

For GCC's purpose, it seems that hexadecimal floating-point systems
can be regarded as a historical curiosity and adding significant
complexity for supporting some optimizations for them seems not
worth the distributed cost of maintenance. For IBM 370 math, we
should always either:
  - call library functions for evaluation
  - convert to IEEE, operate, convert back

  -Geert



Re: Gfortran and using C99 cbrt for X ** (1./3.)

2006-12-05 Thread Geert Bosch


On Dec 4, 2006, at 20:19, Howard Hinnant wrote:
If that is the question, I'm afraid your answer is not accurate.
In the example I showed the difference is 2 ulp.  The difference
appears to grow with the magnitude of the argument.  On my systems,
when the argument is DBL_MAX, the difference is 75 ulp.


pow(DBL_MAX, 1./3.) = 0x1.428a2f98d7240p+341
cbrt(DBL_MAX)   = 0x1.428a2f98d728bp+341

And yes, I agree with you about the C99 standard.  It allows the
vendor to compute pretty much any answer it wants from either pow
or cbrt.  Accuracy is not mandated.  And I'm not trying to mandate
accuracy for Gfortran either.  I just had a knee-jerk reaction when
I read that pow(x, 1./3.) could be optimized to cbrt(x) (and on
re-reading, perhaps I inferred too much right there).  This isn't
just an optimization.  It is also an approximation.  Perhaps that is
acceptable.  I'm only highlighting the fact in case it might be
important but not recognized.


This is really a very similar case to using a fused multiply-add
instead of two separate instructions with a separate rounding.
In computing cbrt(x) instead of pow(x, 1.0/3.0), we omit the rounding
of 1.0/3.0 and only round after computing the cube root.

As you showed for IEEE double, for cbrt the relative difference is
still quite small. For fused multiply-add, cancellation can make
the relative difference arbitrarily large.

Still, this is not nearly as bad as the re-association of summands,
which throws away any hope of being able to determine the accuracy of
computations due to catastrophic cancellation. This is one
of the biggest issues with -funsafe-math-optimizations.

Maybe we should have an option between IEEE math and
-funsafe-math-optimizations that says that for the standard
arithmetic operations *, / and sqrt, the result can be either
the correctly rounded floating-point value or the exact result?
This would allow useful optimizations such as
   X * 4.0 / 2.0 -> X + X  (otherwise invalid in case of overflow)
   X / 2.0 * 4.0 -> X + X  (otherwise invalid in case of overflow)
   X * Y + Z -> fma(X, Y, Z)
   X + Y - Y -> X
   X / (1.0 / Y) -> X * Y
   pow (sqrt (X), 2.0) -> X
   sqrt (X * X) -> X
   pow (x, 1.0 / 3.0)  -> cbrt (x)
In addition we should have a built-in identity function that
explicitly evaluates its argument to a correctly rounded
floating-point number. That would allow users (or front ends)
to selectively avoid contractions where necessary. Currently
the only brute-force workaround is using volatile variables.
It would be great to keep the value in registers instead,
but still prevent other clever optimizations from wreaking havoc.

  -Geert


Re: -fwrapv enables some optimizations

2006-12-20 Thread Geert Bosch


On Dec 20, 2006, at 09:38, Bruno Haible wrote:

But the other way around? Without -fwrapv the compiler can assume more
about the program being compiled (namely that signed integer overflows
don't occur), and therefore has more freedom for optimizations. All
optimizations that are possible with -fwrapv should also be performed
without -fwrapv. Anything else is a missed optimization.


This is completely wrong. Making operations undefined is a two-edged
sword. On the one hand, you can make more assumptions, but there's
also the issue that when you want to rewrite expressions, you have
to be more careful not to introduce undefined behavior where there
was none before.

The canonical example is addition of signed integers. This operation
is associative with -fwrapv, but not without.

So
  a = b + C1 + c + C2;
could be written as
  a = b + c + (C1 + C2);
where the constant addition is performed at compile time.
With signed overflow undefined, you can't do any reassociation,
because this might introduce overflows where none existed before.

Probably we would want to lower many expressions to unsigned
eventually, but the question of when and where to do it emphasizes
that you can only take advantage of undefined behavior if you make
sure you don't introduce any.

Sometimes I think it would be far better to have a default of -fwrapv
at -O1 and possibly -Os. Sure, this would disable some powerful
optimizations, especially those involving loops, but it would in
practice be very useful to get reasonably good optimization while
minimizing the number of programs with undefined behavior.
Also, it would allow some new optimizations, so the total loss of
performance may be quite acceptable.

Since -fwrapv only transforms programs with undefined behavior into
programs with implementation-defined behavior, nobody can
possibly complain about their programs suddenly doing something
different.

Also, for safety-critical programs and certification, it is essential
to be able to reason about program behavior. Limiting the set of
programs with erroneous or undefined execution is essential.
If you want to prove that a program doesn't cause undefined behavior,
it is very helpful for signed integer overflow to be defined, even if
it's just implementation-defined. That would be a huge selling point
for GCC.

  -Geert


Re: g++ doesn't unroll a loop it should unroll

2006-12-29 Thread Geert Bosch


On Dec 13, 2006, at 17:09, Denis Vlasenko wrote:


# g++ -c -O3 toto.cpp -o toto.o
# g++ -DUNROLL -O3 toto.cpp -o toto_unroll.o -c
# size toto.o toto_unroll.o
   text    data     bss     dec     hex filename
    525       8       1     534     216 toto.o
    359       8       1     368     170 toto_unroll.o

How can C++ compiler know that you are willing to trade
so much of text size for performance?


Huh? The unrolled version is 30% smaller, isn't it?

  -Geert



Re: changing "configure" to default to "gcc -g -O2 -fwrapv ..."

2007-01-01 Thread Geert Bosch


On Dec 31, 2006, at 19:13, Daniel Berlin wrote:

Note the distinct drop in performance across almost all the benchmarks
on Dec 30, including popular programs like bzip2 and gzip.

Not so.

To my eyes, the SPECint 2000 mean went UP by about 1% for the
base -O3 compilation. The peak enabled more unrolling, which
is helped by additional range information provided by the absence
of -fwrapv.

So, I'd say this run would suggest enabling -fwrapv for
at least -O1 and -O2. Also, note that we have never
focused on performance with -fwrapv, and it is quite
likely that significant improvement is possible.

I'd really like using -fwrapv by default for -O, -O[s12].
The benefit of many programs moving from "undefined semantics"
to "implementation-defined semantics, overflow wraps like in
old compilers" far outweighs even an average performance loss
of 2% as seen in specfp.

As undefined execution can result in arbitrary badness,
this is really at odds with the increasing need for many
programs to be secure. Since it is almost impossible to
prove that programs do not have signed integer overflow,
it makes far more sense to define behavior in such cases.
Note that we're talking about defaults: for not-so-sophisticated
programmers, we should focus on being safe. People smart
enough to prove their program can't cause signed integer
overflow can certainly figure out the compiler options to
disable -fwrapv.

  -Grt



Re: changing "configure" to default to "gcc -g -O2 -fwrapv ..."

2007-01-01 Thread Geert Bosch


On Jan 1, 2007, at 12:16, Joseph S. Myers wrote:
For a program to be secure in the face of overflow, it will
generally need explicit checks for overflow, and so -fwrapv will
only help if such checks have been written under the presumption
of -fwrapv semantics.


Yes, but often people do write such defensive code,
and many such checks now get removed by gcc.

If I compute some value, I may check the result
before accessing an array or similar. Such local
defenses are of no use with current gcc without
-fwrapv.

  -Grt


Re: changing "configure" to default to "gcc -g -O2 -fwrapv ..."

2007-01-02 Thread Geert Bosch

On Jan 1, 2007, at 21:14, Ian Lance Taylor wrote:

[...]
extern void bar (void);
void
foo (int m)
{
  int i;
  for (i = 1; i < m; ++i)
    {
      if (i > 0)
        bar ();
    }
}

Here the limit for i without -fwrapv becomes (1, INF].  This enables
VRP to eliminate the test "i > 0".  With -fwrapv, this test of course
can not be eliminated.  VRP is the only optimization pass which is
able to eliminate that test.


We should be able to optimize this even for -fwrapv.
For m <= 1, the loop will not execute at all; for larger
m, execution will stop when i == m, after m - 1 executions
of the loop body. The condition i > 0 will never be false,
regardless of -fwrapv, as there won't be overflow.

  -Grt



Re: Find the longest float type nodes

2007-03-19 Thread Geert Bosch

On Mar 19, 2007, at 05:44, François-Xavier Coudert wrote:

I have the three following questions, probably best directed to
middle-end experts and Ada maintainers:

 * How can I know the longest float type? My first patch uses the
long_double_type_node unconditionally, but it surely isn't a generic
solution

In particular, that has the problem of long double often being a
type that is implemented in software only. For Ada, we use
WIDEST_HARDWARE_FP_SIZE if defined, or LONG_DOUBLE_TYPE_SIZE
otherwise.

Using the widest hardware type will at least guarantee that
the hardware doesn't implicitly use a wider type. However, it's
unnecessarily conservative: on x86 with -mfpmath=sse for example,
we could use any type without problems. Similarly, on most
architectures other than x86 and PPC, extra precision is never used.


 * How can I determine if a given type may have extra precision?

We should somehow add a new predicate that would give this information
per type. Something like FP_TYPE_ROUNDS_P that is true for any type
that always rounds results might be useful. The predicate should
probably default to TRUE, and be overridden for single and double
precision types on non-SSE x86 and x86-64, as well as single precision
types for some PPC configurations.


 * What is this padding code doing, and is it necessary?

The Ada front end sometimes adds padding to a type if an explicit
alignment is requested (through use of attributes in the source code)
that is higher than the default. As the FIXME comment indicates,
this should be going away. Don't worry about it for Fortran.

  -Geert


Re: GIMPLE tuples document uploaded to wiki

2007-04-25 Thread Geert Bosch

In 3.1, you write:

The statistics gathered over the programs mentioned in the
previous section show that about 43% of all statements contain
0 or more register operands


I'd assume 100% contain 0 or more register operands.
Did you mean 43% contain 1 or more?

  -Geert


Re: Performance analysis of Polyhedron/gas_dyn

2007-04-27 Thread Geert Bosch


On Apr 27, 2007, at 06:12, Janne Blomqvist wrote:
I agree it can be an issue, but OTOH people who care about
precision probably 1. avoid -ffast-math 2. use double precision
(where these reciprocal instrs are not available). Intel calls it
-no-prec-div, but it's enabled for the "-fast" catch-all option.


On a related note, our beloved competitors generally have some
high-level flag for combining all these fancy and potentially unsafe
optimizations (e.g. -O4, -fast, -fastsse, -Ofast, etc.). For gcc,
at least FP benchmarks seem to do generally well with something
like "-O3 -funroll-loops -ftree-vectorize -ffast-math -march=native
-mfpmath=sse", but it's quite a mouthful.


No, using only 12 bits of precision is just ridiculous and should
not be included in -ffast-math. You should always use a Newton-Raphson
step after getting the 12-bit approximation. When done correctly
this doubles the precision and gets you just about the 24 bits of
precision needed for float. Reciprocal approximations are meant
to be used that way, and it's no accident the lookup provides
exactly half the bits needed. For double precision you just do
two more iterations, which is why there is no need for double
precision variants of these instructions.

The cost for the extra step is small, and you get good results.
There are many variations possible, and using fused multiply-add
it's even possible to get correctly rounded results at low cost.
I truly doubt that any of the compilers you mention use these
instructions without NR iteration to get the required precision.

  -Geert


Re: Using fold() in frontends

2005-03-07 Thread Geert Bosch
On Mar 7, 2005, at 12:40, Giovanni Bajo wrote:
But how are you proposing to handle the fact that the C++ FE needs
to fold constant expressions (in the ISO C++ sense of 'constant
expressions')? For instance, we need to fold "1+1" into "2" much
before gimplification. Should a part of fold() be extracted and
duplicated in the C++ frontend?
Yes. As we've found out over and over again, the semantics of fold
in the various front ends and the middle end is not the same. While
it is nice to share code, when we end up placing an arbitrary
restriction on ourselves that the same tool must be used for both
hammering nails and driving screws, any advantage of code reuse
quickly disappears.

So, even if we end up copying a lot of the code, that's better than
forced sharing. After a while, both copies will indeed diverge and
you'll end up with a good hammer and a good screwdriver. Right now
we're in a situation where improvement of one use of fold causes
problems for its other use.

  -Geert


Re: [OT] __builtin_cpow((0,0),(0,0))

2005-03-10 Thread Geert Bosch
On Mar 9, 2005, at 03:18, Duncan Sands wrote:
if the Ada front-end has an efficient, accurate implementation
of x^y, wouldn't it make sense to move it to the back-end
(__builtin_pow) so everyone can benefit?
It does not have one yet. The current implementation is reasonably
accurate, but not very fast. However, I'm working on a rewrite of
these functions that matches Ada strict mode accuracy requirements
and is reasonably fast. However, there is too much difference
between C and Ada requirements for the power function for it to
make sense to share the implementation. Ada requires raising an
exception in certain circumstances, but doesn't care about errno.
Also, for Ada the rounding mode always is round-to-even.

Finally, even though the Ada math library conforms to Ada strict
mode requirements (typically 2 or 4 eps), it doesn't try to have
these functions be correctly rounded. This allows significant
simplifications of the function evaluation, making it efficient to
implement even on systems without support for fused multiply-add.

  -Geert


Re: converting Ada to handle USE_MAPPED_LOCATION

2005-03-19 Thread Geert Bosch
Hi Per,
Of the three proposals:
[...]
The ideal solution I think is for Ada to use line-map's
source_location for Sloc in its lexer.
[...]
translate Sloc integers to source_location
when we translate the Ada internal format to Gcc trees.
[...]
the location_t in the shared Gcc should be a language-defined
opaque type, and have language call-backs.
and have language call-backs.
The first one really is out for the Ada maintainers, as this would
couple the front end far too tightly to the back end and lose the
nice property that the exact same front end sources (excluding the
few C files for tree conversion) can be used for any GCC back end
from 2.8.1 to 4.1 without modifications, as well as non-GCC back
ends.

The second one wouldn't be my preferred choice, as it adds
complexity for no gain, but because the code can remain localized
to the few C files interfacing the front end to the back end, this
would be acceptable.

The last would be far preferred, as it would not tie front ends in
so much to the back end, while still allowing sharing of the line
map implementation when desired. It seems both easiest to implement
and cleanest. Clean separation from the back end is important for
languages maintained outside the GCC tree.

  -Geert


Re: Ada and ARM build assertion failure

2005-03-21 Thread Geert Bosch
On Mar 21, 2005, at 02:54, Nick Burrett wrote:
This seems to be a recurrence of PR5677.
I'm sorry, but I can't see any way this is related; could you
elaborate?
for Aligned_Word'Alignment use
- Integer'Min (2, Standard'Maximum_Alignment);
+ Integer'Min (4, Standard'Maximum_Alignment);
This patch is wrong, as it implicitly increases the size of
Aligned_Word from 2 to 4 bytes: size is always a multiple of the
alignment. However, it is really dubious that you need to change
this package, as it is only used for DEC Ada compatibility on VMS
systems.

  -Geert


Re: Ada and ARM build assertion failure

2005-03-21 Thread Geert Bosch
On Mar 21, 2005, at 11:02, Nick Burrett wrote:
OK, but if I don't apply the patch, GNAT complains that the
alignment should be 4, not 2, and compiling ceases.
Yes, this is related to PR 17701 as Arno pointed out to me in a
private message. Indeed, the patch you used works around this
failure and can be used as a kludge. Properly disabling building
this package would be better, but there isn't a mechanism for that
yet.

However, this all is entirely unrelated to the failure you're
seeing.


Re: converting Ada to handle USE_MAPPED_LOCATION

2005-03-23 Thread Geert Bosch
On Mar 22, 2005, at 22:09, Per Bothner wrote:
Of course that's in the eye of the beholder.  I think a local
translation is cleaner and more robust/safer than a global opaque
type/call-back.
OK, let's go with that approach then.
  -Geert


Bootstrap error on powerpc-apple-darwin: stfiwx

2005-03-27 Thread Geert Bosch
%cat LAST_UPDATED
Sat Mar 26 21:31:28 EST 2005
Sun Mar 27 02:31:28 UTC 2005
stage1/xgcc -Bstage1/ -B/opt/gcc-head//powerpc-apple-darwin7.8.0/bin/ 
-c   -g -O2 -mdynamic-no-pic -DIN_GCC   -W -Wall -Wwrite-strings 
-Wstrict-prototypes -Wmissing-prototypes -pedantic -Wno-long-long 
-Wno-variadic-macros -Wold-style-definition -Werror -DHAVE_CONFIG_H  
  -I. -I. -I/Users/bosch/gcc/gcc -I/Users/bosch/gcc/gcc/. 
-I/Users/bosch/gcc/gcc/../include -I./../intl 
-I/Users/bosch/gcc/gcc/../libcpp/include -I/opt/include -I/opt/include 
/Users/bosch/gcc/gcc/c-cppbuiltin.c -o c-cppbuiltin.o
/var/tmp//ccsCOTn9.s:874:stfiwx instruction is optional for the PowerPC 
(not allowed without -force_cpusubtype_ALL option)
/var/tmp//ccsCOTn9.s:924:stfiwx instruction is optional for the PowerPC 
(not allowed without -force_cpusubtype_ALL option)
/var/tmp//ccsCOTn9.s:970:stfiwx instruction is optional for the PowerPC 
(not allowed without -force_cpusubtype_ALL option)
/var/tmp//ccsCOTn9.s:997:stfiwx instruction is optional for the PowerPC 
(not allowed without -force_cpusubtype_ALL option)
make[2]: *** [c-cppbuiltin.o] Error 1
make[1]: *** [stage2_build] Error 2
make: *** [bootstrap] Error 2

Likely offending patch:
2005-03-25  Geoffrey Keating  <[EMAIL PROTECTED]>
* config/rs6000/darwin-fallback.c: Don't include .
Use our own structure definitions.
* config/rs6000/rs6000.md (UNSPEC constants): Add UNSPEC_STFIWX.
(fix_truncdfsi2): Allow registers or memory as destination.
When TARGET_PPC_GFXOPT, generate simplified pattern.
(fix_truncdfsi2_internal): Use define_insn_and_split.
(fix_truncdfsi2_internal_gfxopt): New.
(fctiwz): Don't confuse register allocation by giving it no 
choices.
(stfiwx): New.
* config/rs6000/rs6000.h (EXTRA_CONSTRAINT): Add 'Z'.
(EXTRA_MEMORY_CONSTRAINT): Likewise.
* config/rs6000/rs6000.c (indexed_or_indirect_operand): New.
* config/rs6000/rs6000-protos.h (indexed_or_indirect_operand): 
New.



Re: RFC: #pragma optimization_level

2005-04-03 Thread Geert Bosch
On Apr 1, 2005, at 16:36, Mark Mitchell wrote:
In fact, I've long said that GCC had too many knobs.
(For example, I just had a discussion with a customer where I
explained that the various optimization passes, while theoretically
orthogonal, are not entirely orthogonal in practice, and that
turning on another pass (GCSE, in this case) avoided other bugs.
For that reason, I'm not actually convinced that all the -f options
for turning on and off passes are useful for end-users, although
they are clearly useful for debugging the compiler itself.  I think
we might have more satisfied users if we simply had -Os, -O0, ...,
-O3.  However, many people in the GCC community itself, and in
certain other vocal areas of the user base, do not agree.)
Pragmas have even more potential for causing problems than
command-line options. People are generally persuaded more easily to
change optimization options than to go through hundreds of source
files fixing pragmas.

As the average life of a piece of source code is far longer than
the life-span of a specific GCC release, users expect to compile
unchanged source code with many different compilers. For this
reason, I think it is a big mistake to allow pragmas to turn on or
off individual passes. The internal structure of the compiler
changes all the time, and pragmas written for one version may not
make sense for another version.

The effect will be that over time, user pragmas are wrong more
often than right, and the compiler will often do better by just
ignoring them altogether. (This is when people will ask for a
-fignore-source-optimization-pragmas flag.) Pressure on GCC
developers to maintain compatibility with old flags will increase
as well. This is a recipe for disaster.

I think pragmas for optimization level should have the following
properties:
  - Obvious meaning, independent of compiler "brand" and version
  - Similar or identical to widely used pragmas in other compilers
  - Only broad definitions, to prevent over-specification
  - Pragmas are only hints: the compiler may decide to honor them
    or not

Most cases where pragmas would be used profitably are caused by
deficiencies in the compiler. Future improvements will make the hints
gradually obsolete. Broad classifications such as "optimize size" or
"don't optimize" will stay useful longest. Very specific options
such as "optimize using the first scheduler pass" will be obsolete
very fast and are not meaningful across a range of compilers.

optimization control seems to be something lots of people really
want, and other compilers do offer it.  I think it can be
particularly useful to people who want to work around compiler bugs
in a particular routine, without refactoring their code, or losing
all optimization for a translation unit.

There really are two parts to this:
  1. Infrastructure in the compiler to allow for varying
     optimization levels, such as function attributes, or even
     finer granularity

  2. Syntax for specifying the options in source code
The first is even useful without the last. For example, it would be
useful for the compiler to automatically optimize for size if it is
known at compile time that a specific function will only be
executed once. As the compiler's estimates for execution frequency
improve, most people would want the compiler to use balanced
optimization, where "hot" functions are optimized for speed and
cold functions are optimized for size. Here, it would even be
useful to have granularity per basic block rather than per
function, so that different optimization is in effect for cold and
hot sections.

The syntax part is similar to the "register" keyword in C, which
might have been useful at one point, but now is mostly ignored and
only retained for compatibility. But doing anything much more
elaborate than optimization (off, size, some, all, inlining)
corresponding to (-O0, -Os, -O1, -O2, -O3) on a per-function basis
seems a bad idea.

  -Geert


Re: Inline round for IA64

2005-04-07 Thread Geert Bosch
As far as I can seem from this patch, it rounds incorrectly.
This is a problem with the library version as well, I believe.
The issue is that one cannot round a positive float to int
by adding 0.5 and truncating. (Same issue with negative values
and subtracting 0.5, of course.) This gives an error for the
predecessor of 0.5. The gap between Pred (0.5) and 0.5 is half that
between Pred (1.0) and 1.0, so the value of Pred (0.5) + 0.5 lies
exactly halfway between Pred (1.0) and 1.0. The CPU rounds this halfway
value to even, or 1.0 in this case.
So try rounding .4999444888487687421729788184165954589843750
using IEEE double on a non-x86 platform, and you'll see it gets rounded
to 1.0.
A similar problem exists with large odd integers between 2^52+1 and
2^53-1, where adding 0.5 results in a value exactly halfway between two
integers, rounding up to the nearest even integer. So, for IEEE double,
4503599627370497 would round to 4503599627370498.

These issues can be fixed by not adding/subtracting 0.5, but Pred (0.5).
As shown above, this rounds to 1.0 correctly for 0.5. For larger values
halfway between two integers, the gap with the next higher representable
number will only decrease, so the result will always be rounded up to the
next higher integer. For this technique to work, however, it is necessary
that the addition be rounded to the target precision according to IEEE
round-to-even semantics. On platforms such as x86, where GCC implicitly
widens intermediate results for IEEE double, the rounding to integer
should be performed entirely in long double mode, using the long double
predecessor of 0.5.

See ada/trans.c around line 5340 for an example of how Ada does this.
  -Geert
On Apr 7, 2005, at 05:38, Canqun Yang wrote:
Gfortran translates the Fortran 95 intrinsic DNINT to
round operation with double precision type argument
and return value. Inline round operation will speed up
the SPEC CFP2000 benchmark 189.lucas which contains
function calls of intrinsic DNINT from 706 (SPEC
ratio) to 783 on IA64 1GHz system.
I have implemented the double precison version of
inline round. If it is worth doing, I can go on to
finish the other precision mode versions.



Re: Inline round for IA64

2005-04-07 Thread Geert Bosch
On Apr 7, 2005, at 10:12, Steve Kargl wrote:
On Thu, Apr 07, 2005 at 08:08:15AM -0400, Geert Bosch wrote:
As far as I can see from this patch, it rounds incorrectly.
This is a problem with the library version as well, I believe.
Which library?
libgfortran, or whatever is used to implement NINT and DNINT.
Here's an example:
  program main
  real x, y
  x = 8388609.0
  y = 0.499701976776123046875
  print *, 'nint (', x, ') =', nint (x)
  print *, 'nint ( y ) =', nint (y), ', where y < 0.5 = ', y < 0.5
  end
output is
 nint (   8388609. ) = 8388610
 nint ( y ) =   1 , where y < 0.5 =  T


Re: Inline round for IA64

2005-04-07 Thread Geert Bosch
On Apr 7, 2005, at 13:27, Steve Kargl wrote:
Try -fdump-parse-tree.  You've given more digits in y than
its precision.  This is permitted by the standard.  It appears
the gfortran frontend is taking y = 0.49 and the closest
representable number is y = 0.5.
So, why does the test y < 0.5 yield true then?


Re: Inline round for IA64

2005-04-07 Thread Geert Bosch
On Apr 7, 2005, at 13:54, Steve Kargl wrote:
I missed that part of the output.  The exceedingly
long string of digits caught my attention.  Can
you submit a PR?
These routines should really be done as builtins, as almost all
front ends need this facility and we'd fit in with the common
frameworks for folding etc. The only reason I haven't done this
so far, is that right now there is no general language-independent
framework for declaring builtins.
Basically, every front end has to set up types and declarations
for every builtin, which is a bit of a pain, especially since there
are many floating-point formats and integer formats that we
might want to convert between.
  -Geert


Re: Stickiness of TYPE_MIN_VALUE/TYPE_MAX_VALUE

2005-05-31 Thread Geert Bosch


On May 30, 2005, at 16:50, Florian Weimer wrote:


I'll try to phrase it differently: If you access an object whose bit
pattern does not represent a value in the range given by
TYPE_MIN_VALUE .. TYPE_MAX_VALUE of the corresponding type, does this
result in erroneous execution/undefined behavior?  If not, what is the
exact behavior WRT to out-of-bounds values?



This is correct. Note that this is only valid for objects;
in expressions, intermediate values may lie outside the range
of the type.


Re: Is it possible to catch overflow in long long multiply ?

2005-06-03 Thread Geert Bosch


On May 30, 2005, at 02:57, Victor STINNER wrote:

I'm using gcc "long long" type for my calculator. I have to check
integer overflow. I'm using sign compare to check overflow, but it
doesn't work for 10^16 * 10^4 :
  1 * 1


I see your question went unanswered, however I do think it is
relevant to the development of (not with) GCC. The ongoing
discussion regarding "Ada front-end depends on signed overflow"
focuses on Ada, but the same issues are true for C and C++ as well.

I think this may be another case where GCC's increased reasoning
from the notion that signed integer overflow invokes undefined
behavior is harmful. True, your program invokes undefined behavior
according to the C standard, and as a result the behavior of the
code in presence of overflow is conforming to the C standard.

However, this case also shows how much 2s-complement arithmetic
is ingrained in our brains. Your real-world example, highlights
the fact that allowing GCC to reason from absence of overflows
breaks useful code, in effect removing the overflow detection
code, just like it removes Ada range checks.

Defaulting to -fwrapv for all compilations in order to get the
expected wraparound behavior significantly increases the quality
of the compiler in my opinion.

Propagating the assumption that overflow cannot occur, and removing
checks based on that, unnecessarily breaks code that otherwise,
with wraparound semantics, would be well-defined and correct.
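
To make the failure mode concrete, here is the classic idiom this reasoning
breaks, together with a rewrite that stays well-defined even without -fwrapv
(a sketch added for illustration, not from the original mail):

```c
#include <stdbool.h>
#include <limits.h>

/* Post-hoc check after a signed addition. Because signed overflow is
   undefined in ISO C, the optimizer may assume `sum` cannot wrap and
   delete this test entirely (unless compiled with -fwrapv). */
bool add_overflows_fragile(int a, int b)
{
    int sum = a + b;                 /* undefined behavior on overflow */
    return (b > 0 && sum < a) || (b < 0 && sum > a);
}

/* Portable rewrite: unsigned arithmetic wraps by definition, so the
   check itself never executes undefined behavior. Overflow occurred
   iff a and b have the same sign and the sum's sign differs. The
   unsigned-to-int conversion is implementation-defined, and wraps on
   GCC. */
bool add_overflows_safe(int a, int b)
{
    unsigned s = (unsigned)a + (unsigned)b;
    return ((a ^ (int)s) & (b ^ (int)s)) < 0;
}
```

The fragile form is exactly the kind of overflow detection that reasoning from
undefined behavior removes, just like the Ada range checks mentioned above.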

  -Geert


Re: Ada front-end depends on signed overflow

2005-06-03 Thread Geert Bosch


On Jun 3, 2005, at 09:02, Florian Weimer wrote:

It probably makes sense to turn on -fwrapv for Ada because even
without -gnato, the behavior is not really undefined:

| The reason that we distinguish overflow checking from other kinds of
| range constraint checking is that a failure of an overflow check can
| generate an incorrect value, but cannot cause erroneous behavior.




(Without -fwrapv, integer overflow is undefined, and subsequent range
checks can be optimized away, so that it might cause erroneous
behavior.)



This is the strongest argument I have seen so far for defaulting
to either -ftrapv or -fwrapv.

Both the example Victor Stinner sent, and range checking in Ada
are cases of reasonable code where reasoning from undefined
behavior breaks code in unexpected ways. Essentially a class of
programs that checks for errors and would not otherwise be able
to cause a SIGSEGV or similar with previous versions of GCC,
would now be able to cause arbitrary mayhem.

Integer expressions such as X + Y - Z will give the correct
mathematical result, as long as that result is a representable
integer. Without wrap-around semantics however, one would have
to prove no intermediate result may overflow. Also, addition
and subtraction are no longer associative.

So, from a quality-of-implementation point of view, I believe
we should always default to -fwrapv.

For Ada, I propose we make the following changes:
  - by default, enable overflow checks using -ftrapv
(yay, we should be able to get rid of -gnato finally!)
  - with checks suppressed, use -fwrapv.



Re: [PR22319] Ada broken with ICE in tree-ssa-structalias...

2005-07-06 Thread Geert Bosch

This is http://gcc.gnu.org/PR22319.

On Jul 6, 2005, at 06:17, Andreas Schwab wrote:

Andreas Jaeger <[EMAIL PROTECTED]> writes:


Building ada with the patch for flag_wrapv fails now with a new error:

+===GNAT BUG DETECTED==+
| 4.1.0 20050706 (experimental) (x86_64-suse-linux-gnu) GCC error: |
| tree check: expected integer_cst, have cond_expr in              |
| do_structure_copy, at tree-ssa-structalias.c:2410                |




Also on ia64 (without -fwrapv).

Andreas.

--
Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."





Re: Excess precision problem on IA-64

2005-10-27 Thread Geert Bosch


On Oct 27, 2005, at 14:12, Eric Botcazou wrote:
I'm under the impression that it's worse on IA-64 because of the
"infinite precision", but I might be wrong.


Fused multiply-add always uses "infinite precision" in the intermediate
result. Only a single rounding is performed at the end. We really should
have a way to express situations where operations may or may not be
contracted, instead of having to do this on a whole-compilation level.

Many computations, especially those involving higher-precision
intermediate results, are far easier to implement with explicit control
over cases where contractions are allowed. Such techniques are used a
lot for implementation of correctly rounded elementary functions.

Currently the only way we can get that kind of precision, while being
able to have the code inlined, is to use assembler insertions, that
unnecessarily make the code system dependent.

  -Geert


Re: Excess precision problem on IA-64

2005-10-27 Thread Geert Bosch

On Oct 27, 2005, at 17:19, Andreas Schwab wrote:

I think this is what the FP_CONTRACT pragma is supposed to provide.


Yes, but it seems there is no way in the middle end / back end to
express this. Or am I just hopelessly behind the times? :)

  -Geert


Re: Excess precision problem on IA-64

2005-10-27 Thread Geert Bosch


On Oct 27, 2005, at 17:25, Steve Ellcey wrote:
It would be easy enough to add an option that turned off the use of
the fused multiply and add in GCC, but I would hate to see its use
turned off by default.


Code that cares should be able to express barriers across which no
contraction is possible. Especially with templates or inlining
across compilation units, global flags become too limited.

Many useful primitives can only be implemented with exact knowledge
of where rounding happens. With arbitrary contractions, numerical
analysis becomes infeasible and some essential algorithms break
down completely.

For example, you may have code that evaluates some polynomials and
other straightforward computations involving approximated inputs.
This code may benefit from contractions due to fused multiply-add
both with regards to speed and accuracy. However, then you might
want to round computed values to integer, rounding halfway
values away from zero. If the machine has no primitive, this may
be implemented with truncation and addition of a special constant.
But, when such addition gets combined with other operations,
eliminating the rounding operation, the final result can be incorrect,
causing numbers to be rounded the wrong way.

There are two ways around this: for code targeted to a specific
processor, use inline assembly for the code relying on precise
rounding. The other is: compile all code with conservative
optimizations. For GCC, we are interested in writing and compiling
code that works across a wide range of targets. The two workarounds
mentioned above either needlessly limit the performance (by using
conservative optimization everywhere), or portability (by relying
on inline assembly).

  -Geert



Re: arm-rtems Ada Aligned_Word compilation error

2005-11-15 Thread Geert Bosch

On Nov 14, 2005, at 19:59, Jim Wilson wrote:

Joel Sherrill <[EMAIL PROTECTED]> wrote:

s-auxdec.ads:286:13: alignment for "Aligned_Word" must be at least 4
Any ideas?


I'm guessing this is because ARM sets STRUCTURE_SIZE_BOUNDARY to 32
instead of 8, and this confuses the Ada front end.


Note that this package is provided only for the purpose of allowing a
(non-standard) mode that extends the System package with certain DEC Ada
extensions. Since this target cannot implement the required alignment
anyway, I would recommend configuring things so this package doesn't get
built on that target.


  -Geert


Re: arm-rtems Ada Aligned_Word compilation error

2005-11-15 Thread Geert Bosch


On Nov 15, 2005, at 18:11, Laurent GUERBY wrote:
What about moving s-auxdec from ada/Makefile.rtl GNATRTL_NONTASKING_OBJS
into EXTRA_GNATRTL_NONTASKING_OBJS so it can be set for VMS targets only
in ada/Makefile.in?


This is not ideal, because some people are migrating from DEC Ada
to GNAT on other platforms, and benefit from the extension.

  -Geert


Re: Link-time optimzation

2005-11-17 Thread Geert Bosch

On Nov 17, 2005, at 21:33, Dale Johannesen wrote:

When I arrived at Apple around 5 years ago, I was told of some recent
measurements that showed the assembler took around 5% of the time.
Don't know if that's still accurate.  Of course the speed of the
assembler is also relevant, and our stubs and lazy pointers probably
mean Apple's .s files are bigger than other people's.


Of course, there is a reason why almost any commercial compiler writes
object files directly. If you start feeding serious GCC output through
IAS (the Intel assembler) on a platform like IA64, you'll find that this
really doesn't work. A file that takes seconds to compile can take over
an hour to assemble.

GCC tries to write out assembly in a way that is unambiguous, so
the exact instructions being used are known. Any platform with a
"smart optimizing" assembler will run into all kinds of issues.
(Think MIPS.) Many assembler features, such as decimal floating-point
number conversion, are so poorly implemented that they should be avoided
at all cost. Some assemblers like to do their own instruction splitting,
NOP insertion and dependency detection, completely throwing off choices
made by the compiler's scheduler. Then there is alignment of code labels.

If there even is the slightest doubt about what exact instruction
encoding the assembler will use, all bets are off here too.

If you'd start from scratch and want to get everything exactly right,
it seems clear that the assembly output path is far harder to implement
than writing object code directly. When you know exactly what bits you
want, just go and write them. However, given that all ports are
implemented based on assembly output, and that many users depend on
assembly output being available, changing GCC's ways will be
very labor intensive unfortunately.

  -Geert


Re: Clarifying attribute-const

2015-10-01 Thread Geert Bosch

> On Oct 1, 2015, at 11:34 AM, Alexander Monakov  wrote:
> 
> Can you expand on the "etc." a bit, i.e., may the compiler ...
> 
>  - move a call to a "const" function above a conditional branch, 
>causing a conditional throw to happen unconditionally?
No, calls may only be omitted, not moved.
> 
>  - move a call to a "const" function below a conditional branch, 
>causing an unconditional throw to happen only conditionally?
No, calls may only be omitted, not moved.
> 
>  - reorder calls to "const" functions  w.r.t. code with side effects, or
>other throwing functions?
A call to a pure function (Ada's version of "const") may be omitted if
its result is not used, or if the results of an earlier call with the
same argument values (including referenced values) can be used. This is
allowed regardless of whether the original function had any side
effects. Note that if a function raised an exception (threw), the call
can only be replaced with throwing that exception.

So, reordering is not allowed, but omitting is, in the context of Ada.

  -Geert



Re: FloatingPointMath and transformations

2014-06-02 Thread Geert Bosch

On Jun 2, 2014, at 10:06 AM, Vincent Lefevre  wrote:

> I've looked at
> 
>  https://gcc.gnu.org/wiki/FloatingPointMath
> 
> and there may be some mistakes or missing info.

That’s quite possible. I created the page many years ago, based on my
understanding of GCC at that time. 
> 
> First, it is said that x / C is replaced by x * (1.0 / C) when C is
> a power of two. But this condition is not sufficient: if 1.0 / C
> overflows, the transformation is incorrect. From some testing,
> it seems that GCC detects the overflow case, so that it behaves
> correctly. In this case I think that the wiki should say:
> "When C is a power of two and 1.0 / C doesn't overflow."
Yes, that was implied, but should indeed be made explicit.
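
The power-of-two condition can be checked directly: when 1.0/C is exact
(no overflow, not flushed to zero), both forms perform one identical,
correctly rounded operation for every x. A quick sketch, not part of the
wiki page itself:

```c
/* For C a power of two with an exact reciprocal, division by C and
   multiplication by 1.0/C are the same single rounding. */
static int div_equals_mul(double x, double c)
{
    return x / c == x * (1.0 / c);
}
```

For a non-power-of-two C (say 3.0), the reciprocal is itself rounded, and the
two forms can disagree in the last bit, which is why the transformation is
restricted to powers of two.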
> 
> It is also said that x / 1.0 and x / -1.0 are respectively replaced
> by x and -x. But what about x * 1.0 and x * -1.0?
> 
> Ditto with -(a / b) -> a / -b and -(a / b) -> -a / b. Is there
> anything similar with multiplication?

It should, or it would be a bug. Please feel free to add/correct anything on 
this page.

  -Geert

Re: GCC version bikeshedding

2014-07-20 Thread Geert Bosch

On Jul 20, 2014, at 5:55 PM, Jakub Jelinek  wrote:

> So, what versioning scheme have we actually agreed on, before I change it in
> wwwdocs?  Is that
> 5.0.0 in ~ April 2015, 5.0.1 in ~ June-July 2015 and 5.1.0 in ~ April 2016,
> or
> 5.0 in ~ April 2015, 5.1 in ~ June-July 2015 and 6.0 in ~ April 2016?
> The only thing I understood was that we don't want 4.10, but for the rest
> various people expressed different preferences and then it was presented as
> agreement on 5.0, which applies to both of the above.

Can we use the switch to 5.0, a supposedly stable C++11 ABI etc,
also as an excuse to finally configure for --with-sse2 by default 
for 32-bit x86? Maybe then we can finally retire PR 323 and its 
dozens of duplicates...

  -Geert


Re: C as intermediate language, signed integer overflow and -ftrapv

2014-07-30 Thread Geert Bosch

On Jul 23, 2014, at 10:56 AM, Thomas Mertes  wrote:

> One such feature is the detection of signed integer overflow. It is
> not hard, to detect signed integer overflow with a generated C
> program, but the performance is certainly not optimal. Signed integer
> overflow is undefined behavior in C and the access to some overflow
> flag of the CPU is machine dependent.

Actually, doing proper signed integer overflow checking in a front end 
can be surprisingly cheap.

I have some experience with this for the Ada front end, and found the
following:
  - In many cases it may be cheapest to widen computations to avoid
overflow, and/or check it less frequently.
  - Even if you need to check, often one side is known to be constant,
in which case a simple comparison of the input argument is sufficient
  - In other cases, the sign of one of the operands is known,
simplifying the check
  - Conversions from signed to unsigned types are essentially free and
well-defined, so do the overflow check using unsigned types, but use
signed integer operations for the actual computation
  - By using a simple comparison to jump to a no_return function, GCC
knows the condition is expected to be false and will optimize accordingly

Note that in the second case above, the extra conditional (which will
almost always be correctly predicted by the CPU and often is free) will,
combined with the conditional transfer of control to a no_return
routine, in effect provide range information to the compiler, allowing
the elimination of redundant checks etc. The positive effects of
expanding checks to optimizable C-like constructs are far larger than
the eventual instruction selection. We found the cost of overflow
checking, even without "jo" instructions being generated, to be
generally on the order of 1-2% in execution speed, and a bit more in
growth of executable size (in our case around 10%, due to generating
exceptions with location information).

If you make overflow checking "special" early by resorting to specific
builtins, -ftrapv or similar, you'll lose out in the general-purpose
optimization passes and in my experience will get far worse code. If
your language semantics are "provide the numerically correct answer (as
if computed with unbounded range) or raise an exception", you can
probably do better by using wider types and smart expansions to avoid
overflow while retaining C-level intermediate code.

Anyway, the Ada front end is proof that efficient overflow checking is possible
without any special support in the back end.
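
A C sketch of the pattern described above: a single comparison when one
operand is a known constant, and a branch to a noreturn handler so the
compiler treats the failure path as cold and learns range information on the
fallthrough. Function names are illustrative:

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Cold failure path: GCC treats branches to a noreturn function as
   unlikely, and on the fallthrough edge knows the check passed. */
_Noreturn static void overflow_error(void)
{
    fputs("overflow\n", stderr);
    abort();
}

/* Constant operand: one compare on the input argument suffices. */
static int add_100_checked(int x)
{
    if (x > INT_MAX - 100)
        overflow_error();
    return x + 100;    /* provably free of overflow on this path */
}

/* General case: test in unsigned arithmetic (defined wraparound),
   compute in signed arithmetic. */
static int add_checked(int a, int b)
{
    unsigned s = (unsigned)a + (unsigned)b;
    if (((a ^ (int)s) & (b ^ (int)s)) < 0)   /* signs agree, sum flips */
        overflow_error();
    return a + b;
}
```

Because the checks are ordinary comparisons and branches, they participate in
value-range propagation and redundancy elimination just like user-written
code, which is the point being made above.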

  -Geert

Re: should sync builtins be full optimization barriers?

2011-09-09 Thread Geert Bosch

On Sep 9, 2011, at 04:17, Jakub Jelinek wrote:

> I'd say they should be optimization barriers too (and at the tree level
> they I think work that way, being represented as function calls), so if
> they don't act as memory barriers in RTL, the *.md patterns should be
> fixed.  The only exception should be IMHO the __SYNC_MEM_RELAXED
> variants - if the CPU can reorder memory accesses across them at will,
> why shouldn't the compiler be able to do the same as well?

They are different concepts. If a program runs on a single processor,
all memory operations will appear to be sequentially consistent, even if
the CPU reorders them at the hardware level. However, compiler
optimizations can still cause multiple threads to see the accesses
as not sequentially consistent.

For example, for atomic objects accessed only from a single processor
(but possibly multiple threads), you'd not want the compiler to reorder
memory accesses to global variables across the atomic operations, but
you wouldn't have to emit the expensive fences.

For the C++0x atomic types there are:

void A::store(C desired, memory_order order = memory_order_seq_cst) volatile;
void A::store(C desired, memory_order order = memory_order_seq_cst);

where the first variant (with order = memory_order_relaxed) 
would allow fences to be omitted, while still preventing the compiler from
reordering memory accesses, IIUC.

To be honest, I can't quite see the use of completely unordered
atomic operations, where we not even prohibit compiler optimizations.
It would seem if we guarantee that a variable will not be accessed
concurrently from any other thread, we wouldn't need the operation
to be atomic in the first place. That said, it's quite likely I'm 
missing something here. 

For Ada, all atomic accesses are always memory_order_seq_cst, and we
just care about being able to optimize accesses if we know they'll be
done from the same processor. For the C++11 model, thinking about
the semantics of any memory orders other than memory_order_seq_cst
and their interaction with operations with different ordering semantics
makes my head hurt.

Regards,
  -Geert


Re: should sync builtins be full optimization barriers?

2011-09-11 Thread Geert Bosch

On Sep 11, 2011, at 10:12, Andrew MacLeod wrote:

>> To be honest, I can't quite see the use of completely unordered
>> atomic operations, where we not even prohibit compiler optimizations.
>> It would seem if we guarantee that a variable will not be accessed
>> concurrently from any other thread, we wouldn't need the operation
>> to be atomic in the first place. That said, it's quite likely I'm
>> missing something here.
>> 
> there is no guarantee it isnt being accessed concurrently,  we are only 
> guaranteeing that if it is accessed from another thread, it wont be a 
> partially written value...  if you read a 64 bit value on a 32 bit machine, 
> you need to guarantee that both halves are fully written before any read can 
> happen. Thats the bare minimum guarantee of an atomic.

OK, I now see (in §1.10(5) of the n3225 draft) that “relaxed” atomic operations 
are not synchronization operations even though, like synchronization 
operations, they cannot contribute to data races. 

However the next paragraph says: 
All modifications to a particular atomic object M occur in some particular 
total order, called the modification order of M. [...] There is a separate 
order for each atomic object. There is no requirement that these can be 
combined into a single total order for all objects. In general this will be 
impossible since different threads may observe modifications to different 
objects in inconsistent orders.

So, if I understand correctly, then operations using relaxed memory order will 
still need fences, but indeed do not require any optimization barrier. For 
memory_order_seq_cst we'll need a full barrier, and for the others there is a 
partial barrier.

Also, for relaxed order atomic operations we would only need a single fence 
between two accesses (by a thread) to the same atomic object. 
> 
>> For Ada, all atomic accesses are always memory_order_seq_cst, and we
>> just care about being able to optimize accesses if we know they'll be
>> done from the same processor. For the C++11 model, thinking about
>> the semantics of any memory orders other than memory_order_seq_cst
>> and their interaction with operations with different ordering semantics
>> makes my head hurt.
> I had many headaches over a long period wrapping my head around it, but 
> ultimately it maps pretty closely to various hardware implementations. Best 
> bet?  just use seq-cst until you discover you have a  performance problem!!  
> I expect thats why its the default :-)

We've already discovered that. Atomic types are used quite a bit in Ada code. 
Unfortunately, many of the uses are just for accesses to memory-mapped I/O 
devices, single write. On many systems I/O locations can't be used for 
synchronization anyway, and only regular cacheable memory can be used for that.

For such operations you don't want the compiler to reorder accesses to 
different I/O locations, but mutual exclusion wrt. other threads is already 
taken care of. It seems this is precisely the opposite from what the relaxed 
memory order provides.

Regards,
  -Geert


Re: should sync builtins be full optimization barriers?

2011-09-11 Thread Geert Bosch

On Sep 11, 2011, at 15:11, Jakub Jelinek wrote:

> On Sun, Sep 11, 2011 at 03:00:11PM -0400, Geert Bosch wrote:
>> Also, for relaxed order atomic operations we would only need a single
>> fence between two accesses (by a thread) to the same atomic object.
> 
> I'm not aware of any CPUs that would need any kind of fences for that.
> Nor the compiler should need any fences for that, MEMs that may (or even are
> known to be aliased) aren't reordered.

I guess for CPUs with TSO that might be right wrt. the hardware.
I wouldn't say it is true in general.
But all atomic operations on an atomic object M should have
a total order. That means the compiler cannot reorder accesses
to M either.

So for some atomic int X, with relaxed ordering:

  if (X == 0) X = 1;
  else X = 2;

we can't optimize that to:

 X = 1;
 if (X != 0) X = 2;

Do you agree?
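
As a C11 sketch (reading the transformation as caching the loaded value, so
that the rewrite is at least sequentially equivalent): the rewritten form
performs a store of 1 that the source never performs when X is nonzero,
adding an entry to X's modification order that another thread is entitled to
observe.

```c
#include <stdatomic.h>

static atomic_int X;

/* As written: exactly one store to X per call. */
static void original(void)
{
    if (atomic_load_explicit(&X, memory_order_relaxed) == 0)
        atomic_store_explicit(&X, 1, memory_order_relaxed);
    else
        atomic_store_explicit(&X, 2, memory_order_relaxed);
}

/* Candidate "optimization": same final value in a single thread, but
   when X != 0 it writes a transient 1 into X's modification order,
   which concurrent relaxed loads may see. Invalid for an atomic
   object even under relaxed ordering. */
static void rewritten(void)
{
    int tmp = atomic_load_explicit(&X, memory_order_relaxed);
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    if (tmp != 0)
        atomic_store_explicit(&X, 2, memory_order_relaxed);
}
```

Single-threaded the two agree, which is exactly why only the per-object
modification-order requirement rules the rewrite out.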

-Geert


Re: should sync builtins be full optimization barriers?

2011-09-12 Thread Geert Bosch

On Sep 12, 2011, at 03:02, Paolo Bonzini wrote:

> On 09/11/2011 09:00 PM, Geert Bosch wrote:
>> So, if I understand correctly, then operations using relaxed memory
>> order will still need fences, but indeed do not require any
>> optimization barrier. For memory_order_seq_cst we'll need a full
>> barrier, and for the others there is a partial barrier.
> 
> If you do not need an optimization barrier, you do not need a processor 
> barrier either, and vice versa.  Optimizations are just another factor that 
> can lead to reordered loads and stores.

Assuming that statement is true, that would imply that even for relaxed 
ordering there has to be an optimization barrier. Clearly fences need to be 
used for any atomic accesses, including those with relaxed memory order.

Consider 4 threads and an atomic int x:

  thread 1    thread 2    thread 3    thread 4
  --------    --------    --------    --------
  x=1;        r1=x;       x=3;        r3=x;
  x=2;        r2=x;       x=4;        r4=x;

Even with relaxed memory ordering, all modifications to x have to occur
in some particular total order, called the modification order of x.

So, even if each thread preserves its store order, the modification order of x 
can be any of:
  1,2,3,4
  1,3,2,4
  1,3,4,2
  3,1,2,4
  3,1,4,2
  3,4,1,2

Because there is a single modification order for x, it would be an error for 
thread 2 and thread 4 to see a different update order.

So, if r1==2,r2==3 and r3==4,r4==1, that would be an error. However, without 
fences, this can easily happen on an SMP machine, even one with a nice memory 
model such as the x86.

IIUC, the relaxed memory model mostly seems to allow movement (by compiler and 
CPU) of unrelated memory operations, but still requires fences between 
subsequent atomic operations on the same object. 

In other words, while atomic operations with relaxed memory order on some 
atomic object X cannot be used to synchronize any operations on objects other 
than X, they themselves cannot cause data races.

  -Geert


Re: should sync builtins be full optimization barriers?

2011-09-12 Thread Geert Bosch

On Sep 12, 2011, at 19:19, Andrew MacLeod wrote:

> Lets simplify it slightly.  The compiler can optimize away x=1 and x=3 as 
> dead stores (even valid on atomics!), leaving us with 2 modification orders..
> 2,4 or 4,2
> and what you are getting at is you don't think we should ever see
> r1==2, r2==4  and r3==4, r4==2
Right, I agree that the compiler can optimize away both the
double writes and double reads.
> 
> lets say the order of the writes turns out to be  2,4...  is it possible for 
> both writes to be travelling around some bus and have thread 4 actually read 
> the second one first, followed by the first one?   It would imply a lack of 
> memory coherency in the system wouldn't it? My simple understanding is that 
> the hardware gives us this sort of minimum guarantee on all shared memory. 
> which means we should never see that happen.

No, it is possible, and actually likely. Basically, the issue is write buffers. 
The coherency mechanisms come into play at a lower level in the hierarchy 
(typically at the last-level cache), which is why we need fences to start with 
to implement things like spin locks.

Threads running on the same CPU may share the same caches (think about a
thread switch or hyper-threading). Now both processors may have a copy of the
same cache line and both try to do a write to some location in that line. Then 
they'll both try to get exclusive access to the cache line. One CPU will 
succeed, the other will have a cache miss.

However, while all this is going on, the write is just sitting in a write 
buffer, and any references from the same processor will just get forwarded the 
value of the outstanding write. 

> And if we can't see that, then I don't see how we can see your example..  
> *one* of those modification orders has to be what is actually written to x, 
> and reads from that memory location will not be able to see an something 
> else. (ie, if it was 1,2,3,4  then thread 4 would not be able to see 
> r3==4,r4==1 thanks to memory coherency.

No, that's false. Even on systems with nice memory models, such as x86 and
SPARC with a TSO model, you need a fence to ensure that a write-load of the
same location makes it all the way to coherent memory and is not forwarded
directly from the write buffer or L1 cache. The reason that fences are
expensive is exactly that they require system-wide agreement.

 -Geert


Re: should sync builtins be full optimization barriers?

2011-09-13 Thread Geert Bosch

On Sep 13, 2011, at 08:08, Andrew MacLeod wrote:

> On 09/12/2011 09:52 PM, Geert Bosch wrote:
>> No that's false. Even on systems with nice memory models, such as x86 and 
>> SPARC with a TSO model, you need a fence to avoid that a write-load of the 
>> same location is forced to
Note that here with write-load I meant a write instruction *and* a subsequent 
load instruction.
>>  make it all the way to coherent memory and not forwarded directly from the 
>> write buffer or L1 cache. The reasons that fences are expensive is exactly 
>> that it requires system-wide agreement.
> 
> On x86, all the atomic operations are prefixed with LOCK, which is supposed
> to grant them exclusive use of shared memory. Ken's comments would appear to
> indicate that imposes a total order across all processors.
Yes, that's right. All atomic read-modify-write operations have an implicit 
full barrier on x86 and on SPARC. However, my example was about regular stores 
and loads from an atomic int using the C++ relaxed memory model. Indeed, just 
using XCHG (or SWAP on SPARC) instructions for writes and regular loads for 
reads is sufficient to establish a total order.

These are expensive synchronizing instructions though, with full barrier 
semantics. For the relaxed memory model, the compiler would be able to optimize 
away redundant loads and stores, as you indicated before.

> I presume other architectures have similar mechanisms if they support atomic 
> operations.  You have to have *some* way of having 2 threads which 
> simultaneously perform read/modify/write atomic instructions work properly...
Yes, read-modify-write instructions also function as full barrier.
> 
> Assume x=0, and 2 threads both execute a single atomic increment operation:
>  { read x, add 1, write result back to x }
> When both threads have finished, the result *has* to be x == 2.  So the 2 
> threads must be able to see some sort of coherent value for x.
Indeed. The trouble is with regular reads and writes.
> 
> If coherency is provided for read/modify/write, it should also be available 
> for read or write as well...


No, unless you replace writes by read-modify-write instructions, or you insert 
additional fences. Regular writes are buffered, and initially only visible to 
the processor itself. The reason regular writes to memory are so fast is that 
the processor doesn't have to wait for the write to percolate down the memory 
hierarchy, but can continue processing using *its* last written value.

  -Geert


Re: weird optimization in sin+cos, x86 backend

2012-02-05 Thread Geert Bosch

On Feb 5, 2012, at 11:08, James Courtier-Dutton wrote:

> But, r should be
> 5.26300791462049950360708478127784... or
> -1.020177392559086973318201985281...
> according to wolfram alpha and most arbitrary maths libs I tried.
> 
> I need to do a bit more digging, but this might point to a bug in the
> cpu instruction FPREM1

No, this is all as expected. The instructions are documented to use
a 66-bit approximation of Pi (really 64 bits, but the next two
happen to be 0).

Ada requires a relative error less than 2 eps for arguments in the
range - 2.0**32 .. 2.0**32, for a binary floating point type
with 64 bits of mantissa. So the GCC Ada run-time library uses 
a 150-bit or so approximation to ensure accurate argument reduction
over the required range.

Even with an approximation of Pi that is not precise enough to
guarantee a small relative error of the result, there is still
value in consistent argument reduction. For example, a point
(Sin (X), Cos (X)) should always be close to the unit circle,
regardless of the magnitude of X.

  -Geert


Re: Anyone else run ACATS on ARM?

2009-08-26 Thread Geert Bosch


On Aug 12, 2009, at 10:32, Joel Sherrill wrote:


Hi,

GNAT doesn't build for arm-rtems on 4.4.x or
SVN (PR40775).  I went back to 4.3.x since I
remembered it building.
I have run the ACATS on an ep7312 target and
get a number of generic test failures that
don't look RTEMS specific.  Has anyone run
ACATS on arm?


Yes, we ported it to ARM/Nucleus OS, and it required some fixes to
prologue generation. The patches we submitted to the mailing list
(and then pinged) were ignored. I'm sure this is in
the archives somewhere.

  -Geert


Re: Compiling the GNU ada compiler on a very old platform

2009-08-26 Thread Geert Bosch


On Aug 21, 2009, at 18:40, Paul Smedley wrote:


Hi All,

I'm wanting to update the GNU ADA compiler for OS/2... I'm currently
building GCC 4.3.x and 4.4.x on OS/2 (C/C++/fortran) but for ADA
configure complains about not finding gnat.  The problem is that the
only gnat compiled for OS/2 was years ago using a different toolchain
so it's not suitable.


I used to maintain the OS/2 port for AdaCore, but that was many years ago.

IBM released its last version of OS/2 in 2001. Currently it is almost
impossible to run OS/2 on either real modern hardware or a virtualized
system.

AFAIK, GNAT 3.15p is the last GNAT version with OS/2 support. As the OS/2
version was a full implementation of Ada 95, including all annexes, passing
all ACATS tests, this version should still be very useful today, if you
have a system running OS/2, that is. If you're interested in developing
Ada applications on OS/2, your best bet is to use GNAT 3.15p.
You'll get a mature, well-tested and very fast compiler.

This version might still be able to bootstrap GNAT.

  -Geert


Re: [ada] help debugging acats failure

2009-09-03 Thread Geert Bosch

If you pass -v to gnatmake, it will output the gcc invocations.
This should be sufficient to find the problem.

Basically, just go to the directory containing c35502i.adb, and
execute the gnatmake command as listed below, with -v added in.
If you only have the 35502i.ada file available, use "gnatchop 35502i.ada"
to get the various units split out into their own files.
You might need to specifically include the "support" directory,
which appears to be /home/rth/work/gcc/bld-sjlj/gcc/testsuite/ada/acats0
from your report.

Hope this helps.

  -Geert

On Sep 3, 2009, at 19:00, Richard Henderson wrote:


Can someone tell me how to debug this:

splitting /home/rth/work/gcc/bld-sjlj/gcc/testsuite/ada/acats0/tests/c3/c35502i.ada into:

  c35502i.adb
BUILD c35502i.adb
gnatmake --GCC="/home/rth/work/gcc/bld-sjlj/gcc/xgcc -B/home/rth/work/gcc/bld-sjlj/gcc/" -gnatws -O2 -I/home/rth/work/gcc/bld-sjlj/gcc/testsuite/ada/acats0/support c35502i.adb -largs --GCC="/home/rth/work/gcc/bld-sjlj/gcc/xgcc -B/home/rth/work/gcc/bld-sjlj/gcc/"
/home/rth/work/gcc/bld-sjlj/gcc/xgcc -c -B/home/rth/work/gcc/bld-sjlj/gcc/ -gnatws -O2 -I/home/rth/work/gcc/bld-sjlj/gcc/testsuite/ada/acats0/support c35502i.adb
gnatbind -I/home/rth/work/gcc/bld-sjlj/gcc/testsuite/ada/acats0/support -x c35502i.ali
gnatlink c35502i.ali --GCC=/home/rth/work/gcc/bld-sjlj/gcc/xgcc -B/home/rth/work/gcc/bld-sjlj/gcc/

./c35502i.o: In function `_ada_c35502i':
c35502i.adb:(.text+0x156): undefined reference to `.L47'
collect2: ld returned 1 exit status
gnatlink: error when calling /home/rth/work/gcc/bld-sjlj/gcc/xgcc
gnatmake: *** link failed.
FAIL:   c35502i


I haven't been able to figure out what command to issue from the  
command line to reproduce this.  Cut and paste from the dejagnu log  
doesn't work, which is more than annoying...



r~




Re: GCC 4.5 Status Report (2009-09-19)

2009-09-19 Thread Geert Bosch


On Sep 19, 2009, at 18:02, Steven Bosscher wrote:

* GDB test suite should pass with -O1


Apparently, the current GDB test suite can only work at -O0,
because code reorganization messes up the scripting.

  -Geert


Re: Worth balancing the tree before scheduling?

2009-11-23 Thread Geert Bosch

On Nov 23, 2009, at 10:17, Ian Bolton wrote:

> Regardless of the architecture, I can't see how an unbalanced tree would
> ever be a good thing.  With a balanced tree, you can still choose to
> process it in either direction (broad versus deep) - whichever is better
> for your architecture - but, as far as I can see (bearing in mind that
> I'm very new to GCC development!), a tall lop-sided tree gives few
> scheduling options due to all the extra dependencies.  I guess I must
> be missing something?

Yes, a lop-sided tree often needs fewer registers.
For example, (((a+b)+c)+d)+e needs only 2 registers.
Any more balanced tree would need at least one more.

  -Geert


Re: Change x86 default arch for 4.5?

2010-02-21 Thread Geert Bosch

On Feb 21, 2010, at 06:18, Steven Bosscher wrote:
> My point: gcc may fail to attract users (and/or may be losing users)
> when it tries to tailor to the needs of minorities.
> 
> IMHO it would be much more reasonable to change the defaults to
> generate code that can run on, say, 95% of the computers still in use.
> If a user want to use the latest-and-greatest gcc for a really old
> machine, the burden of adding extra flags to change the default
> behavior of the compiler should be on that user.
> 
> In this case of the i386 back end, that probably means changing the
> default to something like pentium3.

The biggest change we need to make for x86 is to enable SSE2,
so we can get proper rounding behavior for float and double,
as well as significant performance increases.

  -Geert


Re: Change x86 default arch for 4.5?

2010-02-21 Thread Geert Bosch

On Feb 21, 2010, at 09:58, Joseph S. Myers wrote:
> On Sun, 21 Feb 2010, Richard Guenther wrote:
>>> The biggest change we need to make for x86 is to enable SSE2,
>>> so we can get proper rounding behavior for float and double,
>>> as well as significant performance increases.
>> 
>> I think Joseph fixed the rounding behavior for 4.5.  Also without an adjusted
> 
> Well, I provided the option for rounding that is predictable and in 
> accordance with C99 when using the default -mfpmath=387.  But that option 
> does carry the performance cost of storing to / loading from memory at 
> various points, as required to get the rounding on 387 (and there are 
> still cases where excess precision means double rounding).

This clearly is worse in all areas than using SSE2: there STILL is
double rounding, and performance goes down the drain. On any recent
machine there is the SSE2 hardware to do the operations correctly
without going through memory and without double rounding.

>> ABI you'd introduce x87 <-> SSE register moves which are not helpful
>> for performance.
> 
> As I understand it, whether -mfpmath=387 (with excess precision) or 
> -mfpmath=sse is the default is also considered part of the platform API 
> (like whether char is signed or unsigned by default, for example), in 
> addition to the ABI issues that can slow things down when SSE is used.

No, this is not a new ABI. The ABI stays exactly the same. The I of
ABI stands for Interface. That is, the ABI has nothing to say about
how a function will compute any results. That is the area of language
standards. The only way we'd violate the ABI is the reliance on SSE2
instructions being available and SSE2 registers being saved by the OS.

However, since every other compiler uses SSE2 instructions by default,
I don't see why GCC should be any different. If anything, since the
GCC target audience is more focused on Free and open source software,
we could be more aggressive in taking advantage of newer hardware.
What about an autoconf test for the availability of 486 atomic instructions
and SSE2 instructions, and choosing the default target based
on the host? Not too crazy, is it?

> If people really want a new 32-bit x86 ABI I'd suggest making it an ILP32 
> ABI for processors running in 64-bit mode, so 64-bit registers are 
> available without the additional memory cost of 64-bit pointers for code 
> not needing them - you could also assume a minimum of -march=x86-64, which 
> implies SSE2.  But if there were significant demand for such an ABI I 
> think we'd have seen it by now, and you probably run into the various 
> syscall interface problems that MIPS n32 has had.

Let's keep that can of worms closed, has nothing to do with changing
our -march defaults.

  -Geert

PS. If anyone has SPEC-like figures for performance difference between -msse2
-mfpmath=sse,x87  and default, I'd be interested in seeing those results.


Re: Change x86 default arch for 4.5?

2010-02-21 Thread Geert Bosch

On Feb 21, 2010, at 17:42, Erik Trulsson wrote:
> Newer compilers usually have better generic optimizations that are not
> CPU-dependent.  Newer compilers also typically have improved support
> for new language-features (and new languages for that matter.)

This is exactly where CPU dependence comes into play. I'd like
for example take advantage of IEEE floating point semantics in
our Ada compiler. Fifteen years ago, we had to face the fact that
some systems were not compliant and therefore we developed
least-common-denominator libraries that catered to all.

However, nowadays even OpenVMS supports IEEE floating point.
I'd like to get rid of the hacks in GNAT that avoid relying on strict
IEEE compliance. Currently we end up doing some conversions
and computations in the widest floating-point type to avoid
double rounding.

So, I'd like to be able to put all this behind me, and only
have to think about fully IEEE compliant systems. Similarly,
there is no way to fully implement C99 Annex F without using
SSE2.

As a last point, because not many GCC users take advantage
of the hardware their systems have, there is less focus
and feedback on improving SSE2 code generation, vectorization 
and the like.

In conclusion, I think GCC is being held back by us sticking
to 20-year old hardware. For GCC to be the best compiler 
possible on current and future systems, we have to start
compiling for those systems by default.

This is essential for GCC's long term relevance.

  -Geert


Re: Change x86 default arch for 4.5?

2010-02-21 Thread Geert Bosch

On Feb 21, 2010, at 12:34, Joseph S. Myers wrote:

> Correct - I said API, not ABI.  The API for C programs on x86 GNU/Linux 
> involves FLT_EVAL_METHOD == 2, whereas that on x86 Darwin involves 
> FLT_EVAL_METHOD == 0 and that on FreeBSD involves FLT_EVAL_METHOD == 2 
> but with FPU rounding precision set to 53 bits so only excess range, not 
> precision, applies.

Your following paragraph "If people really want a new 32-bit x86 ABI..."
threw me off. Your message contained the word ABI 4 times, and API once.
I missed it, apologies.

However, I think this is a bit of a red herring. There
really is no consistent well-defined model for our current 
floating-point semantics on x86. Whether a value is computed with
extra precision or not depends on many factors, such as optimization.

The best way to describe the current situation is that any operation
may or may not keep excess precision or range. Your patch to incur 
more double rounding by going through memory more often does not 
really solve this issue; it will just make results a little more 
consistent, but possibly also less accurate and less precise.

So, in short, we all lose in a quest to support C99
on least-common-denominator hardware. We will not
be able to achieve full compliance, and we will all have to
deal with worse performance due to additional spilling,
while at the same time every other compiler will
happily take advantage of the user's hardware and compile
great code that is faster and produces better results.

  -Geert


Re: Change x86 default arch for 4.5?

2010-02-21 Thread Geert Bosch

On Feb 21, 2010, at 07:13, Richard Guenther wrote:
> The present discussion is about defaulting to at least 486 when not
> configured for i386-linux.  That sounds entirely reasonable to me.

I fully agree with the "at least 486" part. However,
if we only change the default once every 20 years, it seems
we should bump it up more than to just 486...

People who compile GCC from sources, mostly use it to
compile other code from source for their own use.
The only reason to by default generate code for an
older chip than the one on the host, is to distribute
binaries. Why would a GNU compiler by default give
up performance and numerical stability to facilitate
binary distribution? 

Basically, GCC has to compete with other compilers,
such as those from Microsoft and Intel. When GCC 
arbitrarily decides to tie one hand behind its back,
so that code by default targets a 25-year-old
chip, it is no surprise it comes out looking bad.

  -Geert



Re: Change x86 default arch for 4.5?

2010-02-21 Thread Geert Bosch

On Feb 21, 2010, at 20:57, Joseph S. Myers wrote:
> I know some people have claimed (e.g. glibc bug 6981) that you can't 
> conform to Annex F when you have excess precision, but this does not 
> appear to be the view of WG14.

That may be the case, but I really wonder how much sense it
can make to declare things like:

> The expressions x − y and −(y − x) are not equivalent because 1 − 1 is +0 but 
> −(1 − 1) is −0 (in the default rounding direction)

if you then have to wave hands and admit that either side might
be off by an arbitrary amount due to the presence or absence of
double rounding; the whole statement becomes meaningless.

In the end, what counts is thay users have complained for
ages about unexpected floating-point results on x86.
We have always argued that that was just the way things
are on x86: sorry, hardware issue, can't fix.

However, we have progressed 25 years, and through the wonders
of SSE2 we now can have both extended precision when we need
it and accurate, predictable single and double floating-point
for the rest. If we get this issue out of the way, and make
GCC by default have IEEE 754 compliant floating point on hardware
supporting it, that would be a great step forward.

  -Geert


Re: Optimizing floating point *(2^c) and /(2^c)

2010-03-29 Thread Geert Bosch

On Mar 29, 2010, at 13:19, Jeroen Van Der Bossche wrote:

> I've recently written a program where taking the average of 2 floating
> point numbers was a real bottleneck. I've looked into the assembly
> generated by gcc -O3 and apparently gcc treats multiplication and
> division by a hard-coded 2 like any other multiplication with a
> constant. I think, however, that *(2^c) and /(2^c) for floating
> points, where the c is known at compile-time, should be able to be
> optimized with the following pseudo-code:
> 
> e = exponent bits of the number
> if (e > c && e < (0b111...11)-c) {
> e += c or e -= c
> } else {
> do regular multiplication
> }
> 
> Even further optimizations may be possible, such as bitshifting the
> significand when e=0. However, that would require checking for a lot
> of special cases and require so many conditional jumps that it's most
> likely not going to be any faster.
> 
> I'm not skilled enough with assembly to write this myself and test if
> this actually performs faster than how it's implemented now. Its
> performance will most likely also depend on the processor
> architecture, and I could only test this code on one machine.
> Therefore I ask to those who are familiar with gcc's optimization
> routines to give this 2 seconds of thought, as this is probably rather
> easy to implement and many programs could benefit from this.

For any optimization suggestions, you should start with showing some real, 
compilable, code with a performance problem that you think the compiler could 
address. Please include details about compilation options, GCC versions and 
target hardware, as well as observed performance numbers. How do you see that 
averaging two floating point numbers is a bottleneck? This should only be a 
single addition and multiplication, and will execute in a nanosecond or so on a 
moderately modern system.

Your particular suggestion is flawed. Floating-point multiplication is very 
fast on most targets. It is hard to see how on any target with floating-point 
hardware, manual mucking with the representation can be a win. In particular, 
your sketch doesn't at all address underflow and overflow. Likely a complete 
implementation would be many times slower than a floating-point multiply.

  -Geert


Re: Optimizing floating point *(2^c) and /(2^c)

2010-03-31 Thread Geert Bosch

On Mar 29, 2010, at 16:30, Tim Prince wrote:
> gcc used to have the ability to replace division by a power of 2 by an fscale 
> instruction, for appropriate targets (maybe still does).
The problem (again) is that floating point multiplication is 
just too damn fast. On x86, even though the latency may 
be 5 cycles, since the multiplier is fully pipelined, the 
throughput is one multiplication per clock cycle, and that's
for non-vectorized code!

For comparison, the fscale instruction breaks down to 30 µops
or something like that, compared to a single µop for most
forms of floating point multiplication. Given that Jeroen
also needs to do floating-point additions, just bouncing
the values between integer and float registers will be
more expensive than the entire multiplication is in the
first place.

> Such targets have nearly disappeared from everyday usage.  What remains is 
> the possibility of replacing the division by constant power of 2 by 
> multiplication, but it's generally considered the programmer should have done 
> that in the beginning.

No, this is something the compiler does and should do.
It is well understood that in binary floating point,
division by a power of two is identical to multiplication by its reciprocal,
and it's the compiler's job to select the fastest instruction.

  -Geert


Re: New no-undefined-overflow branch

2009-03-05 Thread Geert Bosch

Hi Richard,

Great to see that you're addressing this issue. If I understand correctly,
for RTL all operations are always wrapping, right?

I have been considering adding "V" variants for operations that trap on
overflow. The main reason I have not (yet) pursued this, is the daunting
task of teaching the folders about all these new codes. Initially I'd
like to lower the "V" operations to explicit checks (calling abort() or
raising an exception, depending on the language) during gimplification,
but with the idea of eventually delaying expansion as more code learns
how to handle the new expressions. I already have most of the code necessary
for the expansions, as I now do them during translation of Ada trees to
GENERIC trees. Actually, the new and saner wrapping semantics for
{PLUS,MINUS,MULT}_EXPR simplify these a bit, by avoiding the need to use
unsigned types.

As you are obviously doing something very similar now by introducing "NV"
variants, do you think this would fit in with your scheme? If so, I'd be
happy to try and expand on your work to have wrapping, no-overflow and
overflow-checking variants for basic arithmetic.

  -Geert


Re: New no-undefined-overflow branch

2009-03-06 Thread Geert Bosch


On Mar 6, 2009, at 09:15, Joseph S. Myers wrote:

It looks like only alpha and pa presently have insn patterns such as
addvsi3 that would be used by the present -ftrapv code, but I expect
several other processors also have instructions that would help in
overflow-checking code.  (For example, Power Architecture has versions of
many arithmetic instructions that set overflow flags, so such instructions
could be used followed by conditional traps.)


Most architectures have similar flags or conditions, but during my
work on more efficient overflow checking for Ada I have become less
convinced of the need for them. For Ada code, the overflow
checking is now less expensive than many other checks.

In most cases, one of the operands will be either constant or have
a known sign. Then the overflow check can be expanded as a simple
comparison. The benefit of this is that later optimization passes
can use these tests to derive range information, combine checks,
or fold them. When using some opaque construct, removing it will be hard.

Also, for many languages calling abort() for overflow would not be
desirable, and an exception should be raised. Doing this directly
from expanded code rather than using a trap handler will avoid the
need to write target-specific trap handlers and has the advantage
that a message with location information can easily be included.

In any case, while I'd really like to move the checked signed
integer overflow from Gigi (GNAT-to-GNU tree translator) to
the language independent part of GCC, I want to have the absolute
minimum amount of changes that is necessary to achieve this goal.

Since only Ada uses integer overflow checking at this point, any
non-standard GIMPLE will be a maintenance burden and is likely
to be mishandled by later optimizers. When we lower all checks
during gimplification, we can re-implement -ftrapv while avoiding
all of the pitfalls the old implementation had. In particular,
there has to be a clear distinction between which signed integer
operations must be checked and which not. During compilation many
new operations are created and these operations need to have
clear semantics. To me that is the great improvement of the
no-undefined-overflow branch.

  -Geert


Re: New no-undefined-overflow branch

2009-03-06 Thread Geert Bosch


On Mar 6, 2009, at 04:11, Richard Guenther wrote:


I didn't spend too much time thinking about the trapping variants
(well, I believe it isn't that important ;)).  But in general we would
have to expand the non-NV variants via the trapping expanders
if flag_trapv was true (so yeah, combining TUs with different flag_trapv
settings would be difficult again and it would ask for explicitly
encoding this variant in the IL ...).

The non-NV variants have wrap-around semantics on the no-undefined-overflow
branch, right? I'm not about to change that based on some global flag! :)

I'm proposing something like:
  {PLUS,MINUS,MULT,NEGATE}_EXPR:
- signed integer operation with wrap-around
  {PLUS,MINUS,MULT,NEGATE}NV_EXPR:
- signed integer operations known to not overflow
  {PLUS,MINUS,MULT,NEGATE}V_EXPR:
- signed integer operation with overflow check that traps,
  aborts or raises an exception on overflow


There is of course the problem that we have to be careful not to
introduce new traps via folding, a problem that doesn't exist with
the no-overflow variants (I can simply drop to the wrapping variants).
With for example  (a -/v 10) +/v 10 would you want to preserve
the possibly trapping a -/v 10?  For (a -/v 10) +/v (b -/v 10) do
we want to be careful to not introduce extra traps when simplifying
to (a +/v b) -/v 20?

Indeed, there has to be a very clear boundary as we'd be changing
semantics. The original -ftrapv implementation muddled that issue,
something I absolutely want to avoid.


So while trapping variants can certainly be introduced it looks like
this task may be more difficult.  So lowering them early during
gimplification looks like a more reasonable plan IMHO.


Right, that was my intention. Still, I'll need to add code to
handle the new tree codes in fold(), right?

  -Geert 


Re: New no-undefined-overflow branch

2009-03-06 Thread Geert Bosch


On Mar 6, 2009, at 12:22, Joseph S. Myers wrote:
If you add new trapping codes to GENERIC I'd recommend *not* making fold()
handle them.  I don't think much folding is safe for the trapping codes
when you want to avoid either removing or introducing traps.  Either lower
the codes in gimplification, or handle them explicitly in a few GIMPLE
optimizations e.g. when constants are propagated in, but avoid general
folding for them.


The point here is not to think in terms of the old -ftrapv and trapping
instructions, but instead at the slightly higher level of a well-defined
model of signed integer arithmetic. That is why signed integer overflow
checking and the no-undefined-overflow branch are closely related.

There are essentially two models to evaluate a signed integer expression.
The one Ada uses is that a check may be omitted if the value of the
expression, in absence of the check, would have no effect on the external
interactions of the program:

An implementation need not always raise an exception when a language-defined
check fails.  Instead, the operation that failed the check can simply yield
an undefined result. The exception need be raised by the implementation only
if, in the absence of raising it, the value of this undefined result would
have some effect on the external interactions of the program. In determining
this, the implementation shall not presume that an undefined result has a
value that belongs to its subtype, nor even to the base range of its type,
if scalar. Having removed the raise of the exception, the canonical
semantics will in general allow the implementation to omit the code for the
check, and some or all of the operation itself.


The other one is the one you suggest:
Front ends should set TREE_SIDE_EFFECTS on trapping expressions so that
fold knows it can't discard a subexpression (whose value doesn't matter to
the value of the final expression) containing a trapping expression, e.g.
0 * (a +trap b) needs to evaluate (a +trap b) for its side effects.  With
this done, front ends generating trapping codes for -ftrapv and fold not
trying to optimize the trapping codes, I'd hope fold and the rest of the
language-independent compiler could stop caring about flag_trapv.


Setting TREE_SIDE_EFFECTS seriously limits optimizations. Also, as a
quality-of-implementation issue, while an expression may have an undefined
result, if that result is not used, removing the entire computation is
generally preferable over raising an exception. Arguments can be made for
and against both models, so probably we could make setting of
TREE_SIDE_EFFECTS optional.

  -Geert


Re: [gnat] reuse of ASTs already constructed

2009-04-20 Thread Geert Bosch


On Apr 12, 2009, at 13:29, Oliver Kellogg wrote:


On Tue, 4 Mar 2003, Geert Bosch  wrote:

[...]
Best would be to first post a design overview,
before doing a lot of work in order to prevent spending time
on implementing something that may turn out to have fundamental
problems.


I've done a little experimenting to get a feel for this.

I've looked at the work done toward the GCC compile server but
decided that I want to concentrate on GNAT trees (whereas the
compile server targets the GNU trees.)

Also I am aiming somewhat lower - not making a separate compile
server process but rather extending gnat1 to handle multiple
files in a single invocation.


While this may be an interesting idea, there are some fundamental assumptions
in the compiler that each compilation indeed processes a single compilation
unit, resulting in a single object and .ali file. It would be best to first
contemplate what output a single invocation of the compiler, with multiple
compilation units as arguments, should produce.

How would you decide if a unit needs recompilation if there was no 1:1
correspondence between compilation units and object/.ali files?
Note that unlike many other languages, Ada requires checks to avoid
including out-of-date compilation results in a program.

  -Geert


Re: [gnat] reuse of ASTs already constructed

2009-04-20 Thread Geert Bosch


On Apr 20, 2009, at 14:45, Oliver Kellogg wrote:

It would be best to first contemplate what output a single
invocation of the compiler, with multiple compilation units
as arguments, should produce.


For an invocation
gnat1 a.adb b.adb c.adb
the files a.{s,ali} b.{s,ali} c.{s,ali} are produced.


The back end is not prepared to produce multiple assembly files.
The "gcc" driver program also assumes each invocation produces a
single .s file.

So, if this is what you want to do, you'd have to address all these
underlying limitations first.

  -Geert


Re: [RFC] Switching implementation language to C++

2010-05-31 Thread Geert Bosch

On May 31, 2010, at 14:25, Mark Mitchell wrote:
> That doesn't necessarily mean that we have to use lots of C++ features
> everywhere.  We can use the C (almost) subset of C++ if we want to in
> some places.  As an example, if the Fortran folks want to use C in the
> Fortran front-end, then -- except to the extent required by the common
> interfaces, those files could read as if they were C code.  But, I think
> they'd still need to be compiled with a C++ compiler because they'll
> probably pull in headers that use C++ constructs.

I don't see why the implementation language of the front end should
necessarily be tied to that of the back end. One of the benefits we
should get from switching implementation language is a cleaner interface
between the language-specific parts of the compiler and the shared
back end files.

For the Ada compiler, we've never found it a disadvantage to use
Ada as implementation language, even though the back end is written 
in C. The hard parts of mapping Ada idioms to the intermediate 
languages used in the back end have never been related to the language
used for implementing the front end or back end. If anything, 
the strict separation between front end and back end data
structures has helped avoid inadvertent reuse of representations
for constructs with similar yet subtly different properties.

Ideally, once we have a full C++ definition of the back end interface,
we can use straight type and function definitions, instead of webs
of header files with macro definitions. Maybe, some day, we could
even use Ada directly to interface with libbackend, eliminating the
last remaining non-Ada code in the Ada front end.

However, it seems backwards to decide whether we want to change
implementation language before we even have an outline of a design
for a new interface, or at least some agreement on design goals.
If we do not have a path to rewriting tree.h and friends using 
C++ to raise the abstraction level and improve maintainability
of GCC (while maintaining performance and avoiding overgeneralization
and needless complexity), I am not sure the cost of moving to C++ will
result in many gains.

Once we're using C++, there will be a great temptation to use overly
complex data structures that would have been inconceivable with C.
The best defense against this is a clear design that, at least for
the most significant data structures, specifies the interface that
is going to be used. Here is where we decide how we're going to do
memory management, where we need dynamic data structures, where we
may need dispatching etc. 

If we're just going to get some new power tools for our workshop
and let people have at it, the lessons we'll learn might end up
being more about what not to do, rather than a showcase of their
effective use.

In short, what we seem to be missing is a clear objective on
what we want to accomplish by switching to C++ and how we'll
reach those goals. Without that, switching might be ill-considered
and not in GCC's best interest in the long run.

  -Geert


Re: Using C++ in GCC is OK

2010-06-01 Thread Geert Bosch

On Jun 1, 2010, at 17:41, DJ Delorie wrote:

> It assumes your editor can do block-reformatting while preserving the
> comment syntax.  I've had too many // cases of Emacs guessing wrong //
> and putting // throughout a reformatted // block.

With Ada we have no choice, and only have -- comments. I don't think
I've ever encountered that kind of formatting problem, even though we
definitely have active developers using Emacs.

This seems a weak argument. Besides that, I don't feel strongly
either way. I'd just like to avoid a mixture of // and /* */
if possible.

  -Geert


Re: About stack protector

2010-06-23 Thread Geert Bosch

On Jun 23, 2010, at 22:53, Tomás Touceda wrote:

> I'm starting to dig a little bit in what gcc does to protect the stack
> from overflows and attacks of that kind. I've found some docs and
> patches, but they aren't really up to date. I thought I could get some
> diffs for the parts that manage these features, to see exactly what
> they do and what the changes are between different versions, but I'm
> finding really hard to see where exactly I should look, since there's
> a lot of work done in plenty different areas.

If you use the Ada front end with -fstack-check and -gnato, you
should be pretty safe from any of that. Of course, there always
will be ways to shoot yourself in the foot (such as by using
unchecked conversions to turn integers into pointers and the like),
but it will be hard to accidentally cause any memory overwriting.

  -Geert


Re: GCC and out-of-range constant array indexes?

2010-10-08 Thread Geert Bosch

On Oct 8, 2010, at 18:18, Manuel López-Ibáñez wrote:

> It is possible to do it quite fast. Clang implements all warnings,
> including Wuninitialized, in the FE using fast analysis and they claim
> very low false positives.
> However, there are various reasons why it has not been attempted in GCC:
> 
> * GCC is too slow already at -O0, slowing it down further would not be
> acceptable. So you need a really high-performing implementation.

The Ada front end has very extensive warnings. I don't think
they really contribute measurably to performance.
We don't try to construct call graphs to determine
whether the array reference will be executed or not.
If the line appears in your program, it will cause an
error if executed, so we will warn: either you wrote
dead code, or wrong code.

To avoid false positives in inlined code, code instantiated
from templates and the like, we have a notion of code that
comes from source or not. For many warnings, we will only
post the warning if the code comes from source, that is:
is not generated by the compiler as part of the compilation
process.

  -Geert


Re: PATCH RFA: Do not build java by default

2010-10-31 Thread Geert Bosch

On Oct 31, 2010, at 15:33, Steven Bosscher wrote:
> The argument against disabling java as a default language always was
> that there should be at least one default language that requires
> non-call exceptions. I recall testing many patches without trouble if
> I did experimental builds with just C, C++, and Fortran, only to find
> lots of java test suite failures in a complete bootstrap+test cycle.
> So the second point is, IMVHO, not really true.

Feel free to enable Ada. Builds and tests faster than Java, 
and is known to expose many more middle end bugs, including
ones that require non-call exceptions.

  -Geert


Re: PATCH RFA: Do not build java by default

2010-11-01 Thread Geert Bosch

On Nov 1, 2010, at 00:30, Joern Rennecke wrote:
>> Feel free to enable Ada. Builds and tests faster than Java,
>> and is known to expose many more middle end bugs, including
>> ones that require non-call exceptions.
> 
> But to get that coverage, testers will need to have gnat installed.
> Will that become a requirement for middle-end patch regression testing?

No, the language will only be built if a suitable bootstrap compiler
is present. 

  -Geert


Re: Adding Leon processor to the SPARC list of processors

2010-11-19 Thread Geert Bosch

On Nov 19, 2010, at 11:53, Eric Botcazou wrote:
>> Yes, if all the people who want only one set of libraries agree on what
>> that set shall be (or this can be selected with existing configure flags),
>> this is the simplest way.
> 
> Yes, this can be selected at configure time with --with-cpu and --with-float.
> 
> The default configuration is also straightforward: LEON is an implementation 
> of the SPARC-V8 architecture so --with-cpu=v8 and --with-float=hard.

There is LEON2, which is V7, and LEON3/LEON4, which are V8.
While LEON3 can support all of V8 in hardware, LEON3 is a
configurable system-on-a-chip, targeting both FPGAs and ASICs,
where users can configure and synthesize different aspects of
the CPU:

* CONFIG_PROC_NUM: The number of processor cores.

* CONFIG_IU_V8MULDIV: Implements V8 multiply and divide instructions
  UMUL, UMULCC, SMUL, SMULCC, UDIV, UDIVCC, SDIV, SDIVCC.
  Costs about 8k gates.

* CONFIG_IU_MUL_MAC: Implements the SPARC V8e UMAC/SMAC
  (multiply-accumulate) instructions with a 40-bit accumulator

* CONFIG_FPU_ENABLE: Enable or disable floating point unit

Apart from these settings that determine whether instructions are
present at all, other settings allow selection of the FPU implementation
(trading off between cycle count, area and timing), such as:

* CONFIG_IU_MUL_LATENCY_2: Implementation options for the integer multiplier.
  Type      Implementation              Issue rate/latency
  2-clocks  32x32 pipelined multiplier  1/2
  4-clocks  16x16 standard multiplier   4/4
  5-clocks  16x16 pipelined multiplier  4/5

* CONFIG_IU_LDELAY: One cycle load delay for best performance, or 2-cycles
  to improve timing at the cost of about 5% reduced performance.

CONFIG_FPU_ENABLE Y/N would correspond to --with-float=hard/soft, and
I believe setting CONFIG_IU_V8MULDIV to Y/N requires --with-cpu=V8/V7,
is that correct? I think it would make sense to build these as multilibs,
so the user can experiment to find out performance impacts of
the various hardware configurations on generated code.
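For illustration, the selection described above would be driven by configure
flags along these lines (the flag spellings are from the quoted message; the
target triplet and build paths are hypothetical placeholders):

```shell
# Illustrative configure invocations for the two LEON variants;
# target triplet and relative paths are placeholders.
../gcc/configure --target=sparc-elf --with-cpu=v8 --with-float=hard
../gcc/configure --target=sparc-elf --with-cpu=v7 --with-float=soft
```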

I wonder if it also would be worthwhile to have compiler options
for fpu=fast/slow and multiply=fast/slow, so we can schedule
appropriately. For the FPU, issue-rate/latency are as follows:
  GR FPU:  1/4, with FDIV? 16 and FSQRT? 24 cycles,
non-pipelined on separate unit
  GR FPU Lite: 8/8, with FDIVS/FDIVD/FSQRTS/FSQRTD 31/57/46/57 cycles,
non-pipelined on same unit

While the FPU Lite is not pipelined, integer instructions can be
executed in parallel with a FPU instruction as long as no new FPU
instructions are pending.

  -Geert


Re: hang in acats testsuite test cxg2014 on hppa2.0w-hp-hpux11.00

2006-02-15 Thread Geert Bosch

On Feb 15, 2006, at 11:44, John David Anglin wrote:

I missed this "new" define and will try it.  Perhaps, this should
take account of the situation when TARGET_SOFT_FLOAT is true.  For
example,


When emulating all software floating-point, we still don't want to
use 128-bit floats. The whole idea is that Long_Long_Float is the
widest supported type that will still give reasonable performance.
In many cases this type is used for all computation, and changing that
to a 128-bit type is not a good idea until such a type is really
supported efficiently by hardware.

Accuracy requirements mandated by Annex G of the Ada standard
make it quite difficult to correctly implement this, since all
real and complex elementary functions will need to compute
accurate results. Because 128-bit double-extended IEEE
floating-point is not supported by hardware yet, this work has
not been done, and it would be a tremendous effort, since there
are no good system math libraries yet.

  -Geert


Re: hang in acats testsuite test cxg2014 on hppa2.0w-hp-hpux11.00

2006-02-15 Thread Geert Bosch


On Feb 15, 2006, at 13:28, John David Anglin wrote:


Understood.  My question was what should the define for
WIDEST_HARDWARE_FP_SIZE be when generating code for a target
with no hardware floating point support (e.g., when
TARGET_SOFT_FLOAT is true)?


Practically, I'd say it should be 64, as it's a bit of a
universal assumption that you at least have 32-bit and 64-bit
float types, and possibly an 80-bit one (formatted up to 128 bits).
Of course, the idea with soft float is not to reflect reality,
but rather to have a reasonable match with the expectations of the
software you'd want to run.
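As a sketch, such a default could be expressed in a target header. The macro
name is the one under discussion; the fragment itself is hypothetical:

```c
/* Hypothetical target-header fragment: even with TARGET_SOFT_FLOAT,
   claim 64 bits as the widest "hardware" FP size, so Long_Long_Float
   maps to a 64-bit double rather than an emulated 128-bit type.  */
#define WIDEST_HARDWARE_FP_SIZE 64
```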


Re: Ada subtypes and base types

2006-03-15 Thread Geert Bosch


On Mar 16, 2006, at 05:09, Robert Dewar wrote:

Not quite right. If you have an uninitialized variable, the value is
invalid and may be out of bounds, but this is a bounded error situation,
not an erroneous program. So the possible effects are definitely NOT
unbounded, and the use of such values cannot turn a program erroneous.
(that's an Ada 95 change, this used to be erroneous in Ada 83).


Actually, that's a good point and raises some potential issues:
if we never establish the invariant that a value of a type is in
range, we can only use the base range for variables that might be
used uninitialized. Any read of such a variable would then involve
a range check.

  package Uninitialized is
 N : Positive;
  end Uninitialized;

  with Uninitialized;
  procedure Test is
 for J in 1 .. Uninitialized.N loop
...
 end loop;
  end Test;

In this case, GCC might replace the loop with
   declare
      J : Integer := 1;
   begin
      while J /= Uninitialized.N loop
         ...
         J := J + 1;
      end loop;
   end;

which would be incorrect for N = 0.

  -Geert


Re: Ada subtypes and base types

2006-03-16 Thread Geert Bosch


On Mar 16, 2006, at 10:43, Richard Guenther wrote:

Uh - what do you expect here??  Does the Ada standard
require a out-of-range exception upon the first use of N?
In this case, the frontend needs to insert a proper check.
You cannot expect the middle-end to avoid the above
transformation, so this is a frontend bug.


That's exactly the point I'm making.

  -Geert


Re: GNU Pascal branch

2006-04-03 Thread Geert Bosch


On Apr 3, 2006, at 09:34, Waldek Hebisch wrote:
2) Adjusting gpc development model. In particular, gpc uses rather short
   feedback loop: new features are released (as alphas) when they are ready.

   This is possible because gpc uses stable backend, so that users are
   exposed only to front end bugs. With development backends there is a
   danger that normal user will try new front end features only after
   full gcc release.


For Ada, we use the same front end sources with different back end
versions. Typically, the only changes are in the few files that convert
Ada trees to GCC trees. So, every day you build your latest gpc with
all cool new features. One build uses the latest GCC back end and
the other one uses a stable release series.

-Geert


Re: optimizing calling conventions for function returns

2006-05-25 Thread Geert Bosch


On May 23, 2006, at 11:21, Jon Smirl wrote:


A new calling convention could push two return addresses for functions
that return their status in EAX. On EAX=0 you take the first return,
EAX != 0 you take the second.


This seems the same as passing an extra function-pointer
argument and calling that instead of doing a regular return.
Tail-call optimization should turn the call into a jump.

Why do you think a custom ABI is necessary?

  -Geert


Re: optimizing calling conventions for function returns

2006-05-25 Thread Geert Bosch

On May 25, 2006, at 13:21, Jon Smirl wrote:

   jmp   *4($esp)

This is slightly faster than addl, ret.


The point is that this is only executed in the error case.

But my micro scale benchmarks are extremely influenced by changes in
branch prediction. I still wonder how this would perform in large
programs.


The jmp *4($esp) doesn't confuse the branch predictors. Basically
the assumption is that call and ret instructions match up. Your
addl, ret messes up that assumption, which means the return predictions
will all be wrong.

Maybe future link-time optimizations might be able to handle
this kind of error-exit code automatically, but for now I think
your best bet is handling this explicitly, or just not worrying
about the minor inefficiency.

  -Geert


Re: [RFC] fold Reorganization Plan

2005-02-12 Thread Geert Bosch
On Feb 12, 2005, at 12:57, Nathan Sidwell wrote:
Well, it depends on the FE's language definition :)  For C and C++ the
above is not a constant-expression as the language defines it.  I can
see a couple of obvious ways to deal with this with an FE specific
constant expression evaluator,
1) during parsing set a flag if the expression contains something
not permitted for a constant-expression
2) a lazy folder returns 'error' when it meets something not allowed
(and if ?: is allowed, it must go down each of its branches to
determine if they have something banned).
Front ends should be responsible for doing any constant folding that
their language definition requires. Otherwise, you'd get the strange
situation that the legality of a program depends on the strength of the
optimizers, the compilation flags used or even target properties.
Your proposal to have the tree folders check whether the program
obeys C/C++ language semantics seems fundamentally flawed.
GCC's middle- and back end should not be required to do anything for
a function that it has determined will never be called. Whether an
expression is constant for the middle end may depend on many factors,
including whether a certain function call could be expanded inline or
not.

Constant folding as required by language standards has a very precise
definition, and does not depend on compilation options or optimization
parameters. When the FE hands off a tree to the middle end, it asserts
that the program conforms to the static semantics of the programming
language.
This gives the optimizers the freedom to do any transformations, as
long as it conforms to the language-independent definition of GIMPLE.
  -Geert


Re: [RFC] fold Reorganization Plan

2005-02-12 Thread Geert Bosch
On Feb 12, 2005, at 14:57, Nathan Sidwell wrote:
I entirely agree.  Unfortunately what we have now is not that --
fold is doing both optimization and (some) C & C++ semantic stuff.
Your proposal to have the tree folders check whether the program
obeys C/C++ language semantics seems fundamentally flawed.
OK, we're in violent agreement then! :)
I misunderstood your message.
That is not my proposal.  I'm sorry if I gave the impression it was,
but it isn't.  (What I meant by a tree folders in that regard
was an FE-specific folder.)
OK, then things make a lot more sense!


Re: [RFC] fold Reorganization Plan

2005-02-12 Thread Geert Bosch
On Feb 12, 2005, at 15:58, Richard Kenner wrote:
As several front-end people have suggested, calling fold whilst
constructing parse trees shouldn't be necessary (as shown by the
shining examples of g77 and GNAT).
I don't follow.  GNAT certainly calls fold for every expression it 
makes.
But GNAT doesn't rely on GCC for constant folding static expressions,
or even call the back end before semantic analysis has finished!
This is what we're talking about.
  -Geert


Re: Ada totally borken on x86-linux

2005-02-18 Thread Geert Bosch
Jason,
Your patch has caused a lot of breakage for many platforms
and languages. It seems clear that it is far too intrusive
to apply in this stage.
Please revert your patch.
Thanks in advance,
  -Geert
On Feb 18, 2005, at 12:14, Eric Botcazou wrote:
Regression went from 16 to 143:
http://gcc.gnu.org/ml/gcc-testresults/2005-02/msg00758.html
Yup.
Any idea of what may have caused this?
2005-02-17  Jason Merrill  <[EMAIL PROTECTED]>
PR mudflap/19319, c++/19317
* gimplify.c (gimplify_modify_expr_rhs) [CALL_EXPR]: Make return
slot explicit.



Re: If you had a month to improve gcc build parallelization, where would you begin?

2013-04-03 Thread Geert Bosch
On Apr 3, 2013, at 11:27, Simon Baldwin  wrote:
> Suppose you had a month in which to reorganise gcc so that it builds
> its 3-stage bootstrap and runtime libraries in some massively parallel
> fashion, without hardware or resource constraints(*).  How might you
> approach this?

One of the main problems in large build machines is that a few steps
in the compilation take a very long time, such as compiling
insn-recog.o (1m30) and insn-attrtab.o (2m05). This is on our largish
48-core AMD machine. Also genattrtab and genautomata are part of
the critical path, IIRC. These compilations, which are repeated
during the bootstrap, take a significant part of the total sequential
bootstrap time of about 20 min real, 100 min user and 10 min sys
on this particular machine.

I think the easiest way in general to achieve more parallelization
during the bootstrap is to speculatively reuse the old result (.o
file or other output) and in parallel verify that, yes, eventually
we produce the same result. This has to be done carefully, so that
we don't accidentally skip the verification, negating the self-testing
purpose of the bootstrap.

  -Geert



Re: If you had a month to improve gcc build parallelization, where would you begin?

2013-04-03 Thread Geert Bosch

On Apr 3, 2013, at 23:44, Joern Rennecke  wrote:

> How does that work?
> The binaries have to get the all the machines of the clusters somewhere.
> Does this assume you are using NFS or similar for your build directory?
> Won't the overhead of using that instead of local disk kill most of the
> parallelization benefit of a cluster over a single SMP machine?

This will be true regardless of communication method. There is so little
opportunity for parallelism that anything more than 4-8 local cores is
pretty much wasted. On a 4-core machine, more than 50% of the wall time 
is spent on things that will not use more than those 4 cores regardless.
If the other 40-50% or so can be cut by a factor 4 compared to 4-core
execution, we still are talking about at most a 30% improvement on the 
total wall time. Even a small serial overhead for communicating sources
and binaries will eat into that 30%.

We need to improve the Makefiles before it makes sense to use more
parallelism.  Otherwise we'll just keep running into Amdahl's law.

  -Geert




Re: If you had a month to improve gcc build parallelization, where would you begin?

2013-04-09 Thread Geert Bosch

On Apr 9, 2013, at 22:19, Segher Boessenkool  wrote:

> Some numbers, 16-core 64-thread POWER7, c,c++,fortran bootstrap:
> -j6:  real57m32.245s
> -j60: real38m18.583s

Yes, these confirm mine. It doesn't make sense to look at more
parallelization before we address the serial bottlenecks.
The -j6 parallelism level is about where current laptops
are. Having a big machine doesn't do as much as having fewer,
but faster, cores.

We should be able to do far better. I don't know how the Power7
threads compare in terms of CPU throughput, but going from -j6 to
-j48 on our 48-core AMD system should easily yield a 6x speedup,
as all are full cores. Yet we see similarly limited improvements to
yours, even though we get almost perfect scaling in many test suite
runs that are dominated by compilations.

The two obvious issues:
  1. large sequential chains of compiling/running genattrtab followed
 by compiling insn-attrtab.c and linking the compiler
  2. repeated serial configure steps

For 1. we need to somehow split the file up in smaller chunks.
For 2. we need to have efficient caching.

Neither is easy...

  -Geert



Re: return statement in a function with the naked attribute

2013-05-02 Thread Geert Bosch

On May 3, 2013, at 00:15, reed kotler  wrote:

> There was some confusion on the llvm list because some tests were run on 
> targets that did not support the naked attribute.
> 
> I think we are thinking now that the return statement should not be emitted 
> unless explicitly requested.
> 
> It's not totally clear in the gcc manual so that is why I was asking.

It clearly is an error to have a function that doesn't return.
So, what you're really asking is: "What are the semantics of a (naked)
function that doesn't return?"

I think it would make sense for a compiler to emit a special error return
(such as abort()) at the end of such a function, with the expectation that
this code usually would be unreachable and optimized away.

I don't think it makes sense to try and define any other semantics
for a function that doesn't explicitly return a value.

  -Geert

Re: [RFC] Detect most integer overflows.

2013-11-08 Thread Geert Bosch

On Oct 29, 2013, at 05:41, Richard Biener  wrote:

> For reference those
> (http://clang.llvm.org/docs/LanguageExtensions.html) look like
> 
>  if (__builtin_umul_overflow(x, y, &result))
>return kErrorCodeHackers;
> 
> which should be reasonably easy to support in GCC (if you factor out
> generating best code and just aim at compatibility).  Code-generation
> will be somewhat pessimized by providing the multiplication result
> via memory, but that's an implementation detail.

I've done the overflow checking in Gigi (Ada front end). Benchmarking
real world large Ada programs (where every integer operation is checked,
including array index computations etc.), I found the performance cost 
*very* small (less than 1% on typical code, never more than 2%). There
is a bit more cost in code size, but that is mostly due to the fact that
we want to generate error messages with correct line number information
without relying on backtraces.

The rest of the run time checks in Ada (especially index checks and range
checks) were far more costly (more on the order of 10-15%, but very
variable depending on code style).

A few things helped to make the cost small: the biggest one is that
typically one of the operands is known to be nonnegative or positive.
Gigi will use Ada type information, and Natural or Positive integer
variables are very common.  So, if you compute X + C with C positive,
you can write the conditional expression as:
(if X < Integer'Last - C then X + C else raise Constraint_Error)

On my x86-64 this generates something like:
__ada_add:
00  cmpl$0x7fff,%edi
06  je  0x000c
08  leal0x01(%rdi),%eax
0b  ret
0c  leaq0x000d(%rip),%rdi
13  pushq   %rax
14  movl$0x0003,%esi
19  xorl%eax,%eax
1b  callq   ___gnat_rcheck_CE_Overflow_Check

While this may look like a lot, these operations are expanded
inline, and only the first three are on the normal execution
path. As the exception raise is a No_Return subprogram, it will
be moved to the end of the file. The jumps will both statically
and dynamically be treated as not-taken, and have very little cost.

Additionally, the comparison is visible to the optimizers, in
effect giving more value-range information which can be used
for optimizing away further checks. The drawback of using any
"special" new operations is that we lose that aspect.

For the less common case in which neither operand has a known
sign, widening to 64-bits is the straightforward solution. For Ada,
we have a mode where we do this kind of widening for entire
expressions, so we only have to check on the final assignment.
The semantics here are that you'd get the mathematically correct
result, even if there was an intermediate overflow. The drawback
of this approach is that an overflow check may not fail, but
that suppressing the checks removes the widening and causes
wrong answers.

  -Geert




Re: [RFC] Detect most integer overflows.

2013-11-26 Thread Geert Bosch

On Nov 9, 2013, at 02:48, Ondřej Bílka  wrote:
>> I've done the overflow checking in Gigi (Ada front end). Benchmarking
>> real world large Ada programs (where every integer operation is checked,
>> including array index computations etc.), I found the performance cost 
>> *very* small (less than 1% on typical code, never more than 2%). There
>> is a bit more cost in code size, but that is mostly due to the fact that
>> we want to generate error messages with correct line number information
>> without relying on backtraces.
>> 
> Overhead is mostly from additional branches that are not taken. We need
> more accurate measure of cache effects than code size, for example
> looking to increased number icache hits which will not count code that
> is never executed.

Indeed, and execution time testing shows there isn't a significant
increase in icache pressure.

>> [...]
>> A few things helped to make the cost small: the biggest one is that
>> typically one of the operands is known to be nonnegative or positive.
>> Gigi will use Ada type information, and Natural or Positive integer
>> variables are very common.  So, if you compute X + C with C positive, 
>> you can write the conditional expression as:
> 
> On x64 effect of this analysis is small, processor does overflow detection for 
> you.

The problem as always is that of pass ordering. If we encapsulate
the overflow check in some kind of builtin that we'd expand to
processor-specific instructions only very late in the compilation
process, then the early optimizers cannot do their work.

> By just expanding out to regular additions, comparisons and control
> flow we can avoid this problem.
> 
>> (if X < Integer'Last - C then X + C else raise Constraint_Error)
>> 
> 
>> On my x86-64 this generates something like:
>> 00   cmpl$0x7fff,%edi
>> 06   je  0x000c
>> 08   leal0x01(%rdi),%eax
>> 0b   ret
>> [..]
> 
> This has redundant compare instruction that cost a cycle and 6 bytes.
> You can just write
> 
>   0:  83 c7 01add$0x1,%edi
>   3:  71 03   jno0x8
> [...]
> When you know that one operand is positive or you deal with unsigned
> then you could replace jno with jnc which is bit faster on sandy bridge
> processors and later as add, jnc pair is macro-fused but add jno is not.

Indeed that is ideal, assuming condition flags are dead, but I think
that the right way to do that is by late combine-like optimizations
after instruction selection.

In looking at generated code in actual programs, most checks are
optimized away. This is more important than saving a cycle here or
there in the much smaller number of checks that remain. After all,
we all "know" that our programs will never fail any overflow checks,
so it is just a matter of the compiler being smart enough to prove
this. While it won't fully achieve that goal (halting problem, etc.),
localized analysis is sufficient to prove that a large fraction of
checks cannot fail. I'm afraid that premature optimization will
always be a loss here.

Probably combine, in combination with some machine-specific instruction
patterns, could be taught to do some of the optimizations you
mention, and that would have as advantage that they'd be also
applicable to manual tests people write.

  -Geert



Re: weird optimization in sin+cos, x86 backend

2012-02-09 Thread Geert Bosch

On Feb 9, 2012, at 08:46, Andrew Haley wrote:
> On 02/09/2012 01:38 PM, Tim Prince wrote:
>> x87 built-ins should be a fair compromise between speed, code size, and 
>> accuracy, for long double, on most CPUs.  As Richard says, it's 
>> certainly possible to do better in the context of SSE, but gcc doesn't 
>> know anything about the quality of math libraries present; it doesn't 
>> even take into account whether it's glibc or something else.
> 
> Yeah.  On x86_64 glibc, we should really get the sincos bug fixed.  And
> it really is a bug: cos() and sin() are correctly rounded but sincos()
> isn't.

It seems to be a persistent problem that we need to rely on a system-provided
C library, instead of a math library that is installed with the compiler.

For non-C languages, such as Ada, we need to meet specific accuracy
requirements. However, we cannot distribute the latest glibc with the
compiler for each target platform. In particular, it is common for
users to install new compilers on (very) old OS versions. Even today,
we cannot rely on the standard libm to be C99 compliant and provide
a relative error of 2 epsilon (2-4 ulps) or better on ANY platform.
So, the only way to have a conforming Ada implementation is to roll
our own. It would be so much better if we could work on a common
library that can be used by all languages.

Given the fact that GCC already needs to know pretty much everything
about these functions for optimizations and constant folding, and is
in the best situation to choose specific implementations (-ffast-math
or not, -frounding-math or not, -ftrapping-math or not, etc), specific
optimizations (latency/throughput/vector optimized)  and perform 
(partial) inlining, shouldn't we have a math library directly within GCC?

  -Geert


Re: weird optimization in sin+cos, x86 backend

2012-02-09 Thread Geert Bosch

On Feb 9, 2012, at 10:28, Richard Guenther wrote:
> Yes, definitely!  OTOH last time I added the toplevel libgcc-math directory
> and populated it with sources from glibc RMS objected violently and I had
> to remove it again.  So we at least need to find a different source of
> math routines to start with (with proper licensing terms).
Right, I knew there had been issues like this with the glibc stuff.
While I'm open to trying to deal with RMS on this, we indeed might want
to go a different route. So much for free software.
> 
> I'd definitely like to see this library also to host vectorized routines and
> provide a myriad of entry-points.
Exactly. In particular, I'd expect to see dozens of variations of some
important performance-sensitive functions like sin and cos, while others
will just have one reasonably accurate machine-independent reference
implementation.

> So - do you have an idea what routines we can start off with to get
> a full C99 set of routines for float, double and long double?  The last
> time I was exploring the idea again I was looking at the BSD libm.

I would think fdlibm might be a good starting point for double. I don't
know of any appropriately licensed library that provides good functions
for all types.

Maybe the answer is to provide an infrastructure to start adding
functions, falling back on libm where needed. We'll need to think
about how to accommodate multiple versions with varying accuracy
and speed and how to select from these.

I think it would make sense to have a check list of properties, and
use configure-based tests to categorize implementations. These tests
would be added as we go along.

Criteria:

 [ ] Conforms to C99 for exceptional values 
 (accepting/producing NaNs, infinities)

 [ ] Handles non-standard rounding modes,
 trapping math, errno, etc.

 [ ] Meets certain relative error bounds,
 copy from Ada Annex G numerical requirements
 (typically between 2 eps and 4 eps for much of range)

 [ ] Guaranteed error less than 1 ulp over all arguments,
 typical max. error close to 0.5 ulp.

 [ ] Correctly rounded for all arguments

While I think it would be great if there were a suitable
GNU libm project that we could directly use, this seems to only
make sense if this could be based on the current glibc math
library. As far as I understand, it is unlikely that we
can make this happen in the near future.

However, even if that were possible, I think it is necessary for 
us to be able to inline math functions directly in user programs, 
without those programs having to be released under the GPL.

It just isn't possible to get best-in-class performance by calling
external functions. We need to be able to inline all code and fully
optimize it using all our existing and future optimizers.

  -Geert


Re: weird optimization in sin+cos, x86 backend

2012-02-09 Thread Geert Bosch

On Feb 9, 2012, at 12:55, Joseph S. Myers wrote:

> No, that's not the case.  Rather, the point would be that both GCC's 
> library and glibc's end up being based on the new GNU project (which might 
> take some code from glibc and some from elsewhere - and quite possibly 
> write some from scratch, taking care to ensure new code is properly 
> commented and all constants are properly explained with free software code 
> available to calculate all the precomputed tables).  (Is 
> sysdeps/ieee754/dbl-64/uatan.tbl in glibc - a half-megabyte file of 
> precomputed numbers - *really* free software?  No doubt it wouldn't be 
> hard to work out what the numbers are from the comment "x0,cij for 
> (1/16,1)" and write a program using MPFR to compute them - but you 
> shouldn't need such an exercise to figure it out, such generated tables 
> should come with free software to generate them since the table itself 
> certainly isn't the preferred form for modification.)

I don't agree having such a libm is the ultimate goal. It could be
a first step along the way, addressing correctness issues. This
would be great progress, but does not remove the need for having
at least versions of common elementary functions directly integrated
with GCC.

In particular, it would not address the issue of performance. For 
that we need at least a permissive license to allow full inlining, 
but it seems unlikely to me that we'd be able to get glibc code 
under those conditions any time soon.

I'd rather start collecting suitable code in GCC now. If/when a 
libm project materializes, it can take our freely licensed code 
and integrate it. I don't see why we need to wait. 
Still, licensing is not the only issue for keeping this in GCC.

From a technical point of view, I see many reasons to tightly couple
implementations of elementary functions with the compiler like we
do now for basic arithmetic in libgcc. On some targets we will
want to directly implement elementary functions in the back end,
such as we do for sqrt on ia64 today.  In other cases we may want
to do something similar in machine independent code.

I think nowadays it makes as little sense to have sqrt, sin, cos,
exp and ln and other common simple elementary functions in an
external library, as it would be for multiplication and division to
be there. Often we'll want to generate these functions inline taking
advantage of all knowledge we have about arguments, context, rounding
modes, required accuracy, etc.

So we need both an accurate libm, and routines with a permissive
license for integration with GCC.

  -Geert


Re: weird optimization in sin+cos, x86 backend

2012-02-10 Thread Geert Bosch

On Feb 9, 2012, at 15:33, Joseph S. Myers wrote:
> For a few, yes, inline support (such as already exists for some functions 
> on some targets) makes sense.  But for some more complicated cases it 
> seems plausible that LTO information in a library might be an appropriate 
> way of inlining while allowing the functions to be written directly in C.  
> (Or they could be inline in a header provided with GCC, but that would 
> only help for use in C and C++ code.)
But neither LTO nor header-based inlining would be compatible with LGPL, right?

>> In particular, it would not address the issue of performance. For 
>> that we need at least a permissive license to allow full inlining, 
>> but it seems unlikely to me that we'd be able to get glibc code 
>> under those conditions any time soon.
> 
> Indeed, if we can't then other sources would need using for the functions.  
> But it seems worth explaining the issues to RMS - that is, that there is a 
> need for permissively licensed versions of these functions in GNU, the 
> question is just whether new ones are written or taken from elsewhere or 
> whether glibc versions can be used for some functions.

I'm skeptical that talking to RMS will result in relicensing of these glibc
functions. If we're writing new ones (or incorporate libraries such as fdlibm),
we don't need permission.

>> I'd rather start collecting suitable code in GCC now. If/when a 
> 
> Whereas I'd suggest: start collecting in a separate project and have GCC 
> importing from that project from the start.
That seems mostly like a matter of terminology. We can declare that
the new libm project currently lives in gcc/libm, has a webpage at
gcc.gnu.org/wiki/LibM and uses the GCC mailinglists with [libm] in
the subject. It's fine with me either way.

>> libm project materializes, it can take our freely licensed code 
>> and integrate it. I don't see why we need to wait. 
> 
> I don't think of it as waiting - I think of it as collecting the code now, 
> but in an appropriate place, and written in an appropriate way (using 
> types such as binary32/binary64/binary128 rather than particular names 
> such as float/double/long double, recording assumptions such as whether 
> the code will work on x87 with excess precision and what it needs for 
> rounding modes / exceptions / subnormals support).  Aim initially for GCC 
> use, but try to put things in a more generally usable form.

Agreed, we probably should use such types regardless of where
the library resides. Note that for some functions we may want to
have dozens of implementations, or rather a few implementations
that can be customized in myriad ways. In all cases I would
expect the interfaces to be at the source level, or through a
precompiled static library with LTO bytecode, in addition to
libraries using the existing libm interface.

>> I think nowadays it makes as little sense to have sqrt, sin, cos,
>> exp and ln and other common simple elementary functions in an
>> external library, as it would be for multiplication and division to
>> be there. Often we'll want to generate these functions inline taking
> 
> I've previously suggested GNU libm (rather than glibc) as a natural master 
> home for the soft-fp code that does provide multiplication and division 
> for processors without them in hardware
> 
> (Different strategies would obviously be optimal for many functions on 
> soft-float targets, although I doubt there's much demand for implementing 
> them since if you care much for floating-point performance you'll be using 
> a processor with hardware floating point.)


The same reasoning goes here: it would be best if we have a flexible
interface with the compiler, so we can for example have entry points
that accept and return soft-floats in "unpacked" form. Whether performance
of soft-float is important or not is debatable.

  -Geert


Re: weird optimization in sin+cos, x86 backend

2012-02-10 Thread Geert Bosch

On Feb 10, 2012, at 05:07, Richard Guenther wrote:

> On Thu, Feb 9, 2012 at 8:16 PM, Geert Bosch  wrote:
>> I don't agree having such a libm is the ultimate goal. It could be
>> a first step along the way, addressing correctness issues. This
>> would be great progress, but does not remove the need for having
>> at least versions of common elementary functions directly integrated
>> with GCC.
>> 
>> In particular, it would not address the issue of performance. For
>> that we need at least a permissive license to allow full inlining,
>> but it seems unlikely to me that we'd be able to get glibc code
>> under those conditions any time soon.
> 
> I don't buy the argument that inlining math routines (apart from those
> we already handle) would improve performance.  What will improve
> performance is to have separate entry points to the routines
> to skip errno handling, NaN/Inf checking or rounding mode selection
> when certain compilation flags are set.  That as well as a more
> sane calling convention for, for example sincos, or in general
> on x86_64 (have at least _some_ callee-saved XMM registers).

I'm probably oversimplifying a bit, but I see extra entry points as
being similar to inlining. When you write sin (x), this is a function 
not only of x, but also of the implicit rounding mode and special 
checking options. With that view specializing the sin function for 
round-to-even is essentially a form of partial inlining.

Also, evaluation of a single math function typically has high
latency as most instructions depend on results of previous instructions.
For optimal throughput, you'd really like to be able to schedule
these instructions. Anyway, those are implementation details.

The main point is that we don't want to be bound by a
frozen interface with the math routines. We'll want to change
the interface when we have new functionality or optimizations.

> The issue with libm in glibc here is that Drepper absolutely does
> not want new ABIs in libm - he believes that for example vectorized
> routines do not belong there (nor the SSE calling-convention variants
> for i686 I tried to push once).

Right. I even understand where he is coming from. Adding new interfaces
is indeed a big deal as they'll pretty much have to stay around forever.
We need something more flexible that is tied to the compiler and not
a fixed interface that stays constant over time. Even if we'd add things
to glibc now, it takes many years for that to trickle down. Of course,
a large number of GCC users don't use glibc at all.

>> I'd rather start collecting suitable code in GCC now. If/when a
>> libm project materializes, it can take our freely licensed code
>> and integrate it. I don't see why we need to wait.
>> Still, licensing is not the only issue for keeping this in GCC.
> 
> True.  I think we can ignore glibc and rather think as of that newlib
> might use it if we're going to put it in toplevel src.

Agreed.

>> So we need both an accurate libm, and routines with a permissive
>> license for integration with GCC.
> 
> And a license, that for example allows installing a static library variant
> with LTO bytecode embedded so we indeed _can_ inline and re-optimize it.
> Of course I expect the fastest paths to be architecture specific assembly
> anyway ...

Yes, that seems like a good plan. Now, how do we start this?
As a separate project or just as part of GCC?

  -Geert



Re: weird optimization in sin+cos, x86 backend

2012-02-13 Thread Geert Bosch


> On 2012-02-09 12:36:01 -0500, Geert Bosch wrote:
>> I think it would make sense to have a check list of properties, and
>> use configure-based tests to categorize implementations. These tests
>> would be added as we go along.
>> 
>> Criteria:
>> 
>> [ ] Conforms to C99 for exceptional values 
>> (accepting/producing NaNs, infinities)
> 
> C is not the only language. Other languages (e.g. Java) may have
> different rules. And IEEE 754-2008 also specifies different rules.
> And you may want to consider LIA-2 too, which is again different...

True, but I tried to keep the number of criteria small. Actually, 
my main interest here is allowing conformance with the Ada standard,
see chapter G.2 
(http://www.adaic.org/resources/add_content/standards/05rm/html/RM-G-2.html),
so I'm not looking at this with a "C-only" mindset. :-)

However, even though the Ada requirements are different, it would
still be tremendously helpful to be able to rely on the C99 standard,
and then just deal with other cases separately. But yes, we might
want to include even more configurability.

>> [ ] Handles non-standard rounding modes,
>> trapping math, errno, etc.
> 
> By "non-standard rounding modes", I assume you mean "non-default
> rounding modes".
Yes, indeed.
> 
>> [ ] Meets certain relative error bounds,
>> copy from Ada Annex G numerical requirements
>> (typically between 2 eps and 4 eps for much of range)
> 
> FYI, OpenCL also has numerical requirements.
They seem similar, though they only apply to single precision,
while the rules defined by Ada are parametrized by type.
Where OpenCL specifies an error of 4 ulps, Ada specifies a
relative error of 2 epsilon. For values just below a
power of 2, 2 epsilon equals 4 ulps, while for values
just above, errors in ulps and in eps are equivalent
(2 epsilon equals 2 ulps).
So, the Ada requirements are a bit tighter in general.
Probably we can define something that is the union
of both sets of requirements. 

>> [ ] Guaranteed error less than 1 ulp over all arguments,
>> typical max. error close to 0.5 ulp.
> 
> Instead, I would say faithful rounding (this is slightly stricter
> for results close to powers of 2).
Yes, that is better.
> 
>> [ ] Correctly rounded for all arguments
> 
> I would add:
> 
>   [ ] Symmetry (e.g. cos(-x) = cos(x), sin(-x) = - sin(x)) in the
>   symmetric rounding modes.
> 
>   [ ] Monotonicity (for monotonous functions).
> 
> (note that they are implied by correct rounding).
Right, my goal was to have just a few different buckets that fit
well with GCC compilation options and allow categorization of 
existing functions. 

I guess it should be a few check boxes and a numerical level.
This would be a categorization for source code, not compiled
code, where we might have additional copies based on optimization
(size/latency/throughput/vectorization).

Properties:

  [ ]  Conforms to C99 for exceptional values 
   (accepting/producing NaNs, infinities)

  [ ]  Handles non-default rounding modes,
   trapping math, errno, etc.

  [ ]  Requires IEEE compliant binary64 arithmetic
   (no implicit extended range or precision)

  [ ]  Requires IEEE compliant binary80 arithmetic
   (I know, not official format, but YKWIM)
 
Accuracy level:

  0 - Correctly rounded

  1 - Faithfully rounded, preserving symmetry and monotonicity

  2 - Tightly approximated, meeting prescribed relative error
  bounds. Conforming to OpenCL and Ada Annex G "strict mode"
  numerical bounds.

  3 - Unspecified error bounds

Note that currently of all different operating systems we (AdaCore)
support for the GNAT compiler, I don't know of any where we can rely
on the system math library to meet level 2 requirements for all
functions and over all of the range! Much of this is due to us 
supporting OS versions as long as there is vendor support, so while 
SPARC Solaris 9 and later are fine, Solaris 8 had some issues. 
GNU Linux is quite good, but has issues with the "pow" function for
large exponents, even in current versions, and even though Ada
allows a relative error of (4.0 + |Right · log(Left)| / 32.0) 
for this function, or up to 310 ulps at the end of the range.
Similarly, for trigonometric functions, the relative error for 
arguments larger than some implementation-defined angle threshold
is not specified, though the angle threshold needs to be at
least radix**(mantissa bits / 2), so 2.0**12 for binary32 or 2.0**32
for binary80. OpenCL doesn't specify an angle threshold, but I
doubt they intend to require sufficiently accurate reduction over
the entire range to have a final error of 4 ulps: that doesn't fit
with the rest of the requirements.

The Ada test suite (ACATS) already has quite extensive tests for (2),
which are automatically parame

Re: weird optimization in sin+cos, x86 backend

2012-02-14 Thread Geert Bosch

On Feb 14, 2012, at 08:22, Vincent Lefevre wrote:
> Please do not use the term binary80, as it is confusing (and
> there is a difference between this format and the formats of
> the IEEE binary{k} class concerning the implicit bit).
Yes, I first wrote extended precision, though that really is
a general term that could denote many different formats.
I'll write Intel extended precision in the future. :)
>   
> IEEE 754 recommends correct rounding (which should not be much slower
> than a function accurate to a few ulp's, in average) in the full range.
> I think this should be the default. The best compromise between speed
> and accuracy depends on the application, and the compiler can't guess
> anyway.

While I'm sympathetic to that sentiment, the big issue is that we
don't have a correctly rounded math library supporting all formats
and suitable for all targets. We can't default to something we
don't have.

Right now we don't have a library either that conforms to C99
and meets the far more relaxed accuracy criteria of OpenCL and
Ada.

However, the glibc math library comes very close, and we can
surely fix any remaining issues there may be. So, if we can
use that as base, or as "fallback" library, we suddenly
achieve some minimal accuracy guarantees across a wide
range of platforms. If we can get this library with
GPL+exception, we can even generate optimized variants
and use a static library with LTO byte code allowing for
inlining etc.

Then we can collect/write code that improves on this libm.

  -Geert


Re: weird optimization in sin+cos, x86 backend

2012-02-14 Thread Geert Bosch

On Feb 14, 2012, at 11:44, Andrew Haley wrote:

> On 02/14/2012 04:41 PM, Geert Bosch wrote:
>> Right now we don't have a library either that conforms to C99
> 
> Are you sure?  As far as I know we do.  We might not meet
> C99 Annex F, but that's not required.
> 
>> and meets the far more relaxed accuracy criteria of OpenCL and
>> Ada.
Note the conjunctive "and" here. I was just replying to Vincent
that it doesn't make sense to default to correctly rounded math
yet, as we don't have such a thing.

I think it is feasible to integrate a libm meeting minimal
accuracy requirements, as well as variations that additionally
give much improved performance when non-default rounding modes,
trapping and errno setting are not needed. It still seems
like glibc's libm is the best candidate to use as a base.

  -Geert


Re: weird optimization in sin+cos, x86 backend

2012-02-14 Thread Geert Bosch

On Feb 13, 2012, at 09:59, Vincent Lefevre wrote:

> On 2012-02-09 15:49:37 +, Andrew Haley wrote:
>> I'd start with INRIA's crlibm.
> 
> I point I'd like to correct. GNU MPFR has mainly (> 95%) been
> developed by researchers and engineers paid by INRIA. But this
> is not the case of CRlibm. I don't know its copyright status
> (apparently, mainly ENS Lyon, and the rights have not been
> transferred to the FSF).
> 
> Also, from what I've heard, CRlibm is more or less regarded as
> dead, because there are new tools that do a better job, and new
> functions could be generated in a few hours. I suppose a member
> (not me since I don't work on these tools) or ex-member of our
> team will contact some of you about this.

Ideally, we would need to include both the generated functions
as well as identification of the tools used and source code or
scripts used with those tools.

  -Geert

