Re: help for arm avr bfin cris frv h8300 m68k mcore mmix pdp11 rs6000 sh vax

2009-03-16 Thread Martin Guy
On 3/14/09, Paolo Bonzini  wrote:
> Hans-Peter Nilsson wrote:
>  > The answer to the question is "no", but I'd guess the more
>  > useful answer is "yes", for different definitions of "truncate".
>
> Ok, after my patches you will be able to teach GCC about this definition
>  of truncate.

I expect it's a bit too extreme an example, but I've just found (to my
horror) that the MaverickCrunch FPU truncates all its shift counts to
6-bit signed (-32 (right) to +31 (left)), including on 64-bit integers,
which is not easy to generate good code for...
unless it happens to be easy to handle "shift count is truncated
to less than the word size" in your new framework.

M


Re: help for arm avr bfin cris frv h8300 m68k mcore mmix pdp11 rs6000 sh vax

2009-03-16 Thread Paolo Bonzini
Martin Guy wrote:
> On 3/14/09, Paolo Bonzini  wrote:
>> Hans-Peter Nilsson wrote:
>>  > The answer to the question is "no", but I'd guess the more
>>  > useful answer is "yes", for different definitions of "truncate".
>>
>> Ok, after my patches you will be able to teach GCC about this definition
>>  of truncate.
> 
> I expect it's a bit too extreme an example, but I've just found (to my
> horror) that the MaverickCrunch FPU truncates all its shift counts to
> 6-bit signed (-32 (right) to +31 (left)), including on 64-bit integers,
> which is not easy to generate good code for...
> unless it happens to be easy to handle "shift count is truncated
> to less than the word size" in your new framework.

Uhm, well, no. :-)

This could already be handled by faking a 63 bit truncation and using a
splitter to expand those into something like this (I only know integer
ARM assembly, so I'm making this up):

   AND R1, R0, #31
   MOV R2, R2, SHIFT R1
   ANDS R1, R0, #32
   MOVNE R2, R2, SHIFT #31
   MOVNE R2, R2, SHIFT #1

or

   ANDS R1, R0, #32
   MOVNE R2, R2, SHIFT #-32
   SUB R1, R1, R0  ; R1 = (x >= 32 ? 32 - x : -x)
   MOV R2, R2, SHIFT R1

(which requires a scratch register, so it cannot be done postreload...
this might be a problem)
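
For reference, here is a C model of what the first sequence computes (my
own sketch, assuming the operation being split is a 64-bit left shift by
a 0..63 count; the comments map back to the pseudo-assembly above):

#include <stdint.h>

/* Build a 64-bit left shift by COUNT in 0..63 out of shifts whose
   count fits the hardware's smaller range.  */
uint64_t
shift_left_with_short_counts (uint64_t x, unsigned count)
{
  uint64_t r = x << (count & 31);   /* AND R1, R0, #31; MOV R2, R2, SHIFT R1 */
  if (count & 32)                   /* ANDS R1, R0, #32 */
    r = (r << 31) << 1;             /* MOVNE ... #31; MOVNE ... #1 */
  return r;
}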

But my new stuff won't change anything.

Paolo


Re: Preprocessor for assembler macros?

2009-03-16 Thread Ph . Marek
Philipp Marek  marek.priv.at> writes:
> > gcc -S tmp.S for some reason prints to stdout, so gcc -S tmp.S > tmp.s
> > is what you need
> Thank you very much, I'll take a look.
I tried very hard to achieve that; and one time it seemed to work, but I cannot
make it work again.

As an example I'm trying to expand the macros in the linux kernel source file
   arch/x86/kernel/entry_64.S

I tried calling "gcc -S" with the various "-I.." paths set as needed, and I even
renamed my "as" to "as.bin" and tried to get the assembler source directly (by
using "gcc -S $COLLECT_GCC_OPTIONS sourcefile") ...

I cannot make it work again ...


Do you have some other hint for me?

Thank you very much.


Regards,

Phil




Re: -mfpmath=sse,387 is experimental ?

2009-03-16 Thread Zuxy Meng

Hi,

"Timothy Madden"  写入消息 
news:5078d8af0903120218i23b69a4bma28ad9b3f1bd4...@mail.gmail.com...

On Thu, Mar 12, 2009 at 1:15 AM, Jan Hubicka  wrote:

Timothy Madden wrote:
> Hello
>
> Is -mfpmath=both for i386 and x86-64 still experimental in gcc 4.3, as
> stated in the online manual page?

[...]


The fundamental problem here is that the backend lies to the compiler about
the fact that an FP operation cannot take one operand from SSE and the other
from x87.  This is something I want to look into once I have more time.  With
the new RA, perhaps we can drop all these fake constraints.


That would be great!
I am sure having twice the number of registers (SSE + 387) would make a
big difference.

Even if the SSE and FPU instruction sets cannot mix operands, using both
at the same time (each with its own registers) would be an improvement.

Until then I have a question: if I compile with -msse, would using
-mfpmath=387 help floating-point operations not steal SSE registers that
are already used by other CPU operations?  And would using -mfpmath=sse
make the FPU and CPU share the SSE registers and compete for them?

How would I know if my AMD Sempron 2200+ has separate execution units
for SSE and FPU instructions, with independent registers?


Most CPUs use the same FP unit for both x87 and SIMD operations, so it
wouldn't give you double the performance.  The only exception I know of
is the K6-2/3, whose x87 and 3DNow! units are separate.


--
Zuxy 





Re: GCC 4.4.0 Status Report (2009-03-13)

2009-03-16 Thread Paolo Bonzini
NightStrike wrote:
> On Fri, Mar 13, 2009 at 1:58 PM, Joseph S. Myers
>  wrote:
>> Given the SC request we need to stay in Stage 4 rather than trying to work
>> around it.
> 
> What if GCC went back to stage 3 until the issue is resolved, thus
> opening the door for a number of stage3-type patches that don't affect
> 1) licensing and 2) plugin frameworks, but are merely bug fixes which
> would have long been shaken out by now.

No, not at all.  The only benefit we're having from this is that GCC 4.4
should be quite stable already in GCC 4.4.0, let's not destroy this one too.

Paolo


Re: help for arm avr bfin cris frv h8300 m68k mcore mmix pdp11 rs6000 sh vax

2009-03-16 Thread Martin Guy
On 3/16/09, Paolo Bonzini  wrote:
>    AND R1, R0, #31
>    MOV R2, R2, SHIFT R1
>    ANDS R1, R0, #32
>    MOVNE R2, R2, SHIFT #31
>    MOVNE R2, R2, SHIFT #1
>
>  or
>
>    ANDS R1, R0, #32
>    MOVNE R2, R2, SHIFT #-32
>    SUB R1, R1, R0  ; R1 = (x >= 32 ? 32 - x : -x)
>    MOV R2, R2, SHIFT R1

Thanks for the tips. Yes, I was contemplating cooking up something
like that, hobbled by the fact that if you use Maverick instructions
conditionally you either have to put seven nops on either side of them or
risk death by astonishment.

M


Re: -mfpmath=sse,387 is experimental ?

2009-03-16 Thread Tim Prince
Zuxy Meng wrote:
> Hi,
> 
> "Timothy Madden"  写入消息
!
>> I am sure having twice the number of registers (sse+387) would make a
>> big difference.
You're not counting the rename registers, you're talking about 32-bit mode
only, and you're discounting the different mode of accessing the registers.

>>
>> How would I know if my AMD Sempron 2200+ has separate execution units
>> for SSE and
>> FPU instructions, with independent registers ?
> 
> Most CPUs use the same FP unit for both x87 and SIMD operations, so it
> wouldn't give you double the performance.  The only exception I know of
> is the K6-2/3, whose x87 and 3DNow! units are separate.
> 
-march=pentium-m observed the preference of those CPUs for mixing the
types of code.  This was due more to the limited issue rate for SSE
instructions than to the expanded number of registers in use.  You are
welcome to test it on your CPU; however, AMD CPUs were designed to perform
well with SSE alone, particularly in 64-bit mode.



RE: ARM compiler rewriting code to be longer and slower

2009-03-16 Thread Ramana Radhakrishnan
[Resent because of account funnies. Apologies to those who get this twice]

Hi,

> > This problem is reported every once in a while, all targets with small
> > load-immediate instructions suffer from this, especially since GCC 4.0
> > (i.e. since tree-ssa).  But it seems there is just not enough interest
> > in having it fixed somehow, or someone would have taken care of it by
> > now.
> >
> > I've summed up before how the problem _could_ be fixed, but I can't
> > find where.  So here we go again.
> >
> > This could be solved in CSE by extending the notion of "related
> > expressions" to constants that can be generated from other constants
> > by a shift. Alternatively, you could create a simple, separate pass
> > that applies CSE's "related expressions" thing in dominator tree walk.
>
> See http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00158.html for handling
> something similar when related expressions differ by a small additive
> constant.  I am planning to finish this and submit it for 4.5.

Wouldn't doing this in CSE only solve the problem within an extended basic
block and not necessarily across the program ? Surely you'd want to do it
globally or am I missing something very basic here ?

Ramana





Does gcc provide any function to build def-use chain in RTL form

2009-03-16 Thread villa gogh
Hi,
I'm trying to construct def-use chains after PASS_LEAF_REGS, because
the SSA form has been destroyed during the earlier passes.
I have found that gcc provides a way to build def-use chains in
PASS_REGRENAME, but it only covers defs and uses within a single
basic block.

So if I want the global def-use data for the whole function, do I
need to construct it myself?

Does gcc provide any function to build the def-use chain in RTL form?

thank you


Re: Does gcc provide any function to build def-use chain in RTL form

2009-03-16 Thread Paolo Bonzini
villa gogh wrote:
> Hi,
> I'm trying to construct def-use chains after PASS_LEAF_REGS, because
> the SSA form has been destroyed during the earlier passes.
> I have found that gcc provides a way to build def-use chains in
> PASS_REGRENAME, but it only covers defs and uses within a single
> basic block.

No, don't look at those.  Instead look at fwprop.c which uses use-def
chains -- DU chains are the same but they are computed with

  df_chain_add_problem (DF_DU_CHAIN);

instead of

  df_chain_add_problem (DF_UD_CHAIN);

before df_analyze.

fwprop accesses use-def chains by using DF_REF_CHAIN (use); def-use
chains are the same but the DF_REF_CHAIN macro is used with a def
argument instead.
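
As a concrete sketch (GCC 4.4-era df API; treat the exact setup as an
assumption and compare with what fwprop.c actually does), the two pieces
look roughly like this:

/* Ask for def->use chains instead of use->def chains.  */
static void
setup_du_chains (void)
{
  df_chain_add_problem (DF_DU_CHAIN);   /* DF_UD_CHAIN gives use->def.  */
  df_analyze ();
}

/* DEF is a definition reference taken from some insn's def records;
   each link in DF_REF_CHAIN (DEF) points at a use reached by it.  */
static void
walk_uses_of_def (struct df_ref *def)
{
  struct df_link *link;

  for (link = DF_REF_CHAIN (def); link; link = link->next)
    {
      rtx use_insn = DF_REF_INSN (link->ref);
      /* ... examine USE_INSN ... */
      (void) use_insn;
    }
}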

Paolo


Re: GCC 4.4.0 Status Report (2009-03-13)

2009-03-16 Thread Jack Howarth
What about allowing for more backports from the graphite
branch if this drags out for an extended period of time? In
particular, I am thinking of those changes on the graphite branch
that might reduce those cases where -fgraphite-identity
degrades the performance of the resulting code.
 Jack

On Mon, Mar 16, 2009 at 11:10:07AM +0100, Paolo Bonzini wrote:
> NightStrike wrote:
> > On Fri, Mar 13, 2009 at 1:58 PM, Joseph S. Myers
> >  wrote:
> >> Given the SC request we need to stay in Stage 4 rather than trying to work
> >> around it.
> > 
> > What if GCC went back to stage 3 until the issue is resolved, thus
> > opening the door for a number of stage3-type patches that don't affect
> > 1) licensing and 2) plugin frameworks, but are merely bug fixes which
> > would have long been shaken out by now.
> 
> No, not at all.  The only benefit we're having from this is that GCC 4.4
> should be quite stable already in GCC 4.4.0, let's not destroy this one too.
> 
> Paolo


Re: sign/zero extension of function arguments on x86-64

2009-03-16 Thread Rafael Espindola
I got mixed results with icc

for
--
short a;
void g(short);
void f(void)
{ g(a); }
--

it produces a movswl. For

---
void g(int);
void f(short a) {
 g(a);
}
--

it produces a  movswq.

For the original test
-
void g(short);
void f(short a) {
 g(a);
}
--

it avoids the extension.

Cheers,
-- 
Rafael Avila de Espindola

Google | Gordon House | Barrow Street | Dublin 4 | Ireland
Registered in Dublin, Ireland | Registration Number: 368047


Re: Typo or intended?

2009-03-16 Thread Andrew Haley
Bingfeng Mei wrote:

> I just updated our port to include the last 2-3 weeks of GCC
> development. I noticed a large number of test failures at -O1 that
> use a user-defined data type (based on a special register file of
> our processor). All variables of such a type are now spilled to memory,
> which we don't allow at -O1 because it is too expensive. After
> investigation, I found that it is the following new code that causes
> the trouble. I don't quite understand the function of the new code, but
> I don't see what's special about -O1 in terms of register allocation
> in comparison with higher optimization levels. If I change it to
> (optimize < 1), everything is fine as before. I started to wonder
> whether (optimize <= 1) is a typo or intended. Thanks in advance.

-O1 is supposed to allow debugging but still optimize, so it's quite
possible that Vlad did intend to do this.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39432

Andrew.



RE: ARM compiler rewriting code to be longer and slower

2009-03-16 Thread Adam Nemet
Ramana Radhakrishnan writes:
> [Resent because of account funnies. Apologies to those who get this twice]
> 
> Hi,
> 
> > > This problem is reported every once in a while, all targets with small
> > > load-immediate instructions suffer from this, especially since GCC 4.0
> > > (i.e. since tree-ssa).  But it seems there is just not enough interest
> > > in having it fixed somehow, or someone would have taken care of it by
> > > now.
> > >
> > > I've summed up before how the problem _could_ be fixed, but I can't
> > > find where.  So here we go again.
> > >
> > > This could be solved in CSE by extending the notion of "related
> > > expressions" to constants that can be generated from other constants
> > > by a shift. Alternatively, you could create a simple, separate pass
> > > that applies CSE's "related expressions" thing in dominator tree walk.
> >
> > See http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00158.html for handling
> > something similar when related expressions differ by a small additive
> > constant.  I am planning to finish this and submit it for 4.5.
> 
> Wouldn't doing this in CSE only solve the problem within an extended basic
> block and not necessarily across the program ? Surely you'd want to do it
> globally or am I missing something very basic here ?

No, you're not.  There are plans to move some of what's in CSE to a new LCM
(global) pass.  Also note that for a global pass you clearly need a more
sophisticated cost model for deciding when CSEing is beneficial.  On a
superscalar architecture, instructions synthesizing constants sometimes appear
to be "free", whereas holding a value in a register for an extended period of
time is not.

Adam


Typo or intended?

2009-03-16 Thread Bingfeng Mei
Hello,
I just updated our port to include the last 2-3 weeks of GCC development.
I noticed a large number of test failures at -O1 that use a user-defined
data type (based on a special register file of our processor). All
variables of such a type are now spilled to memory, which we don't allow
at -O1 because it is too expensive. After investigation, I found that it
is the following new code that causes the trouble. I don't quite
understand the function of the new code, but I don't see what's special
about -O1 in terms of register allocation in comparison with higher
optimization levels. If I change it to (optimize < 1), everything is fine
as before. I started to wonder whether (optimize <= 1) is a typo or
intended. Thanks in advance.

Cheers,
Bingfeng Mei
Broadcom UK

  if ((! flag_caller_saves && ALLOCNO_CALLS_CROSSED_NUM (a) != 0)
      /* For debugging purposes don't put user defined variables in
         callee-clobbered registers.  */
      || (optimize <= 1                        /* <- why include -O1? */
          && (attrs = REG_ATTRS (regno_reg_rtx [ALLOCNO_REGNO (a)])) != NULL
          && (decl = attrs->decl) != NULL
          && VAR_OR_FUNCTION_DECL_P (decl)
          && ! DECL_ARTIFICIAL (decl)))
    {
      IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
                        call_used_reg_set);
      IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
                        call_used_reg_set);
    }
  else if (ALLOCNO_CALLS_CROSSED_NUM (a) != 0)
    {
      IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
                        no_caller_save_reg_set);
      IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
                        temp_hard_reg_set);
      IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
                        no_caller_save_reg_set);
      IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
                        temp_hard_reg_set);
    }


Re: ARM compiler rewriting code to be longer and slower

2009-03-16 Thread Steven Bosscher
On Mon, Mar 16, 2009 at 2:52 PM, Ramana Radhakrishnan
 wrote:
> Wouldn't doing this in CSE only solve the problem within an extended basic
> block and not necessarily across the program ? Surely you'd want to do it
> globally or am I missing something very basic here ?

Why so serious^Wsurely?

I think doing this optimization over extended basic blocks would catch
90% of the cases.  The loop-carried form is covered by auto-increment
generation (and yes I know that pass also needs to be improved ;-)

Ciao!
Steven


Re: ARM compiler rewriting code to be longer and slower

2009-03-16 Thread Daniel Berlin
On Mon, Mar 16, 2009 at 12:11 PM, Adam Nemet  wrote:
> Ramana Radhakrishnan writes:
>> [Resent because of account funnies. Apologies to those who get this twice]
>>
>> Hi,
>>
>> > > This problem is reported every once in a while, all targets with small
>> > > load-immediate instructions suffer from this, especially since GCC 4.0
>> > > (i.e. since tree-ssa).  But it seems there is just not enough interest
>> > > in having it fixed somehow, or someone would have taken care of it by
>> > > now.
>> > >
>> > > I've summed up before how the problem _could_ be fixed, but I can't
>> > > find where.  So here we go again.
>> > >
>> > > This could be solved in CSE by extending the notion of "related
>> > > expressions" to constants that can be generated from other constants
>> > > by a shift. Alternatively, you could create a simple, separate pass
>> > > that applies CSE's "related expressions" thing in dominator tree walk.
>> >
>> > See http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00158.html for handling
>> > something similar when related expressions differ by a small additive
>> > constant.  I am planning to finish this and submit it for 4.5.
>>
>> Wouldn't doing this in CSE only solve the problem within an extended basic
>> block and not necessarily across the program ? Surely you'd want to do it
>> globally or am I missing something very basic here ?
>
> No, you're not.  There are plans to move some of what's in CSE to a new LCM
> (global) pass.  Also note that for a global pass you clearly need a more
> sophisticated cost model for deciding when CSEing is beneficial.  On a
> superscalar architecture, instructions synthesizing constants sometimes appear
> to be "free", whereas holding a value in a register for an extended period of
> time is not.
>

Right. You probably want something closer to Nigel Horspool's
"isothermal speculative PRE", which takes into account (using
heuristics and profiles) where the best place to put things is, based
on costs, instead of LCM, which uses a notion of "lifetime optimality".

See http://webhome.cs.uvic.ca/~nigelh/pubs.html for "Fast
Profile-Based Partial Redundancy Elimination"

There was a working implementation of this done for GCC 4.1 that used
profile info and execution counts.
If you are interested, and can hunt down David Pereira (he isn't at
UVic anymore, and I haven't talked to him since, so I don't have his
email), he'd probably give you the code :)


Re: Fwd: Mips, -fpie and TLS management

2009-03-16 Thread Joel Porquet
2009/3/12 Daniel Jacobowitz :
> On Thu, Mar 12, 2009 at 02:02:36PM +0100, Joel Porquet wrote:
>> > Check what symbol is at, or near, 0x40030000 + 22368.  It's probably
>> > the GOT plus a constant bias.
>>
>> It seems there is nothing at this address. Here is the program header:
>
> Don't know then.  Look at compiler-generated assembly instead of
> disassembly; that often helps.

Do you mean the object file produced by gcc before linkage?
If yes, the code looks like:

3c050000    lui     a1,0x0
40: R_MIPS_TLS_DTPREL_HI16  a

which will be computed later as

3c054003    lui     a1,0x4003

>> By the way, how did you test the TLS code for mips? I mean, uclibc
>> seems to be the most advanced libc for mips, and although this lib seems
>> to have the necessary code to manage TLS once it is "installed", the ldso
>> doesn't contain any code for handling TLS (relocations, TLS allocation,
>> etc.)...
>
> That statement about uclibc strikes me as bizarre.  I tested it with
> glibc, naturally.  GLIBC has a much more reliable TLS implementation
> than uclibc's in-progress one.

I just downloaded the glibc archive without noticing that the mips
port was in another archive... My mistake..

>> >> Last question, is there a difference between DSO and PIE objects other
>> >> than the INTERP entry in the program header?
>> >
>> > Yes.  Symbol preemption is allowed for DSOs but not for PIEs or normal
>> > executables.  That explains the different choice of model.
>>
>> But this is only a property, isn't it? I was meaning, how can you
>> differenciate them at loading time, when you "analyse" the elf file.
>
> You can't.
>
>> As you surely know, ELF_R_SYM() macro performs (val>>8) which gives
>> the symbol index in order to retrieve the name of the symbol. This
>> name then allows to look up the symbol. Unfortunately, in the case of
>> local-dynamic, ELF_R_SYM will return 0 which is not correct (the same
>> for global-dynamic will return 9): we can see by the way that readelf
>> is not able to get the symbol name. What do you think about this?
>
> This is a *module* relocation.  In local dynamic the module is always
> the current DSO; it does not need a symbol.

But what if the DSO accesses another module's TLS?

Finally, I noticed another problem. GCC does not seem to make room for the
4 argument slots specified in the ABI when calling __tls_get_addr.
For example, here is an extract of the calling code (we can see that
the data is stored directly at the top of the stack):

...
5ffe0bfc:   27bdfff0    addiu   sp,sp,-16
5ffe0c00:   afbf000c    sw      ra,12(sp)
5ffe0c04:   afbc0000    sw      gp,0(sp)
5ffe0c08:   afa40010    sw      a0,16(sp)
5ffe0c0c:   1000000d    b       5ffe0c44
5ffe0c10:   00000000    nop
5ffe0c14:   8f998030    lw      t9,-32720(gp)
5ffe0c18:   27848038    addiu   a0,gp,-32712
5ffe0c1c:   0320f809    jalr    t9
5ffe0c20:   00000000    nop
5ffe0c24:   8fbc0000    lw      gp,0(sp)
...

The "jalr t9" is the call to get_tls_addr whose code is:

...
5ffe0b40:   27bdffe8    addiu   sp,sp,-24
5ffe0b44:   afbc0000    sw      gp,0(sp)
5ffe0b48:   afa40018    sw      a0,24(sp)
5ffe0b4c:   7c03e83b    0x7c03e83b
...

We then notice that "sw a0,24(sp)" will erase $gp, which was saved at
the same place ("sw gp,0(sp)") by the caller.

Regards,

Joel


Re: Fwd: Mips, -fpie and TLS management

2009-03-16 Thread Daniel Jacobowitz
On Mon, Mar 16, 2009 at 06:19:01PM +0100, Joel Porquet wrote:
> 2009/3/12 Daniel Jacobowitz :
> > On Thu, Mar 12, 2009 at 02:02:36PM +0100, Joel Porquet wrote:
> >> > Check what symbol is at, or near, 0x40030000 + 22368.  It's probably
> >> > the GOT plus a constant bias.
> >>
> >> It seems there is nothing at this address. Here is the program header:
> >
> > Don't know then.  Look at compiler-generated assembly instead of
> > disassembly; that often helps.
> 
> Do you mean the object file produced by gcc before linkage?

That will do, but the actual assembly (-S) is more helpful sometimes.

> > This is a *module* relocation.  In local dynamic the module is always
> > the current DSO; it does not need a symbol.
> 
> But what if the DSO accesses another module's TLS?

Then it does not use "Local" Dynamic to do so.

> 
> Finally, I noticed another problem. GCC does not seem to make room for the
> 4 argument slots specified in the ABI when calling __tls_get_addr.
> For example, here is an extract of the calling code (we can see that
> the data is stored directly at the top of the stack):
> 
> ...
> 5ffe0bfc:   27bdfff0    addiu   sp,sp,-16
> 5ffe0c00:   afbf000c    sw      ra,12(sp)
> 5ffe0c04:   afbc0000    sw      gp,0(sp)

That line is bogus.  Figure out where it came from; the cprestore
offset should not be zero.

-- 
Daniel Jacobowitz
CodeSourcery


[Fwd: gomp - cost of threadprivate data access]

2009-03-16 Thread Toon Moene

[ Perhaps we need a somewhat larger audience for this one, as it isn't a
  gfortran specific issue (despite the COMMONs). ]

The reporter of this problem (perhaps it's necessary to open a bugzilla 
PR) uses:


It is GNU/linux on x86_64, fedora 10

kernel 2.6.27.12-170.2.5.fc10.x86_64
glibc-2.9-3.x86_64

--
Toon Moene - e-mail: t...@moene.org (*NEW*) - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.4/changes.html
--- Begin Message ---

Hello,

We have parallelized a relatively large f77 project (GEANT3, ~200k loc) using 
OpenMP.


Now we are running comparisons between the standard and parallel versions, and
it turns out that just making the commons threadprivate results in a 20%
speed penalty. This extra time is spent in the __tls_get_addr() function, which
seems to be called for every access to a threadprivate variable.


Would it in principle be possible to optimize this access?

I figure that the base address of all referenced commons could be obtained once
per function, thus drastically reducing the __tls_get_addr() call count.
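
In C terms, the suggested rewrite looks roughly like this (my own
illustration with invented names; the real code is Fortran threadprivate
COMMON blocks, not C):

/* Take the thread-local block's address once per function; the loop
   body then uses plain pointer accesses instead of repeated TLS
   address computations.  */
__thread struct { double buf[1000]; } work;

double
sum_work (int n)
{
  double *p = work.buf;        /* one TLS base-address computation */
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += p[i];                 /* ordinary loads through P */
  return s;
}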


We are using the gcc-4.3 branch from the beginning of February, with patches to
allow equivalence statements among threadprivate data.


Callgrind output of a sample run is available at:

-O2 
-O2 -g  

Best,
Matevz

--- End Message ---


Re: [Fwd: gomp - cost of threadprivate data access]

2009-03-16 Thread Steven Bosscher
On Mon, Mar 16, 2009 at 7:06 PM, Toon Moene  wrote:
> [ Perhaps we need a somewhat larger audience for this one, as it isn't a
>  gfortran specific issue (despite the COMMONs). ]
>
> The reporter of this problem (perhaps it's necessary to open a bugzilla PR)
> uses:
>
> It is GNU/linux on x86_64, fedora 10
>
> kernel 2.6.27.12-170.2.5.fc10.x86_64
> glibc-2.9-3.x86_64

The __tls_get_addr() calls should already be optimized if the proper
TLS model is used.
Do we have a test case?
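
For the record, "proper TLS model" means something along these lines (my
own illustration; initial-exec is only valid when the object ends up in
the executable or is loaded at program startup):

/* With a non-general-dynamic model there is no per-access
   __tls_get_addr() call; the variable is reached TP-relative.  */
__thread int counter __attribute__ ((tls_model ("initial-exec")));

int
bump (void)
{
  return ++counter;
}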

Ciao!
Steven


Re: improve -fverbose-asm option

2009-03-16 Thread Ian Lance Taylor
Eric Fisher  writes:

> I'd like to get more helpful information from the final .S file, such
> as basic block info, so that I can draw a cfg graph through a script.

The basic block information and the CFG graph are not reliable at that
point in the compilation.  Your patch will work reliably for some
targets and optimization levels but not for others.  The CFG information
is messed up by the machine dependent reorg pass and the delay slot
pass.  I would be worried about confusing people.


> Also, I think it would be better to generate one label for each basic
> block, and the local label should have the function name as a
> suffix, because some profiling tools, such as oprofile, output
> samples based on the labels. This would help us analyze the
> samples for each basic block. But the currently generated code has
> many local labels with the same name. Perhaps it's again
> -fverbose-asm that should enable this functionality. But where should I
> go if I want to implement this functionality?

The local labels used for blocks are normally discarded by the assembler
and thus are never seen by tools like oprofile.  Using named symbols for
basic blocks seems like a reasonable option if it will indeed give
better information from oprofile, but it should be an option separate
from -fverbose-asm.  The labels in RTL are CODE_LABEL insns, so you
would want to change the way that they are emitted in final_scan_insn.
The fact that there can be several CODE_LABELs in sequence doesn't seem
to matter too much, since only one will be picked up by profiling tools.
To be clear, I would want to see that you really do get better results
from profiling tools before accepting such a patch.

Ian


Re: Preprocessor for assembler macros?

2009-03-16 Thread Ian Lance Taylor
"Ph. Marek"  writes:

> Philipp Marek  marek.priv.at> writes:
>> > gcc -S tmp.S for some reason prints to stdout, so gcc -S tmp.S > tmp.s
>> > is what you need
>> Thank you very much, I'll take a look.
> I tried very hard to achieve that; and one time it seemed to work, but I 
> cannot
> make it work again.

I already asked you to take this question to a different mailing list,
and I already answered your question.

http://gcc.gnu.org/ml/gcc/2009-03/msg00187.html

Please take any followups to a different mailing list.

Ian


Difference between local/global/parameter array handling

2009-03-16 Thread Jean Christophe Beyler
Dear all,

I've been working on explaining to GCC the cost of loads/stores on my
target, and I ran into this problem. Consider the following code:

uint64_t sum = 0;
for(i=0; i

Re: Understand BLKmode and returning structure in register.

2009-03-16 Thread Richard Sandiford
"Bingfeng Mei"  writes:
> In the foo function, the compute_record_mode function will set the mode
> for struct COMPLEX to BLKmode, partly because STRICT_ALIGNMENT is 1 on my
> target. In the TARGET_RETURN_IN_MEMORY hook, I return 1 for BLKmode types
> and 0 otherwise for small sizes (<8) (like MIPS).  Thus, this structure
> is still returned through memory, which is not very efficient. More
> importantly, the ABI is NOT FIXED in such a situation. If an assembly
> code programmer writes a function returning a structure, how does he
> know whether the structure will be treated as BLKmode or not? So he
> doesn't know whether to pass the result through memory or a register. Do I
> understand correctly?

Yes.  I think having TARGET_RETURN_IN_MEMORY depend on internal details
like the RTL mode is often seen as an historical mistake.  As you say,
the ABI should be defined directly by the type instead.
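
A minimal sketch of what a type-based definition could look like (a
hypothetical target; the 8-byte cutoff is only an example, not the MIPS
rule):

/* Decide purely from the type's size, so the ABI does not depend on
   whatever mode compute_record_mode happened to pick.  */
static bool
example_return_in_memory (const_tree type, const_tree fntype ATTRIBUTE_UNUSED)
{
  HOST_WIDE_INT size = int_size_in_bytes (type);

  return size == -1 || size > 8;   /* variable-sized or wider than 8 bytes */
}

#undef TARGET_RETURN_IN_MEMORY
#define TARGET_RETURN_IN_MEMORY example_return_in_memory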

Unfortunately, once you start using a mode, it's difficult to stop
using a mode without breaking compatibility.  So one of the main reasons
the MIPS port still uses the mode is because no-one dares touch it.

Likewise, it's now difficult to change the mode attached to a structure
(which could potentially make structure accesses more efficient) without
accidentally breaking someone's ABI.

> On the other hand, if I return 0 based only on the struct type's size,
> regardless of BLKmode or not, GCC produces very inefficient
> code. For example, stack setup code in foo is still generated even when
> it is totally unnecessary.

Yeah, there's definitely room for improvement here.  And as you say,
it's already a problem for MIPS.  I think it's just one of those things
that doesn't occur often enough in critical code for anyone to have
spent time optimising it.

Richard


generic bug in fixed-point constant folding

2009-03-16 Thread Sean D'Epagnier
Hi,

I think I found a generic problem for fixed point constant folding.

In fold-const.c:11872 gcc tries to apply:
  /* Transform (x >> c) << c into x & (-1 << c), or
     transform (x << c) >> c into x & ((unsigned)-1 >> c)
     for unsigned types.  */

I attached a simple patch which fixes the problem by not applying this
optimization to fixed-point types.  I would like to have this
optimization because it is possible... but the problem is that fixed-point
types do not support bitwise operations like & | ^ ~, so without
supporting these somehow internally while not allowing the user to have
them, this can't take place.
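
For reference, here is the transform spelled out in plain C for an
ordinary unsigned type (my own illustration; the shift count 3 is
arbitrary):

#include <assert.h>

int
main (void)
{
  unsigned x = 0xdeadbeefu;

  /* (x >> c) << c clears the low c bits; (x << c) >> c clears the
     high c bits.  Neither rewrite can be expressed for fixed-point
     types, which have no bitwise AND.  */
  assert (((x >> 3) << 3) == (x & (-1u << 3)));
  assert (((x << 3) >> 3) == (x & (-1u >> 3)));
  return 0;
}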

I am open to other suggestions.  For future reference, should this be
posted as a bug report?  It seems simple enough that it could be
included right away... but I feel like if it's a bug report no one will
notice, since fixed-point support is not widely used.

Sean
Index: fold-const.c
===
--- fold-const.c	(revision 144210)
+++ fold-const.c	(working copy)
@@ -11877,7 +11877,8 @@ fold_binary (enum tree_code code, tree t
 	  && host_integerp (arg1, false)
 	  && TREE_INT_CST_LOW (arg1) < TYPE_PRECISION (type)
 	  && host_integerp (TREE_OPERAND (arg0, 1), false)
-	  && TREE_INT_CST_LOW (TREE_OPERAND (arg0, 1)) < TYPE_PRECISION (type))
+	  && TREE_INT_CST_LOW (TREE_OPERAND (arg0, 1)) < TYPE_PRECISION (type)
+	  && TREE_CODE (type) != FIXED_POINT_TYPE)
 	{
 	  HOST_WIDE_INT low0 = TREE_INT_CST_LOW (TREE_OPERAND (arg0, 1));
 	  HOST_WIDE_INT low1 = TREE_INT_CST_LOW (arg1);


Re: Constant folding and Constant propagation

2009-03-16 Thread Adam Nemet
Jean Christophe Beyler writes:
> I set up your patch and I get an internal error on this test program:

You're right.  I haven't handled the case properly when the constant itself
was an anchor constant (e.g. 0).  Try this version.

Adam


* cse.c (get_const_anchors): New function.
(insert_const_anchors): New function.
(cse_insn): Set src_related using anchor constants.  Insert
constant anchors into the table of available expressions.

* config/mips/mips.c (mips_rtx_costs): Make immediate-add even cheaper
than loading a simple constant into a register.

Index: gcc/cse.c
===
--- gcc.orig/cse.c  2009-03-08 12:16:56.0 -0700
+++ gcc/cse.c   2009-03-16 23:07:40.0 -0700
@@ -3961,6 +3961,55 @@ record_jump_cond (enum rtx_code code, en
 
   merge_equiv_classes (op0_elt, op1_elt);
 }
+
+#define TARGET_CONST_ANCHOR 0x8000
+
+/* Compute the upper and lower anchors for CST as base, offset pairs.  Return
+   NULL_RTX if CST is equal to an anchor.  */
+
+static rtx
+get_const_anchors (rtx cst, rtx *upper_base, HOST_WIDE_INT *upper_offs,
+  HOST_WIDE_INT *lower_offs)
+{
+  HOST_WIDE_INT n, upper, lower;
+
+  n = INTVAL (cst);
+  lower = n & ~(TARGET_CONST_ANCHOR - 1);
+  if (n == lower)
+return NULL_RTX;
+  upper = (n + (TARGET_CONST_ANCHOR - 1)) & ~(TARGET_CONST_ANCHOR - 1);
+
+  *upper_base = GEN_INT (upper);
+  *upper_offs = n - upper;
+  *lower_offs = n - lower;
+  return GEN_INT (lower);
+}
+
+/* Create equivalences between the two anchors of a constant value and the
+   corresponding register-offset expressions.  Use the register REG, which is
+   equivalent to the constant value CLASSP->exp.  */
+
+static void
+insert_const_anchors (rtx reg, struct table_elt *classp,
+ enum machine_mode mode)
+{
+  rtx lower_base, upper_base;
+  HOST_WIDE_INT lower_offs, upper_offs;
+  rtx lower_exp, upper_exp;
+  struct table_elt *celt;
+  rtx cst = classp->exp;
+
+  lower_base = get_const_anchors (cst, &upper_base, &upper_offs, &lower_offs);
+  if (!lower_base)
+return;
+  lower_exp = plus_constant (reg, -lower_offs);
+  upper_exp = plus_constant (reg, -upper_offs);
+
+  celt = insert (lower_base, NULL, HASH (lower_base, mode), mode);
+  insert (lower_exp, celt, HASH (lower_exp, mode), mode);
+  celt = insert (upper_base, NULL, HASH (upper_base, mode), mode);
+  insert (upper_exp, celt, HASH (upper_exp, mode), mode);
+}
 
 /* CSE processing for one instruction.
First simplify sources and addresses of all assignments
@@ -4595,6 +4644,67 @@ cse_insn (rtx insn)
}
 #endif /* LOAD_EXTEND_OP */
 
+  /* Try to express the constant using a register-offset expression using
+anchor constants.  */
+
+  if (!src_related && src_const && GET_CODE (src_const) == CONST_INT)
+   {
+ rtx lower_base, upper_base;
+ struct table_elt *lower_elt, *upper_elt, *elt;
+ HOST_WIDE_INT lower_offs, upper_offs, offs;
+
+ lower_base = get_const_anchors (src_const, &upper_base, &upper_offs,
+ &lower_offs);
+ if (lower_base)
+   {
+ lower_elt = lookup (lower_base, HASH (lower_base, mode), mode);
+ upper_elt = lookup (upper_base, HASH (upper_base, mode), mode);
+
+ /* Loop over LOWER_ELTs and UPPER_ELTs to find a reg-offset pair
+that we can use to express SRC_CONST.  */
+ elt = NULL;
+ if (lower_elt)
+   {
+ elt = lower_elt->first_same_value;
+ offs = lower_offs;
+   }
+ else if (upper_elt)
+   {
+ elt = upper_elt->first_same_value;
+ upper_elt = NULL;
+ offs = upper_offs;
+   }
+ while (elt)
+   {
+ if (REG_P (elt->exp)
+ || (GET_CODE (elt->exp) == PLUS
+ && REG_P (XEXP (elt->exp, 0))
+ && GET_CODE (XEXP (elt->exp, 1)) == CONST_INT))
+   {
+ rtx x = plus_constant (elt->exp, offs);
+ if (REG_P (x)
+ || (GET_CODE (x) == PLUS
+ && IN_RANGE (INTVAL (XEXP (x, 1)),
+  -TARGET_CONST_ANCHOR,
+  TARGET_CONST_ANCHOR - 1)))
+   {
+ src_related = x;
+ break;
+   }
+   }
+
+ if (!elt->next_same_value && upper_elt)
+   {
+ elt = upper_elt->first_same_value;
+ upper_elt = NULL;
+ offs = upper_offs;
+   }
+ else
+   elt = elt->next_same_value;
+