Re: LRA reloads of subregs

2015-09-04 Thread Segher Boessenkool
On Thu, Sep 03, 2015 at 11:19:43PM -0700, David Miller wrote:
> From: Segher Boessenkool 
> Date: Thu, 3 Sep 2015 20:26:51 -0500
> 
> > On Thu, Sep 03, 2015 at 03:33:56PM -0700, David Miller wrote:
> >> (insn 18631 1099 1100 14 (set (reg:SI 13423)
> >> (subreg:SI (mem/c:QI (plus:SI (reg/f:SI 101 %sfp)
> >> (const_int -14269 [0xc843])) [0 %sfp+-14269 S1 A8]) 0)) x.c:104 63 {*movsi_insn}
> >>  (expr_list:REG_DEAD (reg:QI 287)
> >> (nil)))
> > 
> >> I wonder why another target's LRA conversion hasn't hit this :-)
> > 
> > Maybe a stupid question but... why are you seeing subregs of mem at all?
> > Does sparc not have INSN_SCHEDULING defined?
> 
> The paradoxical subreg restriction in general_operand() is only
> enforced when reload_completed is true, which will not be the
> case while LRA is working.

This one?

#ifdef INSN_SCHEDULING
  /* On machines that have insn scheduling, we want all memory
 reference to be explicit, so outlaw paradoxical SUBREGs.
 However, we must allow them after reload so that they can
 get cleaned up by cleanup_subreg_operands.  */
  if (!reload_completed && MEM_P (sub)
  && GET_MODE_SIZE (mode) > GET_MODE_SIZE (GET_MODE (sub)))
return 0;
#endif

I think you misread that.  Also doc/rtl.texi makes pretty clear that
you really shouldn't see subregs of mem.  So where does it come from?


Segher


Re: LRA reloads of subregs

2015-09-04 Thread Vladimir Makarov
On 09/03/2015 06:33 PM, David Miller wrote:

> I'm working on converting sparc to LRA, and thanks probably to the
> work the powerpc folks did this is going much better than when I
> last tried this.

Thanks for working on this, David.

> The first major stumbling block I've run into is when LRA forces
> a reload for a SUBREG, and specifically there is a MEM involved
> that itself needs a reload due to having an invalid address.
>
> For example, simplify_operand_subreg() is working on this insn:
>
> (insn 18631 1099 1100 14 (set (reg:SI 13423)
>  (subreg:SI (mem/c:QI (plus:SI (reg/f:SI 101 %sfp)
>  (const_int -14269 [0xc843])) [0 %sfp+-14269 S1 A8]) 0)) x.c:104 63 {*movsi_insn}
>   (expr_list:REG_DEAD (reg:QI 287)
>  (nil)))
>
> lra_emit_move() (via insert_move_for_subreg()) is called (here, 'reg'
> is the MEM expression).
>
> Because the expression is a MEM, all of the special cased code in
> lra_emit_move() meant to avoid invalid displacements and indexes is
> not used, and it just performs a plain emit_move_insn().
>
> Calling emit_move_insn() does not work properly because it emits code
> which needs reloads, to handle the too large CONST_INT offset in the
> MEM expression.
>
> We abort because lra_process_new_insns() expects everything emitted
> by insert_move_for_subreg() to be recognizable, and with that too
> large offset it cannot.
>
> I wonder why another target's LRA conversion hasn't hit this :-)
I guess the insn should be forced to be recognizable when
lra_in_progress is TRUE.  In this case LRA can get insn operands and
transform them to be valid.


LRA porting frequently needs changing in constraints.md, .c, and 
.md files.

> Vlad, I wonder how you'd like this to be handled?  The code to handle
> this kind of situation is there in the process_address infrastructure.
I don't think we should add a new LRA code calling process_address 
before adding insns for further processing.  LRA just needs to get 
operands from insns to make them valid.  So again I'd try to make insn 
recognizable for LRA first and only if it does not work then think about 
other solutions in case when such change creates other problems (it is 
hard for me to predict LRA behaviour definitely just reading source 
files and not knowing sparc port well).


Re: incremental compiler project

2015-09-04 Thread Tom Tromey
Manuel> The overall goal of the project is worthwhile, however, it is unclear
Manuel> whether the approach envisioned in the wiki page will lead to the
Manuel> desired benefits. See http://tromey.com/blog/?p=420 which is the last
Manuel> status report that I am aware of.

Yeah.  I stopped working on that project when my manager at the time
asked me to work on gdb instead.

I think the goal of that project is still relevant, in that C++
compilation is still just too darn slow.  Projects today (e.g., firefox)
still do the "include the .cc files" trick to get a compilation
performance boost.

On the other hand, I'm not sure the incremental compiler is the way to
go.  It is a complicated approach.

Perhaps better would be to tackle things head on; that is, push harder
for modules in C and C++ and fix the problem at its root.

Tom


Re: incremental compiler project

2015-09-04 Thread David Kunsman
what do you think about the sub project in the wiki:

Parallel Compilation:

One approach is to make the front end multi-threaded. (I've pretty
much abandoned this idea. There are too many mutable tree fields,
making this a difficult project. Also, threads do not interact well
with fork, which is currently needed by the code generation approach.)

This will entail removing most global variables, marking some with
__thread, and wrapping a few with locks.

For the C front end, sharing will take place at the hunk level. The
parser will acquire a lock on a hunk's key before parsing (or
re-using) the hunk. Once the hunk has been registered the lock will be
released. This will allow reasonably fine-grained sharing without
possibility of deadlock.

This sub-project will also require updates to the GC. The current plan
is to have ggc_collect be a safe point; this works well with the C
parser as it collects in each iteration of the main loop. The C++
parser does not do this, and will need to be modified. Additionally,
the server main loop will either need not to hold GC-able data, or it
will need a way to inform the GC of its activity. (Note that the GC
work is completed on the branch. The C++ parser has not been modified
to periodically collect, however.)

?

On Fri, Sep 4, 2015 at 10:11 AM, Tom Tromey  wrote:
> Manuel> The overall goal of the project is worthwhile, however, it is unclear
> Manuel> whether the approach envisioned in the wiki page will lead to the
> Manuel> desired benefits. See http://tromey.com/blog/?p=420 which is the last
> Manuel> status report that I am aware of.
>
> Yeah.  I stopped working on that project when my manager at the time
> asked me to work on gdb instead.
>
> I think the goal of that project is still relevant, in that C++
> compilation is still just too darn slow.  Projects today (e.g., firefox)
> still do the "include the .cc files" trick to get a compilation
> performance boost.
>
> On the other hand, I'm not sure the incremental compiler is the way to
> go.  It is a complicated approach.
>
> Perhaps better would be to tackle things head on; that is, push harder
> for modules in C and C++ and fix the problem at its root.
>
> Tom


Re: incremental compiler project

2015-09-04 Thread Jeff Law
On 09/04/2015 09:40 AM, David Kunsman wrote:

> what do you think about the sub project in the wiki:
>
> Parallel Compilation:
>
> One approach is to make the front end multi-threaded. (I've pretty
> much abandoned this idea. There are too many mutable tree fields,
> making this a difficult project. Also, threads do not interact well
> with fork, which is currently needed by the code generation approach.)
You should get in contact with David Malcolm as these issues are 
directly related to his JIT work.


> This will entail removing most global variables, marking some with
> __thread, and wrapping a few with locks.
Yes, but that's work that is already in progress.  Right now David's got 
a big log and context switch in place, but we really want to drive down 
the amount of stuff in that context switch.



Jeff



Re: incremental compiler project

2015-09-04 Thread Manuel López-Ibáñez
On 4 September 2015 at 17:11, Tom Tromey  wrote:
> Manuel> The overall goal of the project is worthwhile, however, it is unclear
> Manuel> whether the approach envisioned in the wiki page will lead to the
> Manuel> desired benefits. See http://tromey.com/blog/?p=420 which is the last
> Manuel> status report that I am aware of.
>
> Yeah.  I stopped working on that project when my manager at the time
> asked me to work on gdb instead.
>
> I think the goal of that project is still relevant, in that C++
> compilation is still just too darn slow.  Projects today (e.g., firefox)
> still do the "include the .cc files" trick to get a compilation
> performance boost.

But we don't even know why it is so slow, no? (Or perhaps it is
known: https://gcc.gnu.org/wiki/Speedup_areas#C.2B-.2B-_improvements
but no one has decided to fix them.)

Clang++ is much faster yet it is doing more and tracking more data
than cc1plus. Thus, there have to be things that can be optimized in
the C++ parser. For example, we know that by-passing the textual
assembler representation has to speed up compilation
(https://gcc.gnu.org/ml/gcc-patches/2015-06/msg00188.html), especially
for large programs (simply printing it more efficiently already leads
to measurable speed-ups:
https://gcc.gnu.org/ml/gcc/2011-06/msg00156.html).

> On the other hand, I'm not sure the incremental compiler is the way to
> go.  It is a complicated approach.

Perhaps libcc1 could be re-purposed for this? It allows inserting code
into an already existing binary. Perhaps it could allow replacing code
from it? I have only a nebulous idea of how libcc1 works, maybe this
does not make any sense.

> Perhaps better would be to tackle things head on; that is, push harder
> for modules in C and C++ and fix the problem at its root.

Probably yes. Unfortunately, I don't know of any plans to implement
this in any form (much less for C).

Cheers,

Manuel.


Re: incremental compiler project

2015-09-04 Thread Jonathan Wakely
On 4 September 2015 at 16:57, Manuel López-Ibáñez wrote:
> Clang++ is much faster yet it is doing more and tracking more data
> than cc1plus.

How much faster these days? In my experience for optimized builds of
large files the difference is not so impressive (for unoptimized
builds clang is definitely much faster).


Re: Offer of help with move to git

2015-09-04 Thread Joseph Myers
On Thu, 27 Aug 2015, Jason Merrill wrote:

> Unfortunately, it looks like reposurgeon doesn't deal with gcc SVN's
> subdirectory branches any better than git-svn.  It does give a diagnostic
> about them:
> 
> reposurgeon: branch links detected by file ops only: branches/suse/
> branches/apple/ branches/st/ branches/gcj/ branches/csl/ branches/google/
> branches/linaro/ branches/redhat/ branches/ARM/ tags/ix86/ branches/ubuntu/
> branches/ix86/
> 
> though this is an incomplete list.  There are also branches/ibm,
> branches/dead, tags/apple, tags/redhat, tags/csl, and tags/ubuntu.

I agree with that list as being a list of subdirectories of branches/ and 
tags/ that are not themselves branches or tags, but rather containers for 
branches and tags.

branches/st is more complicated than simply being a container for 
subdirectory branches.  It has a README file, five cli* subdirectories 
that look like branches of GCC, two subdirectories binutils/ and 
mono-based-binutils/ that are essentially separate projects (this is not 
of course any problem for git - having a branch sharing no ancestry with 
other branches is absolutely fine), and a subdirectory tags that contains 
tags of those various branches (I think).  So you want to say: 
branches/st/tags/* are tags; branches/st/* (subdirectories other than 
tags/) are branches; branches/st/README I don't know how you should handle 
(I suppose it could be a branch on its own, that just contains a README 
file, with commits affecting both README and other branches being split 
into separate commits to each piece; it is something that's still 
meaningful after the conversion and that ought to end up in the converted 
repository in some form).

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: incremental compiler project

2015-09-04 Thread Jeff Law
On 09/04/2015 10:14 AM, Jonathan Wakely wrote:

> On 4 September 2015 at 16:57, Manuel López-Ibáñez wrote:
>> Clang++ is much faster yet it is doing more and tracking more data
>> than cc1plus.
>
> How much faster these days? In my experience for optimized builds of
> large files the difference is not so impressive (for unoptimized
> builds clang is definitely much faster).
Which would generally indicate that the front-end and mandatory parts of
the middle/backend are slow for GCC (relative to clang/llvm), but the
optimizers in GCC are faster.


That wouldn't be a huge surprise given how much time has been spent
trying to keep the optimizers fast.


jeff


Re: incremental compiler project

2015-09-04 Thread Manuel López-Ibáñez
On 4 September 2015 at 17:44, Jeff Law  wrote:
> On 09/04/2015 09:40 AM, David Kunsman wrote:
>>
>> what do you think about the sub project in the wiki:
>>
>> Parallel Compilation:
>>
>> One approach is to make the front end multi-threaded. (I've pretty
>> much abandoned this idea. There are too many mutable tree fields,
>> making this a difficult project. Also, threads do not interact well
>> with fork, which is currently needed by the code generation approach.)
>
> You should get in contact with David Malcolm as these issues are directly
> related to his JIT work.

See https://gcc.gnu.org/wiki/JIT

If I remember correctly from past discussions, making the C/C++ FEs
multi-threaded is not really desired. People already compile multiple
files in parallel using 'make -j' and it is expected that many
bottlenecks would require sequential execution anyway. Adding locks
within the FEs may make them slower rather than faster.

However, making the FEs thread-safer would be great for JIT and for
converting them into reusable libraries (or if someone comes up with a
feasible GCC server design).

>> This will entail removing most global variables, marking some with
>> __thread, and wrapping a few with locks.
>
> Yes, but that's work that is already in progress.  Right now David's got a
> big log and context switch in place, but we really want to drive down the
> amount of stuff in that context switch.

Removing global variables would already be a big help. For example,
most flag_* global variables (example: flag_no_line_commands) can be
encoded in struct gcc_options and could be removed. This would be a
small project to start with, since one can do the conversion one at a
time.

Cheers,

Manuel.


Fwd: incremental compiler project

2015-09-04 Thread David Kunsman
I was just looking to get into a project... and the incremental project
caught my eye... wondering if it is even practical, given that the branch
is over 6 years old, and merging everything with the current trunk would
be a job.  It seems like many of the projects on the wiki are out of
date.  Does anybody know a current project that needs major help?  I
would be more than happy to work on it.

On Fri, Sep 4, 2015 at 11:37 AM, David Kunsman  wrote:
> I was just looking to get into a project... and the incremental project
> caught my eye... wondering if it is even practical, given that the branch
> is over 6 years old, and merging everything with the current trunk would
> be a job.  It seems like many of the projects on the wiki are out of
> date.  Does anybody know a current project that needs major help?  I
> would be more than happy to work on it.
>
> On Fri, Sep 4, 2015 at 11:31 AM, Manuel López-Ibáñez
>  wrote:
>> On 4 September 2015 at 17:44, Jeff Law  wrote:
>>> On 09/04/2015 09:40 AM, David Kunsman wrote:

 what do you think about the sub project in the wiki:

 Parallel Compilation:

 One approach is to make the front end multi-threaded. (I've pretty
 much abandoned this idea. There are too many mutable tree fields,
 making this a difficult project. Also, threads do not interact well
 with fork, which is currently needed by the code generation approach.)
>>>
>>> You should get in contact with David Malcolm as these issues are directly
>>> related to his JIT work.
>>
>> See https://gcc.gnu.org/wiki/JIT
>>
>> If I remember correctly from past discussions, making the C/C++ FEs
>> multi-threaded is not really desired. People already compile multiple
>> files in parallel using 'make -j' and it is expected that many
>> bottlenecks would require sequential execution anyway. Adding locks
>> within the FEs may make them slower rather than faster.
>>
>> However, making the FEs thread-safer would be great for JIT and for
>> converting them into reusable libraries (or if someone comes up with a
>> feasible GCC server design).
>>
 This will entail removing most global variables, marking some with
 __thread, and wrapping a few with locks.
>>>
>>> Yes, but that's work that is already in progress.  Right now David's got a
>>> big log and context switch in place, but we really want to drive down the
>>> amount of stuff in that context switch.
>>
>> Removing global variables would already be a big help. For example,
>> most flag_* global variables (example: flag_no_line_commands) can be
>> encoded in struct gcc_options and could be removed. This would be a
>> small project to start with, since one can do the conversion one at a
>> time.
>>
>> Cheers,
>>
>> Manuel.


Re: incremental compiler project

2015-09-04 Thread David Malcolm
On Fri, 2015-09-04 at 09:44 -0600, Jeff Law wrote:
> On 09/04/2015 09:40 AM, David Kunsman wrote:
> > what do you think about the sub project in the wiki:
> >
> > Parallel Compilation:
> >
> > One approach is to make the front end multi-threaded. (I've pretty
> > much abandoned this idea. There are too many mutable tree fields,
> > making this a difficult project. Also, threads do not interact well
> > with fork, which is currently needed by the code generation approach.)
> You should get in contact with David Malcolm as these issues are 
> directly related to his JIT work.
> >
> > This will entail removing most global variables, marking some with
> > __thread, and wrapping a few with locks.
> Yes, but that's work that is already in progress.  Right now David's got 
> a big log and context switch in place, but we really want to drive down 
^^^
   "lock"
> the amount of stuff in that context switch.

FWIW, grep for "ACQUIRE MUTEX" and "RELEASE MUTEX" within:
https://gcc.gnu.org/onlinedocs/jit/internals/index.html#overview-of-code-structure

I probably should better document what state is guarded by jit_mutex:
basically it's anything within libbackend.a, including anything that
interacts with GTY/ggc.  (in fact, it's basically every source file,
apart from libgccjit.c, jit-recording.c, and parts of jit-playback.c).

You may or may not want to read this doc I wrote in 2013:
  https://dmalcolm.fedorapeople.org/gcc/global-state/
(Re-reading it now, it's very out-of-date and I no longer agree with
much of what I wrote in that doc, so I don't know if it's useful).




Updating @gcc.gnu.org email forwarding

2015-09-04 Thread Henderson, Stuart
Hi,
I'm looking to update the forwarding address for my @gcc.gnu.org email address, 
but appear to have lost (if I ever had) my private key.  Could someone point me 
in the right direction for fixing this?
Thanks,
Stu


Re: Live range Analysis based on tree representations

2015-09-04 Thread Aaron Sawdey
On Thu, 2015-09-03 at 15:22 +, Ajit Kumar Agarwal wrote:
> 
> 
> -Original Message-
> From: Aaron Sawdey [mailto:acsaw...@linux.vnet.ibm.com] 
> Sent: Wednesday, September 02, 2015 8:23 PM
> To: Ajit Kumar Agarwal
> Cc: Jeff Law; vmaka...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod 
> Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: Live range Analysis based on tree representations
> 
> On Tue, 2015-09-01 at 17:56 +, Ajit Kumar Agarwal wrote:
> > All:
> > 
> > The Live ranges info on tree SSA representation is important step towards 
> > the SSA based code motion optimizations.
> > As the code motion optimization based on the SSA representation 
> > effects the register pressure and reasons for performance Bottleneck.
> > 
> > I am proposing the Live range Analysis based on the SSA 
> > representation. The Live range analysis traverses the dominator Tree. 
> > The SSA and phi variables are represented based on dominance frontier info 
> > and the SSA representation reflects The dominance info. Based on such 
> > dominance info Live range Overlapping Analysis can be derived.
> > 
> > Variable V intersects W if Vdef dominates the Wdef. The variable v 
> > intersects at point p if Vdef dominates P and Wdef Dominates the P. If 
> > Vdef dominates Wdef and Wdef dominates Udef , then the Vdef dominates 
> > Udef and thus Live range Of V intersect W and live range W intersect 
> > U, thus the live range V intersects the U. Such dominance info can be 
> > used to Represent the Overlapping Live range Analysis and the register 
> > pressure is derived from Overlapping Live ranges based On the dominator 
> > info inherited from the SSA representation. The SSA representation is 
> > derived based on dominance Frontier and the traversal of dominator tree 
> > based on SSA can derive the Overlapping Live ranges.
> > 
> > The above Overlapping Live range info can be used to derive the 
> > register pressure and the optimization based out of tree Representation can 
> > use the above overlapping live ranges to take register pressure into 
> > account.
> 
> >> Ajit,
> >> I did a prototype of this kind of analysis at one point last year to see
> >> if it could help improve inlining decisions in LTO. Basically I did exactly
> >> what you suggest and computed the number of overlapping SSA live ranges and
> >> used that as a proxy for register pressure. It did appear to be able to
> >> help in some circumstances but the real solution is to improve register
> >> allocation so it doesn't fall down when register pressure gets high.
> 
> Aaron:
> Would you mind explaining in what circumstances it helps and when it
> won't?  The live ranges on the SSA representation form chordal graphs
> that might have different colorability requirements than the real
> register allocator.  This may not give the exact register pressure seen
> by the register allocator, which is further down the optimization and
> code generation pipeline, but it forms the basis of SSA-based
> optimizations that affect register pressure.
>  
> >>The code is in a branch called lto-pressure.
> 
> Thanks. I would like to see the code.

Ajit,
  The branch is here: svn://gcc.gnu.org/svn/gcc/branches/lto-pressure
The analysis is in ipa-inline.c; if you search for "pressure" you'll
find the code.

The live ranges in SSA are certainly different than what the register
allocator is going to see, but it's what we have to work with at the
point where the inlining decisions are made, which is why I was looking
at it. My hope was that it would be a reasonable proxy for downstream
register pressure. 

I went and did this after seeing a particular situation in bzip2 where a
bunch of inlining done by LTO sent register pressure up a lot and
resulted in a measurable increase in loads/stores due to extra spill
code. Functions that are called in only one place and not externally
visible will be inlined regardless of their size. There is one function
in bzip2 that has particularly complex control and inlining this into
another non-trivial function caused all the excess spill code.

Setting limits on "inline functions called only once" did work, i.e.
inhibited the particular inlining that I wanted to eliminate; it just
didn't help enough to justify the complexity of the extra analysis. So I
went a step further and put the caller and callee pressure in as a term
in the edge badness used to prioritize inlining of small functions. This
seemed to help in some cases and hurt others, probably because SSA
pressure doesn't accurately represent the kind of register pressure that
causes problems later.

Looking at the code a year+ later I see a bug already in how I
implemented the callee/caller pressure limit, but I tested it without
that enabled so that wasn't all of the problem.

I abandoned the work because it seemed to me that the community favored
fixing these issues in register allocation.

  Aaron


Re: LRA reloads of subregs

2015-09-04 Thread David Miller
From: Vladimir Makarov 
Date: Fri, 4 Sep 2015 10:00:54 -0400

> LRA porting frequently needs changing in constraints.md, .c,
> and .md files.

I did make such changes, trust me :-)

First obstacle was that, unlike reload, LRA is very strict about
register constraints.  If a constraint doesn't evaluate to a
register class, LRA refuses to consider it a register.

So we had this ugly thing:

(define_constraint "U"
 "Pseudo-register or hard even-numbered integer register"
 (and (match_test "TARGET_ARCH32")
  (match_code "reg")
  (ior (match_test "REGNO (op) < FIRST_PSEUDO_REGISTER")
   (not (match_test "reload_in_progress && reg_renumber [REGNO (op)] < 0")))
  (match_test "register_ok_for_ldd (op)")))

A few years ago I tried the simple thing, changing this to a plain
GENERAL_REGS register constraint and hoping that HARD_REGNO_OK would
properly enforce the even register number requirement.  Back then it
didn't work but now it appears to work properly.

I've included the full patch I am working with below in case you are
curious.

> I don't think we should add a new LRA code calling process_address
> before adding insns for further processing.  LRA just needs to get
> operands from insns to make them valid.  So again I'd try to make insn
> recognizable for LRA first and only if it does not work then think
> about other solutions in case when such change creates other problems
> (it is hard for me to predict LRA behaviour definitely just reading
> source files and not knowing sparc port well).

If LRA is prepared to do a blind emit_move_insn() on an arbitrary MEM,
before it even validates the displacements in such a MEM as needing
reloads or not, it has to do something to accommodate this situation.

If LRA had not done the special SUBREG processing on this insn, it
indeed would have fixed up the invalid displacement using a reload.

Anyways, here is the patch I am working with.

diff --git a/gcc/config/sparc/constraints.md b/gcc/config/sparc/constraints.md
index e12efa1..7a18879 100644
--- a/gcc/config/sparc/constraints.md
+++ b/gcc/config/sparc/constraints.md
@@ -44,6 +44,8 @@
(define_register_constraint "h" "(TARGET_V9 && TARGET_V8PLUS ? I64_REGS : NO_REGS)"
  "64-bit global or out register in V8+ mode")
 
+(define_register_constraint "U" "(TARGET_ARCH32 ? GENERAL_REGS : NO_REGS)")
+
 ;; Floating-point constant constraints
 
 (define_constraint "G"
@@ -135,51 +137,6 @@
   (match_code "mem")
   (match_test "memory_ok_for_ldd (op)")))
 
-;; This awkward register constraint is necessary because it is not
-;; possible to express the "must be even numbered register" condition
-;; using register classes.  The problem is that membership in a
-;; register class requires that all registers of a multi-regno
-;; register be included in the set.  It is add_to_hard_reg_set
-;; and in_hard_reg_set_p which populate and test regsets with these
-;; semantics.
-;;
-;; So this means that we would have to put both the even and odd
-;; register into the register class, which would not restrict things
-;; at all.
-;;
-;; Using a combination of GENERAL_REGS and HARD_REGNO_MODE_OK is not a
-;; full solution either.  In fact, even though IRA uses the macro
-;; HARD_REGNO_MODE_OK to calculate which registers are prohibited from
-;; use in certain modes, it still can allocate an odd hard register
-;; for DImode values.  This is due to how IRA populates the table
-;; ira_useful_class_mode_regs[][].  It suffers from the same problem
-;; as using a register class to describe this restriction.  Namely, it
-;; sets both the odd and even part of an even register pair in the
-;; regset.  Therefore IRA can and will allocate odd registers for
-;; DImode values on 32-bit.
-;;
-;; There are legitimate cases where DImode values can end up in odd
-;; hard registers, the most notable example is argument passing.
-;;
-;; What saves us is reload and the DImode splitters.  Both are
-;; necessary.  The odd register splitters cannot match if, for
-;; example, we have a non-offsetable MEM.  Reload will notice this
-;; case and reload the address into a single hard register.
-;;
-;; The real downfall of this awkward register constraint is that it does
-;; not evaluate to a true register class like a bonafide use of
-;; define_register_constraint would.  This currently means that we cannot
-;; use LRA on Sparc, since the constraint processing of LRA really depends
-;; upon whether an extra constraint is for registers or not.  It uses
-;; reg_class_for_constraint, and checks it against NO_REGS.
-(define_constraint "U"
- "Pseudo-register or hard even-numbered integer register"
- (and (match_test "TARGET_ARCH32")
-  (match_code "reg")
-  (ior (match_test "REGNO (op) < FIRST_PSEUDO_REGISTER")
-  (not (match_test "reload_in_progress && reg_renumber [REGNO (op)] < 0")))
-  (match_test "register_ok_for_ldd (op)")))
-
 ;; Equivalent to 'T' but available in 64-bit mode
 (define_memory_constraint "W"
  "Memory refere

Re: LRA reloads of subregs

2015-09-04 Thread David Miller
From: Segher Boessenkool 
Date: Fri, 4 Sep 2015 06:46:04 -0500

> On Thu, Sep 03, 2015 at 11:19:43PM -0700, David Miller wrote:
>> The paradoxical subreg restriction in general_operand() is only
>> enforced when reload_completed is true, which will not be the
>> case while LRA is working.
> 
> This one?
> 
> #ifdef INSN_SCHEDULING
>   /* On machines that have insn scheduling, we want all memory
>reference to be explicit, so outlaw paradoxical SUBREGs.
>However, we must allow them after reload so that they can
>get cleaned up by cleanup_subreg_operands.  */
>   if (!reload_completed && MEM_P (sub)
> && GET_MODE_SIZE (mode) > GET_MODE_SIZE (GET_MODE (sub)))
>   return 0;
> #endif
> 
> I think you misread that.  Also doc/rtl.texi makes pretty clear that
> you really shouldn't see subregs of mem.  So where does it come from?

I see what you are saying, I'll take a look into this.

Thanks.


Re: LRA reloads of subregs

2015-09-04 Thread David Miller
From: David Miller 
Date: Fri, 04 Sep 2015 11:30:26 -0700 (PDT)

> From: Segher Boessenkool 
> Date: Fri, 4 Sep 2015 06:46:04 -0500
> 
>> On Thu, Sep 03, 2015 at 11:19:43PM -0700, David Miller wrote:
>>> The paradoxical subreg restriction in general_operand() is only
>>> enforced when reload_completed is true, which will not be the
>>> case while LRA is working.
>> 
>> This one?
>> 
>> #ifdef INSN_SCHEDULING
>>   /* On machines that have insn scheduling, we want all memory
>>   reference to be explicit, so outlaw paradoxical SUBREGs.
>>   However, we must allow them after reload so that they can
>>   get cleaned up by cleanup_subreg_operands.  */
>>   if (!reload_completed && MEM_P (sub)
>>&& GET_MODE_SIZE (mode) > GET_MODE_SIZE (GET_MODE (sub)))
>>  return 0;
>> #endif
>> 
>> I think you misread that.  Also doc/rtl.texi makes pretty clear that
>> you really shouldn't see subregs of mem.  So where does it come from?
> 
> I see what you are saying, I'll take a look into this.

It looks like it is created in LRA itself.  Initially, LRA is looking
at:

(insn 1100 1099 1101 14 (set (reg:SI 3376)
(ior:SI (subreg:SI (reg:QI 287) 0)
(subreg:SI (reg:QI 289) 0))) x.c:104 234 {iorsi3}
 (expr_list:REG_DEAD (reg:QI 289)
(expr_list:REG_DEAD (reg:QI 287)
(nil))))

in curr_insn_transform(), and emits the move:

(set (reg:SI 13423) (subreg:SI (reg:QI 287) 0))

Later the reg inside of this subreg appears to get transformed into
an on-stack MEM.

(insn 18631 1099 1100 14 (set (reg:SI 13423)
(subreg:SI (mem/c:QI (plus:SI (reg/f:SI 101 %sfp)
(const_int -14269 [0xc843])) [0 %sfp+-14269 S1 A8]) 0)) x.c:104 63 {*movsi_insn}
 (expr_list:REG_DEAD (reg:QI 287)
(nil)))

I suppose perhaps I need to make the input_operand predicate more
strict on sparc.  So I'll look into that now.



Re: LRA reloads of subregs

2015-09-04 Thread David Miller
From: David Miller 
Date: Fri, 04 Sep 2015 11:27:31 -0700 (PDT)

> From: Vladimir Makarov 
> Date: Fri, 4 Sep 2015 10:00:54 -0400
> 
>> I don't think we should add new LRA code that calls process_address
>> before adding insns for further processing.  LRA just needs to get
>> operands from insns to make them valid.  So again, I'd first try to
>> make the insn recognizable for LRA, and only if that does not work
>> think about other solutions, in case such a change creates other
>> problems (it is hard for me to predict LRA behaviour definitely just
>> by reading source files without knowing the sparc port well).

I've taken some time to see exactly what is going on here; perhaps
you can give me some guidance, as I'm quite happy to implement anything
:-)

We start with:

(insn 1100 1099 1101 14 (set (reg:SI 3376)
(ior:SI (subreg:SI (reg:QI 287) 0)
(subreg:SI (reg:QI 289) 0))) x.c:104 234 {iorsi3}
 (expr_list:REG_DEAD (reg:QI 289)
(expr_list:REG_DEAD (reg:QI 287)
(nil))))

LRA emits, in curr_insn_transform():

(set (reg:SI 13423) (subreg:SI (reg:QI 287) 0))

LRA then spills the subreg onto the stack, which gives us:

(insn 18631 1099 1100 14 (set (reg:SI 13423)
(subreg:SI (mem/c:QI (plus:SI (reg/f:SI 101 %sfp)
(const_int -14269 [0xc843])) [0 %sfp+-14269 S1 A8]) 0)) x.c:104 63 {*movsi_insn}
 (expr_list:REG_DEAD (reg:QI 287)
(nil)))

And this is where we run into trouble in simplify_operand_subreg(),
which seems to force reloads for all SUBREGs of MEM.

Normally, if there were no SUBREG here, LRA would run
process_address() over the MEMs in this instruction and all would be
well.

It is also the case that I cannot do anything special in the SPARC
move emitter to handle this, as address validation is disabled when
lra_in_progress is true.
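For context on why the spilled MEM's address needs a reload at all: the
displacement field in a sparc reg+offset memory address is a 13-bit
signed immediate (simm13, range -4096..4095), and the frame offset
-14269 in insn 18631 does not fit.  A hypothetical helper mirroring the
port's SMALL_INT-style range check (the name is illustrative, not the
real sparc.c code):

```c
#include <assert.h>

/* Hypothetical helper mirroring sparc's SMALL_INT-style test: a
   reg+offset address is only valid when the offset fits in the
   13-bit signed immediate field of load/store instructions.  */
static int
simm13_p (long disp)
{
  return disp >= -4096 && disp <= 4095;
}
```

The offset above fails this check (simm13_p (-14269) is false), so the
address has to be reloaded into a register before the MEM can be
accessed, which is precisely the work process_address would do.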


Re: LRA reloads of subregs

2015-09-04 Thread Vladimir Makarov
On 09/04/2015 09:02 PM, David Miller wrote:

From: David Miller 
Date: Fri, 04 Sep 2015 11:27:31 -0700 (PDT)


From: Vladimir Makarov 
Date: Fri, 4 Sep 2015 10:00:54 -0400


I don't think we should add new LRA code that calls process_address
before adding insns for further processing.  LRA just needs to get
operands from insns to make them valid.  So again, I'd first try to
make the insn recognizable for LRA, and only if that does not work
think about other solutions, in case such a change creates other
problems (it is hard for me to predict LRA behaviour definitely just
by reading source files without knowing the sparc port well).

I've taken some time to see exactly what is going on here; perhaps
you can give me some guidance, as I'm quite happy to implement anything
:-)


OK, if modifying constraint/insn definitions does not work, then you
could try to split the first loop in
lra-constraints.c::curr_insn_transform into two loops: one that
processes subregs and another with the rest of the original loop code.
Then you need to put the new subreg-processing loop after the loop
that processes addresses.  I think it will work.  Of course, the
change will need a lot of testing on other platforms (at least on
x86/x86-64).

We start with:

(insn 1100 1099 1101 14 (set (reg:SI 3376)
 (ior:SI (subreg:SI (reg:QI 287) 0)
 (subreg:SI (reg:QI 289) 0))) x.c:104 234 {iorsi3}
  (expr_list:REG_DEAD (reg:QI 289)
 (expr_list:REG_DEAD (reg:QI 287)
(nil))))

LRA emits, in curr_insn_transform():

(set (reg:SI 13423) (subreg:SI (reg:QI 287) 0))

LRA then spills the subreg onto the stack, which gives us:

(insn 18631 1099 1100 14 (set (reg:SI 13423)
 (subreg:SI (mem/c:QI (plus:SI (reg/f:SI 101 %sfp)
 (const_int -14269 [0xc843])) [0 %sfp+-14269 S1 A8]) 0)) x.c:104 63 {*movsi_insn}
  (expr_list:REG_DEAD (reg:QI 287)
 (nil)))

And this is where we run into trouble in simplify_operand_subreg(),
which seems to force reloads for all SUBREGs of MEM.

Normally, if there were no SUBREG here, LRA would run
process_address() over the MEMs in this instruction and all would be
well.

It is also the case that I cannot do anything special in the SPARC
move emitter to handle this, as address validation is disabled when
lra_in_progress is true.
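The ordering change suggested above (legitimize addresses before
simplifying subregs) can be sketched with a toy operand model; every
name and data structure below is a placeholder, not the real
lra-constraints.c code:

```c
#include <assert.h>

/* Toy operand model: an operand may be a SUBREG and may wrap a MEM
   whose address is not yet valid for the machine.  */
struct operand
{
  int is_subreg;
  int addr_valid;
};

/* Stand-in for LRA's process_address: make the address valid.  */
static void
process_address (struct operand *op)
{
  op->addr_valid = 1;
}

/* Two-pass ordering: pass 1 fixes every address, pass 2 handles
   subregs, so a SUBREG of MEM never reaches the subreg code with an
   out-of-range displacement still in place.  */
static void
transform (struct operand *ops, int n)
{
  for (int i = 0; i < n; i++)
    process_address (&ops[i]);
  for (int i = 0; i < n; i++)
    if (ops[i].is_subreg)
      assert (ops[i].addr_valid);
}
```

In this model, a SUBREG-of-MEM operand like the one in insn 18631 only
enters the subreg pass after its address has been processed, which is
the reordering being proposed for curr_insn_transform.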