Re: GSOC Question about the parallelization project

2018-03-21 Thread Sebastiaan Peters
>On Tue, Mar 20, 2018 at 3:49 PM, David Malcolm  wrote:
>> On Tue, 2018-03-20 at 14:02 +0100, Richard Biener wrote:
>>> On Mon, Mar 19, 2018 at 9:55 PM, Richard Biener
>>>  wrote:
>>> > On March 19, 2018 8:09:32 PM GMT+01:00, Sebastiaan Peters <7...@hotmail.com> wrote:
>>> > > > The goal should be to extend TU wise parallelism via make to
>>> > > > function
>>> > >
>>> > > wise parallelism within GCC.
>>> > >
>>> > > Could you please elaborate more on this?
>>> >
>>> > In the abstract sense you'd view the compilation process separated
>>> > into N stages, each function being processed by each. You'd assign
>>> > a thread to each stage and move the work items (the functions)
>>> > across the set of threads honoring constraints such as an IPA stage
>>> > needing all functions to have completed the previous stage. That
>>> > allows you to model the constraints due to shared state more easily
>>> > (like no pass operating on two functions at the same time) compared
>>> > to a model where you assign a thread to each function.
>>> >
>>> > You'll figure that the easiest point in the pipeline to try this
>>> > 'pipelining' is after IPA has completed and until RTL is generated.
>>> >
>>> > Ideally the pipelining would start as early as the front ends
>>> > finished parsing a function and ideally we'd have multiple
>>> > functions in the RTL pipeline.
>>> >
>>> > The main obstacles will be the global state in the compiler of
>>> > which there is the least during the GIMPLE passes (mostly cfun and
>>> > current_function_decl plus globals in the individual passes which
>>> > is easiest dealt with by not allowing a single pass to run at the
>>> > same time in multiple threads). TLS can be used for some of the
>>> > global state plus of course some global data structures need
>>> > locking.

Does this mean that all the passes have to be individually analyzed for which
global state they use, and locked/scheduled accordingly?

If this is the case, is there any documentation that describes the prerequisites
of each pass? I have looked at the internals documentation, but it seems that
all of this still has to be written.

As to how this would be implemented, the easiest way seems to be to extend the
pass macros to accept a prerequisite pass. This would also encourage more
documentation, since the dependencies would become explicit instead of the
current implicit ordering.

Assuming the dependencies for all the tree-ssa passes have to be individually
analyzed, I currently have this as my timeline:
- Parallelize the execution of the post-IPA, pre-RTL part and a few tree-ssa
passes (mid-May - early June)
- Test for possible reproducibility issues in the binary/debug info (early
June - late June)
- Parallelize the rest of tree-ssa (late June - late July)
- Update documentation and test again for reproducibility issues (late July
- early August)

Would this be acceptable?

>>> Oh, and just to mention - there are a few things that may block
>>> adoption in the end, like whether builds are still reproducible (we
>>> allocate things like DECL_UID from global pools, and doing that
>>> somewhat randomly because of threading might - but not must - change
>>> code generation).  Or that some diagnostics will appear in
>>> non-deterministic order, or that dump files are messed up (both
>>> issues could be solved by code dealing with the issue, like buffering
>>> and doing a replay in program order).  I guess reproducibility is
>>> important when it comes down to debugging code-generation issues -
>>> I'd prefer to debug gcc when it doesn't run threaded, but if that
>>> doesn't reproduce an issue that's bad.
>>>
>>> So the most important "milestone" of this project is to identify such
>>> issues and document them somewhere.
>>
>> One issue would be the garbage-collector: there are plenty of places in
>> GCC that have hidden assumptions that "a collection can't happen here"
>> (where we have temporaries that reference GC-managed objects, but which
>> aren't tracked by GC-roots).
>>
>> I had some patches for that back in 2014 that I think I managed to drop
>> on the floor (sorry):
>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01300.html
>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01340.html
>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01510.html

Would there be a way to easily create a static analyzer to find these untracked 
temporaries?

A quick look at the registered passes makes me count ~135 tree-ssa passes,
so your code for analyzing which globals are referenced where might come in
handy when analyzing whether passes can easily be parallelized.

>> The GC's allocator is used almost everywhere, and is probably not
>> thread-safe yet.
>Yes.  There's also global tree modification like chaining new
>pointer types into TYPE_POINTER_TO and friends so some
>helpers in tree.c need to be guarded as well.
>> FWIW I gave a talk at Cauldron 2013 about global state in GCC.  Beware:
>> it's five years out-of-date, 

Re: How can compiler speed-up postgresql database?

2018-03-21 Thread Richard Biener
On Tue, Mar 20, 2018 at 8:57 PM, Martin Liška  wrote:
> Hi.
>
> I did similar stats for postgresql server, more precisely for pgbench:
> pgbench -s100 & 10 runs of pgbench -t1 -v

Without looking at the benchmark: probably only because it is flawed
(i.e. not I/O or memory bandwidth limited).  It might have some
actual operations on data (regex code?) that we can speed up, though.

Richard.

> Martin


Re: How can compiler speed-up postgresql database?

2018-03-21 Thread Martin Liška

On 03/21/2018 10:26 AM, Richard Biener wrote:
> On Tue, Mar 20, 2018 at 8:57 PM, Martin Liška  wrote:
>> Hi.
>>
>> I did similar stats for postgresql server, more precisely for pgbench:
>> pgbench -s100 & 10 runs of pgbench -t1 -v
>
> Without looking at the benchmark probably only because it is flawed
> (aka not I/O or memory bandwidth limited).  It might have some
> actual operations on data (regex code?) that we can speed up though.

Well, it's not ideal as it tests quite a simple DB with just a couple of tables:
```
By default, pgbench tests a scenario that is loosely based on TPC-B, involving
five SELECT, UPDATE, and INSERT commands per transaction.
```

Note that I had pg_data in /dev/shm and I verified that CPU utilization was
100% on a single core.
That said, it should not be so misleading ;)

Martin

> Richard.
>
>> Martin




Selective scheduling and its usage

2018-03-21 Thread Martin Liška

Hello.

I noticed there are quite many selective scheduling PRs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84872
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84842
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84659

and many others.

I want to ask you if you plan to maintain the scheduling?
Is it enabled by default for any target we support?
Should we deprecate it for GCC 8?

Thank you,
Martin


Re: Selective scheduling and its usage

2018-03-21 Thread Andrey Belevantsev
Hi Martin,

On 21.03.2018 12:48, Martin Liška wrote:
> Hello.
> 
> I noticed there are quite many selective scheduling PRs:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84872
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84842
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84659
> 
> and many others.
> 
> I want to ask you if you plan to maintain the scheduling?

Yes.  The current status is that I have patches for 83530, 83962, 83913,
83480, 83972, 80463.  I don't have patches for any of the 84* issues.
I'm planning to submit the patches for the former set and to look at the
latter set next week.

I usually do most of the work by myself after internal discussions with
Alexander and other colleagues here, and there might be delays when I get
busy with unrelated stuff.  However, if there's a pressing need, we have
enough knowledgeable people to fix any sel-sched PR within a week or so.

> Is it enabled by default for any target we support?

Yes, ia64 at -O3.  The testing we usually do is as follows: bootstrap
and test on ia64, bootstrap with sel-sched enabled on x86-64, and run any
new tests from PRs on x86-64, ia64, and ppc.  This way I'm confident
that it mostly works on those platforms.

> Should we deprecate it for GCC 8?

No, I don't think so.

Best,
Andrey

> 
> Thank you,
> Martin



Re: Selective scheduling and its usage

2018-03-21 Thread Martin Liška

On 03/21/2018 11:17 AM, Andrey Belevantsev wrote:
> Hi Martin,
>
> On 21.03.2018 12:48, Martin Liška wrote:
>> Hello.
>>
>> I noticed there are quite many selective scheduling PRs:
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84872
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84842
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84659
>>
>> and many others.
>>
>> I want to ask you if you plan to maintain the scheduling?
>
> Yes.  The current status is that I have patches for 83530, 83962, 83913,
> 83480, 83972, 80463.  I don't have patches for any of the 84* issues.
> I'm planning to submit the patches for the former set and to look at the
> latter set next week.

Nice!

Maybe we can create a meta bug to track all selective scheduling issues.
May I create it?

> I usually do most of the work by myself after internal discussions with
> Alexander and other colleagues here, and there might be delays when I get
> busy with unrelated stuff.  However, if there's a pressing need, we have
> enough knowledgeable people to fix any sel-sched PR within a week or so.
>
>> Is it enabled by default for any target we support?
>
> Yes, ia64 at -O3.  The testing we usually do is as follows: bootstrap
> and test on ia64, bootstrap with sel-sched enabled on x86-64, and run any
> new tests from PRs on x86-64, ia64, and ppc.  This way I'm confident
> that it mostly works on those platforms.

Great.

>> Should we deprecate it for GCC 8?
>
> No, I don't think so.

Works for me.

Thanks for the clarification.
Martin

> Best,
> Andrey
>
>> Thank you,
>> Martin






Re: GSOC Question about the parallelization project

2018-03-21 Thread Richard Biener
On Wed, Mar 21, 2018 at 9:50 AM, Sebastiaan Peters
 wrote:
>>On Tue, Mar 20, 2018 at 3:49 PM, David Malcolm  wrote:
>>> On Tue, 2018-03-20 at 14:02 +0100, Richard Biener wrote:
 On Mon, Mar 19, 2018 at 9:55 PM, Richard Biener
  wrote:
 > On March 19, 2018 8:09:32 PM GMT+01:00, Sebastiaan Peters <7...@hotmail.com> wrote:
 > > > The goal should be to extend TU wise parallelism via make to
 > > > function
 > >
 > > wise parallelism within GCC.
 > >
 > > Could you please elaborate more on this?
 >
 > In the abstract sense you'd view the compilation process separated
 > into N stages, each function being processed by each. You'd assign
 > a thread to each stage and move the work items (the functions)
 > across the set of threads honoring constraints such as an IPA stage
 > needing all functions completed the previous stage. That allows you
 > to easier model the constraints due to shared state (like no pass
 > operating on two functions at the same time) compared to a model
 > where you assign a thread to each function.
 >
 > You'll figure that the easiest point in the pipeline to try this
 > 'pipelining' is after IPA has completed and until RTL is generated.
 >
 > Ideally the pipelining would start as early as the front ends
 > finished parsing a function and ideally we'd have multiple
 > functions in the RTL pipeline.
 >
 > The main obstacles will be the global state in the compiler of
 > which there is the least during the GIMPLE passes (mostly cfun and
 > current_function_decl plus globals in the individual passes which
 > is easiest dealt with by not allowing a single pass to run at the
 > same time in multiple threads). TLS can be used for some of the
 > global state plus of course some global data structures need
 > locking.
>
> Does this mean that all the passes have to be individually analyzed for
> which global state they use, and locked/scheduled accordingly?

Their private global state would be exempt by ensuring that a pass never
runs twice at the same time.

The global state that remains should be the same for all passes we are talking
about (during the late GIMPLE optimization phase, which I'd tackle).

> If this is the case, is there any documentation that describes the pre-reqs
> for each pass?
> I have looked at the internal documentation, however it seems that all of
> this still has to be created?

The prereqs are actually the same and not very well documented (if at all).
There's the global GC memory pool where we allocate statements and
stuff like that from (and luckily statements themselves are function private).
Then there's global trees like types ('int') where modification needs to be
appropriately guarded.  Note that "modification" means for example
building a new type for the address of 'int', given that all different
pointer types to 'int' are chained in a list rooted in the tree for 'int'.
That means (a few?) tree building helpers need to be guarded with locks.
I don't have a great idea how to identify those apart from knowing them in
advance or running into races ... my gut feeling is that there's not a lot
to guard but I may be wrong ;)

> As to how this would be implemented, it seems the easiest way would be to
> extend the macros to
> accept a pre-req pass. This would encourage more documentation since the
> dependencies
> become explicit instead of the current implicit ordering.

Actually the order is quite explicit.  Maybe I now understand your
question - no, passes do not "communicate" with each other via global state;
all such state is per function, and the execution order of passes on a given
function is hard-coded in passes.def.

> Assuming the dependencies for all the tree-ssa passes have to be
> individually analyzed.
> Currently I have this as my timeline:
> - Parallelize the execution of the post-IPA pre-RTL, and a few tree-ssa
> passes (mid-may - early june)
> - Test for possible reproducibility issues for the binary/debug info
> (early june - late june)
> - Parallelize the rest of tree-ssa (late june - late july)
> - Update documentation and test again for reproducibility issues (late
> july - early august)
>
> Would this be acceptable?

Sounds ambitious ;)  But yes, it sounds reasonable.  I don't exactly
understand what "Parallelize the rest of tree-ssa" means though.  If
you want to tackle a tiny bit more I'd rather include "RTL" by treating
the whole RTL part as a single "pass" (as said only one function can
be in RTL right now).

 Oh, and just to mention - there are a few things that may block
 adoption in the end
 like whether builds are still reproducible (we allocate things like
 DECL_UID from
 global pools and doing that somewhat randomly because of threading
 might - but not
 must - change code generation).  Or that some diagnostics will appear
 in
 non-determ

Re: Selective scheduling and its usage

2018-03-21 Thread Andrey Belevantsev
On 21.03.2018 13:31, Martin Liška wrote:
> On 03/21/2018 11:17 AM, Andrey Belevantsev wrote:
>> Hi Martin,
>>
>> On 21.03.2018 12:48, Martin Liška wrote:
>>> Hello.
>>>
>>> I noticed there are quite many selective scheduling PRs:
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84872
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84842
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84659
>>>
>>> and many others.
>>>
>>> I want to ask you if you plan to maintain the scheduling?
>>
>> Yes.  The current status is that I have patches for 83530, 83962, 83913,
>> 83480, 83972, 80463.  I don't have patches for any of the 84* issues.
>> I'm planning to submit the patches for the former set and to look at the
>> latter set next week.
> 
> Nice!
> 
> Maybe we can create a meta bug to track all sel. scheduling issue.
> May I create it?

Yes, of course.  I would be happy, because I can easily lose track (I have
some queries in Bugzilla about scheduling but I don't monitor the whole
gcc-bugs traffic).  In fact, I wasn't aware of some PRs from your list until
your mail.

Best,
Andrey

> 
>>
>> I usually do most of the work by myself after internal discussions with
>> Alexander and other colleagues here, and there might be delays when I get
>> busy with unrelated stuff.  However, if there's a pressing need, we have
>> enough knowledgeable people to fix any sel-sched PR within a week or so.
>>
>>> Is it enabled by default for any target we support?
>>
>> Yes, ia64 at -O3.  The testing we make usually is like follows: bootstrap
>> and test on ia64, bootstrap with sel-sched enabled on x86-64, and make any
>> new tests from PRs be run on x86-64, ia64, and ppc.  This way I'm confident
>> that it mostly works on those platforms.
> 
> Great.
> 
>>
>>> Should we deprecate it for GCC 8?
>>
>> No, I don't think so.
> 
> Works for me.
> 
> Thanks for clarification.
> Martin
> 
>>
>> Best,
>> Andrey
>>
>>>
>>> Thank you,
>>> Martin
>>
> 



Re: How can compiler speed-up postgresql database?

2018-03-21 Thread Jan Hubicka
> On 03/21/2018 10:26 AM, Richard Biener wrote:
> >On Tue, Mar 20, 2018 at 8:57 PM, Martin Liška  wrote:
> >>Hi.
> >>
> >>I did similar stats for postgresql server, more precisely for pgbench:
> >>pgbench -s100 & 10 runs of pgbench -t1 -v
> >
> >Without looking at the benchmark probably only because it is flawed
> >(aka not I/O or memory bandwidth limited).  It might have some
> >actual operations on data (regex code?) that we can speed up though.
> 
> Well, it's not ideal as it tests quite a simple DB with just a couple of tables:
> ```
> By default, pgbench tests a scenario that is loosely based on TPC-B, 
> involving five SELECT, UPDATE, and INSERT commands per transaction.
> ```
> 
> Note that I had pg_data in /dev/shm and I verified that CPU utilization was 
> 100% on a single core.
> That said, it should not be so misleading ;)

Well, it is usually easy to run perf and look at what the hot spots look like.
I see similar speedups for common page loading in Firefox and similar
benchmarks, which are quite good.  Not everything needs to be designed to be
memory bound like SPEC.

Honza
> 
> Martin
> 
> >
> >Richard.
> >
> >>Martin
> 


GSOC proposal

2018-03-21 Thread Ismael El Houas Ghouddana
Dear Mr./Mrs.,

First of all, I really appreciate your time and attention. I am Ismael El
Houas, an aerospace engineering student with knowledge of Google Cloud
Platform, and I want to express my interest in working on your project.

Secondly, I want to ask if I am still in time to apply to this project;
unfortunately, I was not aware of GSoC until yesterday. In case I am still
able to apply for it, I will make the proposal as soon as possible.

Finally, many thanks for your attention.

Yours faithfully,

Ismael El Houas


Re: GSOC Question about the parallelization project

2018-03-21 Thread Sebastiaan Peters
>On Wed, Mar 21, 2018 at 9:50 AM, Sebastiaan Peters
> wrote:
>>>On Tue, Mar 20, 2018 at 3:49 PM, David Malcolm  wrote:
 On Tue, 2018-03-20 at 14:02 +0100, Richard Biener wrote:
> On Mon, Mar 19, 2018 at 9:55 PM, Richard Biener
>  wrote:
> > On March 19, 2018 8:09:32 PM GMT+01:00, Sebastiaan Peters <7...@hotmail.com> wrote:
> > > > The goal should be to extend TU wise parallelism via make to
> > > > function
> > >
> > > wise parallelism within GCC.
> > >
> > > Could you please elaborate more on this?
> >
> > In the abstract sense you'd view the compilation process separated
> > into N stages, each function being processed by each. You'd assign
> > a thread to each stage and move the work items (the functions)
> > across the set of threads honoring constraints such as an IPA stage
> > needing all functions completed the previous stage. That allows you
> > to easier model the constraints due to shared state (like no pass
> > operating on two functions at the same time) compared to a model
> > where you assign a thread to each function.
> >
> > You'll figure that the easiest point in the pipeline to try this
> > 'pipelining' is after IPA has completed and until RTL is generated.
> >
> > Ideally the pipelining would start as early as the front ends
> > finished parsing a function and ideally we'd have multiple
> > functions in the RTL pipeline.
> >
> > The main obstacles will be the global state in the compiler of
> > which there is the least during the GIMPLE passes (mostly cfun and
> > current_function_decl plus globals in the individual passes which
> > is easiest dealt with by not allowing a single pass to run at the
> > same time in multiple threads). TLS can be used for some of the
> > global state plus of course some global data structures need
> > locking.
>>
>> Does this mean that all the passes have to be individually analyzed for
>> which global state they use, and locked/scheduled accordingly?
>
>Their private global state would be exempt by ensuring that a pass never
>runs twice at the same time.
>
>The global state that remains should be the same for all passes we are talking
>about (during the late GIMPLE optimization phase, which I'd tackle).
>
>> If this is the case, is there any documentation that describes the pre-reqs
>> for each pass?
>> I have looked at the internal documentation, however it seems that all of
>> this still has to be created?
>
>The prereqs are actually the same and not very well documented (if at all).
>There's the global GC memory pool where we allocate statements and
>stuff like that from (and luckily statements themselves are function private).
>Then there's global trees like types ('int') where modification needs to be
>appropriately guarded.  Note that "modification" means for example
>building a new type for the address of 'int', given that all different
>pointer types to 'int' are chained in a list rooted in the tree for 'int'.
>That means (a few?) tree building helpers need to be guarded with locks.
>I don't have a great idea how to identify those apart from knowing them in
>advance or running into races ... my gut feeling is that there's not a lot
>to guard but I may be wrong ;)

What does it mean to be a node of a type tree?
Does it describe information about that type,
or does it keep a reference to where something
of that type has been declared?

>> As to how this would be implemented, it seems the easiest way would be to
>> extend the macros to
>> accept a pre-req pass. This would encourage more documentation since the
>> dependencies
>> become explicit instead of the current implicit ordering.
>
>Actually the order is quite explicit.  Maybe I now understand your
>question - no, passes do not "communicate" with each other via global state;
>all such state is per function, and the execution order of passes on a given
>function is hard-coded in passes.def.
>
>> Assuming the dependencies for all the tree-ssa passes have to be
>> individually analyzed.
>> Currently I have this as my timeline:
>>     - Parallelize the execution of the post-IPA pre-RTL, and a few tree-ssa
>> passes (mid-may - early june)
>>     - Test for possible reproducibility issues for the binary/debug info
>> (early june - late june)
>>     - Parallelize the rest of tree-ssa (late june - late july)
>>     - Update documentation and test again for reproducibility issues (late
>> july - early august)
>>
>> Would this be acceptable?
>
>Sounds ambitious ;)  But yes, it sounds reasonable.  I don't exactly
>understand what "Parallelize the rest of tree-ssa" means though.  If
>you want to tackle a tiny bit more I'd rather include "RTL" by treating
>the whole RTL part as a single "pass" (as said only one function can
>be in RTL right now).
>

I was under the assumption that passes had to be individually analysed
when I wrote that. The timeline is 

gcc-6-20180321 is now available

2018-03-21 Thread gccadmin
Snapshot gcc-6-20180321 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/6-20180321/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 6 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-6-branch 
revision 258751

You'll find:

 gcc-6-20180321.tar.xz    Complete GCC

  SHA256=0a8936af8c9e159cac2d8f4a53891e3a905919780dc693a24735bb5cfec94777
  SHA1=6d38b81716fefe79b838eba91ec17257ccf42a35

Diffs from 6-20180314 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-6
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.