[GSoC 2020] Automatic Detection of Parallel Compilation Viability

2020-03-13 Thread Giuliano Belinassi via Gcc
Hi, all

I want to propose and apply for the following GSoC project: Automatic
Detection of Parallel Compilation Viability.

https://www.ime.usp.br/~belinass/Automatic_Detection_of_Parallel_Compilation_Viability.pdf

Feedback is welcome :)

Here is a markdown version of it:

**Automatic Detection of Parallel Compilation Viability**

[Giuliano Belinassi]{style="color: darkgreen"}\
Timezone: GMT$-$3:00\
University of São Paulo -- Brazil\
IRC: giulianob in \#gcc\
Email: [`giuliano.belina...@usp.br`](mailto:giuliano.belina...@usp.br)\
Github: \
Date:

About Me: Computer Science Bachelor (University of São Paulo), currently
pursuing a Master's Degree in Computer Science at the same institution.
I've always been fascinated by topics such as High-Performance Computing
and Code Optimization, having worked on a parallel GPU implementation of
a Boundary Elements Method software. I am currently conducting
research on compiler parallelization and developing the
[ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc) project, having
already presented it in [GNU Cauldron
2019](https://www.youtube.com/watch?v=jd6R3IK__1Q).

**Skills**: Strong knowledge in C, Concurrency, Shared Memory
Parallelism, Multithreaded Debugging and other typical programming
tools.

Brief Introduction

In [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc), we showed that
parallelizing the Intra Procedural optimizations improves speed by a
factor of 1.8x when compiling huge files on a 4-core machine, and also
showed that these optimizations account for 75% of the compilation time.
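
As a back-of-the-envelope check (an assumption on my part, not an extra
measurement from ParallelGcc): if the parallelized intra procedural
passes account for $p = 0.75$ of the compilation time, Amdahl's law
bounds the speedup on $n = 4$ cores at

$$S(4) = \frac{1}{(1 - p) + p/n} = \frac{1}{0.25 + 0.75/4} \approx 2.3,$$

so the observed 1.8x is reasonably close to this limit.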

In this project we plan to use the LTO infrastructure to improve
compilation performance in the non-LTO case, with the tradeoff that the
generated binary is only as good as when LTO is disabled. Here, we will
automatically detect when a single file will benefit from parallelism,
and proceed with the compilation in parallel if so.

Use of LTO

Link Time Optimization (LTO) is a compilation technique that allows
the compiler to analyse the program as a whole, instead of analysing and
compiling one file at a time. Therefore, LTO is able to collect more
information about the program and generate a better optimization plan.
LTO is divided into three parts:

-   *LGEN (Local Generation)*: Each file is translated to GIMPLE. This
stage runs sequentially in each file and, therefore, in parallel in
the project compilation.

-   *WPA (Whole Program Analysis)*: Run the Inter Procedural Analysis
(IPA) on the entire program. This stage runs serially over the
project.

-   *LTRANS (Local Transformation)*: Execute all Intra Procedural
Optimizations in each partition. This stage runs in parallel.

Since WPA runs serially over the entire project, it can bottleneck the
compilation: LTO was designed to produce faster binaries, not to produce
binaries fast.

Here, the proposed use of LTO to address this problem is to run the IPA
separately for each Translation Unit (TU), instead of on the Whole
Program, and automatically detect when to partition the TU into multiple
LTRANS partitions to improve compilation performance. The advantages of
this approach are:

-   It can generate binaries as good as when LTO is disabled.

-   It is faster, as we can partition big files into multiple partitions
and compile these partitions in parallel.

-   It can interact with GNU Make Jobserver, improving CPU utilization.

Planned Tasks

I plan to use the GSoC time to develop the following topics:

-   Week \[1, 3\] -- April 27 to May 15:\
Update `cc1`, `cc1plus`, `f771`, ..., to partition the data after
IPA analysis directly into multiple LTRANS partitions, instead of
generating a temporary GIMPLE file.

-   Week \[4, 7\] -- May 18 to June 12:\
Update the `gcc` driver to take these multiple LTRANS partitions,
then call the compiler and assembler for each of them, and merge the
results into one object file. Here I will use the LTO LTRANS object
streaming; therefore, it should interact with the GNU Make Jobserver.

-   Week 8 -- June 15 to 19: **First Evaluation**\
Deliver a non-optimized version of the project. Some programs ought
to be compiled correctly, but there will probably be a huge overhead
because so far there will not be any criterion for when to
partition. Some tests are also planned for this evaluation.

-   Week \[9, 11\] -- June 22 to July 10:\
Implement a criterion for when to partition, and iteratively
improve it based on data.

-   Week 12 --- July 13 to 17: **Second Evaluation**\
Deliver a more optimized version of the project. Here we should
filter files that would compile fast from files that would require
partitioning, and therefore we should see some speedup.

-   Week \[13, 15\] --- July 20 to August 10:\
Develop adequate test coverage and address unexpected issues so
that this feature can be merged into trunk for the next GCC release.

-   Week 16: **Final evaluation**\
Deliver the final product as a series of patches for trunk.

Re: GSOC

2020-03-16 Thread Giuliano Belinassi via Gcc
Hi, Shivam.

On 03/15, shivam tiwari via Gcc wrote:
> In Which Format I have to Send Gsoc Proposal On this Mail PDF or DOC Format

Perhaps the best way to write your proposal is to write it in LaTeX,
then convert it to Markdown using pandoc, and attach it to the mailing
list. One example of this approach is:

https://gcc.gnu.org/pipermail/gcc/2020-March/231825.html

Thank you,
Giuliano.


Re: [GSoC 2020] Automatic Detection of Parallel Compilation Viability

2020-03-17 Thread Giuliano Belinassi via Gcc
Hi, Richi

Thank you for your review!

On 03/16, Richard Biener wrote:
> On Fri, 13 Mar 2020, Giuliano Belinassi wrote:
> 
> > Hi, all
> > 
> > I want to propose and apply for the following GSoC project: Automatic
> > Detection of Parallel Compilation Viability.
> > 
> > Here is the proposal, and I am attaching a pdf file for better
> > readability:
> > 
> > **Automatic Detection of Parallel Compilation Viability**
> > 
> > [Giuliano Belinassi]{style="color: darkgreen"}\
> > Timezone: GMT$-$3:00\
> > University of São Paulo -- Brazil\
> > IRC: giulianob in \#gcc\
> > Email: [`giuliano.belina...@usp.br`](mailto:giuliano.belina...@usp.br)\
> > Github: \
> > 
> > About Me: Computer Science Bachelor (University of São Paulo), currently
> > pursuing a Masters Degree in Computer Science at the same institution.
> > I've always been fascinated by topics such as High-Performance Computing
> > and Code Optimization, having worked with a parallel implementation of a
> > Boundary Elements Method software in GPU. I am currently conducting
> > research on compiler parallelization and developing the
> > [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc) project, having
> > already presented it in [GNU Cauldron
> > 2019](https://www.youtube.com/watch?v=jd6R3IK__1Q).
> > 
> > **Skills**: Strong knowledge in C, Concurrency, Shared Memory
> > Parallelism, Multithreaded Debugging and other typical programming
> > tools.
> > 
> > Brief Introduction
> > 
> > In [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc), we showed that
> > parallelizing the Intra Procedural optimizations improves speed when
> > compiling huge files by a factor of 1.8x in a 4 cores machine, and also
> > showed that this takes 75% of compilation time.
> > 
> > In this project we plan to use the LTO infrastructure to improve
> > compilation performance in the non-LTO case, with a tradeoff of
> > generating a binary as good as if LTO is disabled. Here, we will
> > automatically detect when a single file will benefit from parallelism,
> > and proceed with the compilation in parallel if so.
> > 
> > Use of LTO
> > 
> > The Link Time Optimization (LTO) is a compilation technique that allows
> > the compiler to analyse the program as a whole, instead of analysing and
> > compiling one file at time. Therefore, LTO is able to collect more
> > information about the program and generate a better optimization plan.
> > LTO is divided in three parts:
> > 
> > -   *LGEN (Local Generation)*: Each file is translated to GIMPLE. This
> > stage runs sequentially in each file and, therefore, in parallel in
> > the project compilation.
> > 
> > -   *WPA (Whole Program Analysis)*: Run the Inter Procedural Analysis
> > (IPA) in the entire program. This state runs serially in the
> > project.
> > 
> > -   *LTRANS (Local Transformation)*: Execute all Intra Procedural
> > Optimizations in each partition. This stage runs in parallel.
> > 
> > Since WPA can bottleneck the compilation because it runs serially in the
> > entire project, LTO was designed to produce faster binaries, not to
> > produce binaries fast.
> > 
> > Here, the proposed use of LTO to address this problem is to run the IPA
> > for each Translation Unit (TU), instead in the Whole Program, and
> > automatically detect when to partition the TU into multiple LTRANS to
> > improve performance. The advantage of this approach is:
> 
> "to improve compilation performance"
> 
> > -   It can generate binaries as good as when LTO is disabled.
> > 
> > -   It is faster, as we can partition big files into multiple partitions
> > and compile these partitions in parallel
> > 
> > -   It can interact with GNU Make Jobserver, improving CPU utilization.
> 
> The previous already improves CPU utilization, I guess GNU make jobserver
> integration avoids CPU overcommit.
> 
> > Planned Tasks
> > 
> > I plan to use the GSoC time to develop the following topics:
> > 
> > -   Week \[1, 3\] -- April 27 to May 15:\
> > Update `cc1`, `cc1plus`, `f771`, ..., to partition the data after
> > IPA analysis directly into multiple LTRANS partitions, instead of
> > generating a temporary GIMPLE file.
> 
> To summarize in my own words:
> 
>   After IPA analysis partition the CU into possibly multiple LTRANS 
>   partitions even for non-LTO compilations. Invoke LTRANS compilation
>   for partitions 2..n without writing intermediate IL through mechanisms
>   like forking.
> 
> I might say that you could run into "issues" here with asm_out_file
> already opened and partially written to.  Possibly easier (but harder
> on the driver side) would be to stream LTO LTRANS IL for partitions
> 2..n and handle those like with regular LTO operation.  But I guess
> I'd try w/o writing IL first and only if it turns out too difficult
> go the IL writing way.

Ok. I changed the application text based on that.

> 
> > -   Week \[4, 7\] -- May 18 to June 12:\
> > Update the `gcc` driver 

Re: [GSoC 2020] Automatic Detection of Parallel Compilation Viability

2020-03-17 Thread Giuliano Belinassi via Gcc
Hi, all

I have applied some revisions from the review to the proposal. Please
see the new version here:

https://www.ime.usp.br/~belinass/Automatic_Detection_of_Parallel_Compilation_Viability.pdf

**Automatic Detection of Parallel Compilation Viability**

[Giuliano Belinassi]{style="color: darkgreen"}\
Timezone: GMT$-$3:00\
University of São Paulo -- Brazil\
IRC: giulianob in \#gcc\
Email: [`giuliano.belina...@usp.br`](mailto:giuliano.belina...@usp.br)\
Github: \
Date:

About Me: Computer Science Bachelor (University of São Paulo), currently
pursuing a Master's Degree in Computer Science at the same institution.
I've always been fascinated by topics such as High-Performance Computing
and Code Optimization, having worked on a parallel GPU implementation of
a Boundary Elements Method software. I am currently conducting
research on compiler parallelization and developing the
[ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc) project, having
already presented it in [GNU Cauldron
2019](https://www.youtube.com/watch?v=jd6R3IK__1Q).

**Skills**: Strong knowledge in C, Concurrency, Shared Memory
Parallelism, Multithreaded Debugging and other typical programming
tools.

Brief Introduction

In [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc), we showed that
parallelizing the Intra Procedural optimizations improves speed by a
factor of 1.8x when compiling huge files on a 4-core machine, and also
showed that these optimizations account for 75% of the compilation time.

In this project we plan to use the LTO infrastructure to improve
compilation performance in the non-LTO case, with the tradeoff that the
generated binary is only as good as when LTO is disabled. Here, we will
automatically detect when a single file will benefit from parallelism,
and proceed with the compilation in parallel if so.

Use of LTO

Link Time Optimization (LTO) is a compilation technique that allows
the compiler to analyse the program as a whole, instead of analysing and
compiling one file at a time. Therefore, LTO is able to collect more
information about the program and generate a better optimization plan.
LTO is divided into three parts:

-   *LGEN (Local Generation)*: Each file is translated to GIMPLE. This
stage runs sequentially in each file and, therefore, in parallel in
the project compilation.

-   *WPA (Whole Program Analysis)*: Run the Inter Procedural Analysis
(IPA) on the entire program. This stage runs serially over the
project.

-   *LTRANS (Local Transformation)*: Execute all Intra Procedural
Optimizations in each partition. This stage runs in parallel.

Since WPA runs serially over the entire project, it can bottleneck the
compilation: LTO was designed to produce faster binaries, not to produce
binaries fast.

Here, the proposed use of LTO to address this problem is to run the IPA
separately for each Translation Unit (TU), instead of on the Whole
Program, and automatically detect when to partition the TU into multiple
LTRANS partitions to improve compilation performance. The advantages of
this approach are:

-   It can generate binaries as good as when LTO is disabled.

-   It is faster, as we can partition big files into multiple partitions
and compile these partitions in parallel.

-   It can interact with GNU Make Jobserver, improving CPU utilization.

Planned Tasks

I plan to use the GSoC time to develop the following topics:

-   Week \[1, 3\] -- April 27 to May 15:\
Update `cc1`, `cc1plus`, `f771`, ..., to partition the Compilation
Unit (CU) after IPA analysis directly into multiple LTRANS
partitions, instead of generating a temporary GIMPLE file, and to
accept an additional parameter `-fsplit-outputs=`, to which the
generated ASM filenames will be written.

There are two possible cases in which I could work on:

1.  *Fork*: After the CU is partitioned into multiple LTRANS
partitions, `cc1` will fork and compile these partitions, each of
them generating an ASM file, and write the generated asm name into
``. Note that if the number of partitions is one, then
this step is not necessary.

2.  *Stream LTRANS IR*: After the CU is partitioned into multiple
LTRANS partitions, `cc1` will write these partitions to disk so
that LTO can read these files and proceed as a standard LTO
operation in order to generate a partially linked object file.

Approach 1 has the advantage of having less overhead than approach 2,
as there are fewer IO operations; however, it may be harder to
implement, as the assembler file may already be open before forking, so
caution is necessary to make sure that there is a 1-to-1 relationship
between the assembler file and the compilation process. Approach 2, on
the other hand, can easily interact with the GNU Jobserver.

-   Week \[4, 7\] -- May 18 to June 12:\
Update the `gcc` driver to take each file in ``, then
assemble and partially link them together. Here, an important
optimization is to use a named pipe in `` to avoid having to
wait for every partition to finish its compilation before assembling the
files.

Re: [GSoC 2020] Automatic Detection of Parallel Compilation Viability

2020-03-23 Thread Giuliano Belinassi via Gcc
Hi, Richi

On 03/18, Richard Biener wrote:
> On Tue, 17 Mar 2020, Giuliano Belinassi wrote:
> 
> > Hi, all
> > 
> > I have applied some revews to the project. Please see the new proposal
> > here:
> 
> Looks good, some editorial changes below
> 
> > https://www.ime.usp.br/~belinass/Automatic_Detection_of_Parallel_Compilation_Viability.pdf
> > 
> > **Automatic Detection of Parallel Compilation Viability**
> > 
> > [Giuliano Belinassi]{style="color: darkgreen"}\
> > Timezone: GMT$-$3:00\
> > University of São Paulo -- Brazil\
> > IRC: giulianob in \#gcc\
> > Email: [`giuliano.belina...@usp.br`](mailto:giuliano.belina...@usp.br)\
> > Github: \
> > Date:
> > 
> > About Me Computer Science Bachelor (University of São Paulo), currently
> > pursuing a Masters Degree in Computer Science at the same institution.
> > I've always been fascinated by topics such as High-Performance Computing
> > and Code Optimization, having worked with a parallel implementation of a
> > Boundary Elements Method software in GPU. I am currently conducting
> > research on compiler parallelization and developing the
> > [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc) project, having
> > already presented it in [GNU Cauldron
> > 2019](https://www.youtube.com/watch?v=jd6R3IK__1Q).
> > 
> > **Skills**: Strong knowledge in C, Concurrency, Shared Memory
> > Parallelism, Multithreaded Debugging and other typical programming
> > tools.
> > 
> > Brief Introduction
> > 
> > In [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc), we showed that
> > parallelizing the Intra Procedural optimizations improves speed when
> > compiling huge files by a factor of 1.8x in a 4 cores machine, and also
> > showed that this takes 75% of compilation time.
> > 
> > In this project we plan to use the LTO infrastructure to improve
> > compilation performance in the non-LTO case, with a tradeoff of
> > generating a binary as good as if LTO is disabled. Here, we will
> > automatically detect when a single file will benefit from parallelism,
> > and procceed with the compilation in parallel if so.
> > 
> > Use of LTO
> > 
> > The Link Time Optimization (LTO) is a compilation technique that allows
> > the compiler to analyse the program as a whole, instead of analysing and
> > compiling one file at time. Therefore, LTO is able to collect more
> > information about the program and generate a better optimization plan.
> > LTO is divided in three parts:
> > 
> > -   *LGEN (Local Generation)*: Each file is translated to GIMPLE. This
> > stage runs sequentially in each file and, therefore, in parallel in
> > the project compilation.
> > 
> > -   *WPA (Whole Program Analysis)*: Run the Inter Procedural Analysis
> > (IPA) in the entire program. This state runs serially in the
> > project.
> > 
> > -   *LTRANS (Local Transformation)*: Execute all Intra Procedural
> > Optimizations in each partition. This stage runs in parallel.
> > 
> > Since WPA can bottleneck the compilation because it runs serially in the
> > entire project, LTO was designed to produce faster binaries, not to
> > produce binaries fast.
> > 
> > Here, the proposed use of LTO to address this problem is to run the IPA
> > for each Translation Unit (TU), instead in the Whole Program, and
> 
> This proposal is to use LTO to produce binaries fast by running
> the IPA phase separately for each Translation Unit (TU), instead of on the 
> Whole Program and ...
> 
> > automatically detect when to partition the TU into multiple LTRANS to
> > improve compilation performance. The advantage of this approach is:
> > 
> > -   It can generate binaries as good as when LTO is disabled.
> > 
> > -   It is faster, as we can partition big files into multiple partitions
> > and compile these partitions in parallel.
> > 
> > -   It can interact with GNU Make Jobserver, improving CPU utilization.
> 
> This reads a bit odd, regular compilation already interacts with the
> GNU Make Jobserver.  I'd reorder and reword it w/o dashes like
> 
> We can partition big files into multiple partitions and compile these 
> partitions in parallel which should improve CPU utilization by exposing
> smaller chunks to the GNU Make Jobserver.  Code generation quality
> should be unaffected by this.

How about:

```
The advantage of this approach is: by partitioning big files into
multiple partitions, we can improve the compilation performance by
exposing these partitions to the Jobserver. Therefore, it can improve
CPU utilization on many-core machines.  Generated code quality should be
unaffected by this procedure, which means that it should run as fast as
when LTO is disabled.
```
?

> 
> Thanks,
> Richard.
> 
> > Planned Tasks
> > 
> > I plan to use the GSoC time to develop the following topics:
> > 
> > -   Week \[1, 3\] -- April 27 to May 15:\
> > Update `cc1`, `cc1plus`, `f771`, ..., to partition the Compilation
> > Unit (CU) after IPA analysis directly into multiple LTRANS

Re: [GSoC 2020] Automatic Detection of Parallel Compilation Viability

2020-03-24 Thread Giuliano Belinassi via Gcc
Hi, all.

I am updating the timeline, since it was shifted due to SARS-CoV-2. Here
is the updated version:

-   Week \[1, 4\] -- May 4 to May 27:\
Update `cc1`, `cc1plus`, `f771`, ..., to partition the Compilation
Unit (CU) after IPA analysis directly into multiple LTRANS
partitions, instead of generating a temporary GIMPLE file, and to
accept an additional parameter `-fsplit-outputs=`, to which the
generated ASM filenames will be written.

There are two possible cases in which I could work on:

1.  *Fork*: After the CU is partitioned into multiple LTRANS
partitions, `cc1` will fork and compile these partitions, each of
them generating an ASM file, and write the generated asm name into
``. Note that if the number of partitions is one, then
this step is not necessary.

2.  *Stream LTRANS IR*: After the CU is partitioned into multiple
LTRANS partitions, `cc1` will write these partitions to disk so
that LTO can read these files and proceed as a standard LTO
operation in order to generate a partially linked object file.

Approach 1 has the advantage of having less overhead than approach 2,
as there are fewer IO operations; however, it may be harder to
implement, as the assembler file may already be open before forking, so
caution is necessary to make sure that there is a 1-to-1 relationship
between the assembler file and the compilation process. Approach 2, on
the other hand, can easily interact with the GNU Jobserver (see the
sketch after this timeline).

-   Week \[5, 8\] -- June 1 to June 26:\
Update the `gcc` driver to take each file in ``, then
assemble and partially link them together. Here, an important
optimization is to use a named pipe in `` to avoid having to
wait for every partition to finish its compilation before assembling
the files (see the sketch after this timeline).

-   Week 9 -- June 29 to July 3: **First Evaluation**\
Deliver a non-optimized version of the project. Some programs ought
to be compiled correctly, but there will probably be a huge overhead
because so far there is no way of interacting with the GNU Jobserver.

-   Week \[10, 12\] -- July 6 to July 24:\
Work on GNU Make Jobserver integration. A way of doing this is to
adapt the LTO WPA $\rightarrow$ LTRANS way of interacting with
Jobserver. Another way is to make the forked `cc1` consume Jobserver
tokens until the compilation finishes, then return the token when
done.

-   Week 13 -- July 27 to 31: **Second Evaluation**\
Deliver a more optimized version of the project. Here we should
filter files that would compile fast from files that would require
partitioning, and interact with GNU Jobserver. Therefore we should
see some speedup.

-   Week \[14, 16\] -- August 3 to 21:\
Develop adequate test coverage and address unexpected issues so
that this feature can be merged into trunk for the next GCC release.

-   Week 17 -- August 24 to 31: **Final evaluation**\
Deliver the final product as a series of patches for trunk.
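
For illustration purposes, here is a minimal sketch of the fork approach
from Week \[1, 4\] combined with the named-pipe handoff from
Week \[5, 8\]. It is not the actual implementation and every name in it
is hypothetical: `cc1` would fork one child per LTRANS partition, each
child emits its own assembler file and reports the file name through the
`-fsplit-outputs` stream, which the driver can open as a named pipe to
start assembling early.

```
/* Minimal illustrative sketch (hypothetical names, not GCC code):
   fork one child per partition; each child "compiles" its partition
   into its own assembler file and reports the file name through the
   split-outputs stream so the driver can assemble it right away.  */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void
compile_partition (int part, const char *asm_name)
{
  /* Placeholder for running the intra procedural passes on partition
     PART and writing the resulting assembly to ASM_NAME.  */
  (void) part;
  (void) asm_name;
}

int
main (void)
{
  int nparts = 4;                       /* chosen by the partitioner */

  /* In the real design this would be the file (or named pipe) passed
     via -fsplit-outputs; the driver reads assembler names from it.  */
  FILE *out = fopen ("split-outputs.txt", "w");
  if (!out)
    return 1;

  for (int i = 0; i < nparts; i++)
    {
      pid_t pid = fork ();
      if (pid == 0)
        {
          char asm_name[64];
          snprintf (asm_name, sizeof asm_name, "partition-%d.s", i);
          compile_partition (i, asm_name);

          /* Tell the driver this partition is ready to be assembled.  */
          fprintf (out, "%s\n", asm_name);
          fflush (out);
          _exit (0);
        }
      else if (pid < 0)
        {
          perror ("fork");
          return 1;
        }
    }

  /* Wait for every child; a driver reading the named pipe may already
     be assembling the partitions that finished first.  */
  while (wait (NULL) > 0)
    ;
  fclose (out);
  return 0;
}
```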

Thank you,
Giuliano.

On 03/24, Richard Biener wrote:
> On Mon, 23 Mar 2020, Giuliano Belinassi wrote:
> 
> > Hi, Richi
> > 
> > On 03/18, Richard Biener wrote:
> > > On Tue, 17 Mar 2020, Giuliano Belinassi wrote:
> > > 
> > > > Hi, all
> > > > 
> > > > I have applied some revews to the project. Please see the new proposal
> > > > here:
> > > 
> > > Looks good, some editorial changes below
> > > 
> > > > https://www.ime.usp.br/~belinass/Automatic_Detection_of_Parallel_Compilation_Viability.pdf
> > > > 
> > > > **Automatic Detection of Parallel Compilation Viability**
> > > > 
> > > > [Giuliano Belinassi]{style="color: darkgreen"}\
> > > > Timezone: GMT$-$3:00\
> > > > University of São Paulo -- Brazil\
> > > > IRC: giulianob in \#gcc\
> > > > Email: [`giuliano.belina...@usp.br`](mailto:giuliano.belina...@usp.br)\
> > > > Github: \
> > > > Date:
> > > > 
> > > > About Me Computer Science Bachelor (University of São Paulo), currently
> > > > pursuing a Masters Degree in Computer Science at the same institution.
> > > > I've always been fascinated by topics such as High-Performance Computing
> > > > and Code Optimization, having worked with a parallel implementation of a
> > > > Boundary Elements Method software in GPU. I am currently conducting
> > > > research on compiler parallelization and developing the
> > > > [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc) project, having
> > > > already presented it in [GNU Cauldron
> > > > 2019](https://www.youtube.com/watch?v=jd6R3IK__1Q).
> > > > 
> > > > **Skills**: Strong knowledge in C, Concurrency, Shared Memory
> > > > Parallelism, Multithreaded Debugging and other typical programming
> > > > tools.
> > > > 
> > > > Brief Introduction
> > > > 
> > > > In [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc), we showed that
> > > > parallelizing the Intra Procedural optimizations improves speed when
> > > > compiling huge files by a factor of 1.8x in a 4 cores machine, an

Re: Help implementing support for vec in gengtype

2020-04-20 Thread Giuliano Belinassi via Gcc
Hi. Sorry for the late reply.

On 03/02, Richard Biener wrote:
> On Thu, Feb 27, 2020 at 6:56 PM Giuliano Belinassi
>  wrote:
> >
> > Hi, all.
> >
> > I am tying to fix an issue with a global variable in the parallel gcc
> > project. For this, I am trying to move some global variables from
> > tree-ssa-operands to struct function. One of this variable is a
> > vec type, and gengtype doesn't look so happy with it.
> 
> I think the solution for this is to not move it to struct function
> but instead have it local to function scope - the state is per
> GIMPLE stmt we process and abstracting what is
> cleaned up in cleanup_build_arrays() into a class we can
> instantiate in the two callers is most appropriate.  In theory
> all the functions that access the state could move into the
> class as methods as well but you can pass down the state
> everywhere needed as well.

I implemented this strategy, but the issue remains. Therefore, the
cause of it must be something else.

Just to contextualize: in [1], I also implemented parallelism in
ParallelGcc using a pipeline method for testing, where I split the set
of GIMPLE Intra Procedural Optimizations into multiple sets and assigned
each set to a thread.  Then, each function passes through this pipeline.

Now, I am trying to make this version pass the testsuite. There is one
test in particular ('gcc.dg/20031102-1.c') for which I am having
difficulties finding the cause of the failure.

If I run:

/tmp/gcc10_parallel/usr/local/bin/gcc --param=num-threads=2 -O2 -c 20031102-1.c

The crash message is:

```
>
used static ignored external VOID :0:0
align:8 warn_if_not_align:0>

In function ‘main’:
cc1: error: virtual use of statement not up to date
# VUSE <.MEM_1(D)>
_2 = FooBar ();
during GIMPLE pass: walloca
cc1: internal compiler error: verify_ssa failed
0xfdb8fe verify_ssa(bool, bool)
../../gcc/gcc/tree-ssa.c:1208
0xcd3d08 execute_function_todo
../../gcc/gcc/passes.c:2017
0xcd49f2 execute_todo
../../gcc/gcc/passes.c:2064
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See  for instructions.
```

This is triggered by tree-ssa-operands.c:1066 in this branch, which checks
whether build_vuse != use. Interestingly, this crash does not happen if:

1 - I set the number of threads to 1.
2 - I set the optimization level to O0, O1 or O3.
3 - I disable O2, but enable all flags enabled by O2
(gcc -O2 -Q --help=optimizer).
4 - I leave the first 115 passes in the same thread with a parameter I
implemented (--param=num-threads=2 --param=break=116). Any value
smaller than this causes the issue.

The crash is also consistent, which means that it happens 100% of the time.

Any light concerning this issue is welcome. :)

If anyone wants to reproduce the issue, you can clone this branch, then compile
it with --disable-bootstrap --enable-languages=c, and install it like trunk's GCC.

[1] - https://gitlab.com/flusp/gcc/-/tree/giulianob_parallel_pipeline


Thank you,
Giuliano.

> 
> Richard.
> 
> > In this context, I am trying to add support to vec to gengtype.
> > Therefore, I first fixed a problem where gengtype couldn't find the
> > tree union by:
> >
> > diff --git a/gcc/gengtype.c b/gcc/gengtype.c
> > index 53317337cf8..6f4c77020ea 100644
> > --- a/gcc/gengtype.c
> > +++ b/gcc/gengtype.c
> > @@ -638,7 +638,10 @@ create_user_defined_type (const char *type_name, 
> > struct fil
> > eloc *pos)
> >   /* Strip off the first '*' character (and any subsequent 
> > text). */
> >   *(field_name + offset_to_star) = '\0';
> >
> > - arg_type = find_structure (field_name, TYPE_STRUCT);
> > + arg_type = resolve_typedef (field_name, pos);
> > + if (!arg_type)
> > +   arg_type = find_structure (field_name, TYPE_STRUCT);
> > +
> >   arg_type = create_pointer (arg_type);
> > }
> >   else
> >
> > After this patch, gengtype seems to correctly detect vec types,
> > but then I face linking issues. At first, the compiler could not find
> > gt_ggc_mx (vec *v) and gt_pch_mx (vec *v), therefore I implemented
> > them both in gcc/vec.h:
> >
> > diff --git a/gcc/vec.h b/gcc/vec.h
> > index 091056b37bc..dfa744b684e 100644
> > --- a/gcc/vec.h
> > +++ b/gcc/vec.h
> > @@ -1306,6 +1306,15 @@ vec::quick_grow_cleared (unsigned 
> > len)
> >  vec_default_construct (address () + oldlen, growby);
> >  }
> >
> > +template
> > +void
> > +gt_ggc_mx (vec *v)
> > +{
> > +  extern void gt_ggc_mx (T &);
> > +  for (unsigned i = 0; i < v->length (); i++)
> > +gt_ggc_mx ((*v)[i]);
> > +}
> > +
> >  /* Garbage collection support for vec.  */
> >
> >  template
> > @@ -1328,6 +1337,15 @@ gt_ggc_mx (vec *v 
> > ATTRIBUTE_UNUSED)
> >
> >  /* PCH support for vec.  */
> >
> > +template
> > +void
> > +gt_pch_nx (vec *v)
> > +{
> > +  extern void gt_pch_nx (T &);
> > +  for (unsigned i = 0; i < v->length 

Re: Help implementing support for vec in gengtype

2020-04-21 Thread Giuliano Belinassi via Gcc
Hi, Richi

On 04/21, Richard Biener wrote:
> On Mon, Apr 20, 2020 at 11:45 PM Giuliano Belinassi
>  wrote:
> >
> > Hi. Sorry for the late reply.
> >
> > On 03/02, Richard Biener wrote:
> > > On Thu, Feb 27, 2020 at 6:56 PM Giuliano Belinassi
> > >  wrote:
> > > >
> > > > Hi, all.
> > > >
> > > > I am tying to fix an issue with a global variable in the parallel gcc
> > > > project. For this, I am trying to move some global variables from
> > > > tree-ssa-operands to struct function. One of this variable is a
> > > > vec type, and gengtype doesn't look so happy with it.
> > >
> > > I think the solution for this is to not move it to struct function
> > > but instead have it local to function scope - the state is per
> > > GIMPLE stmt we process and abstracting what is
> > > cleaned up in cleanup_build_arrays() into a class we can
> > > instantiate in the two callers is most appropriate.  In theory
> > > all the functions that access the state could move into the
> > > class as methods as well but you can pass down the state
> > > everywhere needed as well.
> >
> > I implemented this strategy, but the issue remains. Therefore, the
> > cause of it must be something else.
> 
> Btw, it would be nice to push those changes as cleanups during
> next stage1.
> 
> > Just to contextualize, in [1], I also implemented parallelism in
> > ParallelGcc using a pipeline method for testing, where I split the set
> > of GIMPLE Intra Procedural Optimizations into multiple sets, and assign
> > each set to a thread.  Then, the function passes through this pipeline.
> >
> > Now, I am trying to make this version pass the testsuite. There is a test
> > in particular ('gcc.dg/20031102-1.c') that I am having difficulties
> > finding the cause of the issue.
> >
> > if I run:
> >
> > /tmp/gcc10_parallel/usr/local/bin/gcc --param=num-threads=2 -O2 -c 
> > 20031102-1.c
> >
> > The crash message is:
> >
> > ```
> >  > type  > align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 
> > 0x7f44334fb000
> > pointer_to_this >
> > used static ignored external VOID :0:0
> > align:8 warn_if_not_align:0>
> >
> > In function ‘main’:
> > cc1: error: virtual use of statement not up to date
> > # VUSE <.MEM_1(D)>
> > _2 = FooBar ();
> > during GIMPLE pass: walloca
> 
> ^^^
> 
> so this pass isn't supposed to change anything which either means you're
> missing some global state in the verify_ssa checker.  Notably the actual
> verification error triggering checks the underlying .MEM symbol (a VAR_DECL).
> Since your message above only shows one var_decl build_vuse must be
> NULL somehow.
> 
> Now it could be that the pass_local_pure_const (the late one) changes 'FooBar'
> to be const which means it wouldn't get a virtual operand anymore.  Looking
> at the testcase that's likely the issue here.
> 
> That's a "tough one" and would ask for the const/pure-ness of call stmts
> to be encoded in the call stmt itself (this issue is also one reason for
> the fixup_cfg passes we have).  So instead of looking at the decl we'd
> track this via a gimple_call_flags () flag and update that from the decl
> at known points (for example when updating SSA operands (sic!) but
> exactly not when just verifying them).
> 
> So for your branch try adding a verifying_p member to the class
> and when verifying instead of
> 
>   /* If aliases have been computed already, add VDEF or VUSE
>  operands for all the symbols that have been found to be
>  call-clobbered.  */
>   if (!(call_flags & ECF_NOVOPS))
> {
>   /* A 'pure' or a 'const' function never call-clobbers anything.  */
>   if (!(call_flags & (ECF_PURE | ECF_CONST)))
> add_virtual_operand (fn, stmt, opf_def);
>   else if (!(call_flags & ECF_CONST))
> add_virtual_operand (fn, stmt, opf_use);
> }
> 
> rely on existing vuse/vdef like
> 
>   if (verifying_p)
> {
>/* ???  Track const/pure/novop-ness in gimple call flags.  */
>if (gimple_vdef (stmt))
> add_virtual_operand (...);
>else if (gimple_vuse (stmt))
> add_virtual_operand (...);
>return;
> }
> 
> and call it a day ;)

That indeed worked! Thank you. This one in particular was really tough!

I will prepare a patch with these changes for trunk, ready for stage1.
There is some unused stuff that I found here that would be nice to have
cleaned up.

I am just curious about how it was working before these changes, since it
seems not to be a race condition. Or probably there is a race condition
hidden somewhere that was triggering it?

Thank you,
Giuliano.

> 
> > cc1: internal compiler error: verify_ssa failed
> > 0xfdb8fe verify_ssa(bool, bool)
> > ../../gcc/gcc/tree-ssa.c:1208
> > 0xcd3d08 execute_function_todo
> > ../../gcc/gcc/passes.c:2017
> > 0xcd49f2 execute_todo
> > ../../gcc/gcc/passes.c:2064
> > Please submit a full bug report,
> > with preprocessed source if appropriate.
> > Please include the complete backtrace with any

Writing automated tests for the GCC driver

2020-05-21 Thread Giuliano Belinassi via Gcc
Hi, all.

GCC has an extensive testsuite, that is no news at all. However, the tests
are focused on the compiler (cc1*) or on libraries, and I can't find tests
related to the GCC driver.

Are there tests for the GCC driver? If yes, is there any documentation
about how to write them?

Thank you,
Giuliano.


Re: Writing automated tests for the GCC driver

2020-05-25 Thread Giuliano Belinassi via Gcc
Hi,

On 05/22, Richard Biener wrote:
> On Thu, May 21, 2020 at 11:00 PM Giuliano Belinassi via Gcc
>  wrote:
> >
> > Hi, all.
> >
> > GCC have a extensive testsuite, that is no news at all. However they are
> > focused on the compiler (cc1*) or in libraries, and I can't find tests
> > related to the GCC driver.
> >
> > Are there tests to the GCC driver? If yes, is there any docs about how
> > to write them?
> 
> I think all tests to the driver eventually exercise a compiler in the end.
> One obvious I can find is gcc.dg/driver-specs.c which tests
> -specs=not-a-file properly diagnoses the missing file.

Yes, but that does not cover all driver features.

> 
> So the question back would be what kind of "driver" tests do you have?
> That is, what makes them not, for example, "C compiler driven by driver" 
> tests?

The GCC driver supports several modes which are required for bootstrapping,
but there is no quick automated test for them. For instance:

gcc a.c b.o -o a.out
gcc a.c b.c
gcc a.S

and so on. So if you make some radical change to the GCC driver, making
sure everything is still correct gets somewhat painful, because you have
to do a clean bootstrap and find out what is going wrong. If we had a
dedicated testsuite for that, bootstrap could be avoided during
development and only done at the end, reducing development time.

Thank you,
Giuliano.

> 
> Thanks,
> Richard.
> 
> > Thank you,
> > Giuliano.


[GSOC] Automatic Parallel Compilation Viability, Report 1

2020-07-02 Thread Giuliano Belinassi via Gcc
Hi,

I am attaching here the first report of the
"Automatic Parallel Compilation Viability" project. Please feel free to
suggest improvements to the project.

The content below is presented in Markdown format, and you can easily
convert it to PDF with pandoc if you find it uncomfortable to read in
the current format.


```
# Automatic Parallel Compilation Viability: First Report

## Complete Tasks

There were two tasks scheduled to be completed in this stage:

1. Update the driver to pass a hidden `-fsplit-outputs` to the compiler,
writing down the path to each assembler file generated by the compiler.

2. Update the compiler to partition the program, and fork on each
partition, generating one asm file per partition. The compiler should
compile some programs correctly at this point.

Both of these tasks were completed at this point, as we have a version
of GCC that bootstraps and shows a small speedup when compiling
insn-emit.c in parallel (with --enable-checking, 42s sequential w/o
partitioning vs. 36s parallel).

## First Implementation

The first working implementation consists of two parts:

- A partitioner designed to handle the Compilation Unit as if it only
  has part of the program information, as opposed to the WPA
  partitioner, which has the entire program callgraph. This imposes some
  restrictions on our partitioner, namely:

1.  Static functions that call each other are mapped to the same
partition to avoid renaming and promotion to globals.

2.  Functions that have references to the same global variable are
mapped into the same partition to avoid multiple definitions of this
variable.

3.  Functions whose address is taken are partitioned together with
whoever took the address and with calls that may receive the
address as an argument.

4. COMDATs should be in the same partition.

- A way to apply a `ltrans_partition` without dumping the partition
  to disk and reloading it. This is done by looking into the
  partitioner generated encoder and releasing information about
  functions that are not in the partition by releasing function
  bodies, dominator tree, entries at clone tree, and setting the
  variables accordingly.

This implementation leads to very unbalanced partitions, as it generates
one really big partition and several very small ones. However, the
partitioner showed to be very fast in practice, with a preliminary
complexity of O(|E| * log\*(|V|)), which is approximately linear in the
number of edges of the graph, as log\*(2^64) = 5. Sure, the worst-case
scenario where the callgraph is the complete graph implies |E| =
|V|\*(|V| - 1)/2, but that looks very synthetic for an input program.

## Ongoing Implementation

This implementation is still a work in progress, but it has shown
significant improvements when compared to the first implementation.
First, we relax condition 1 by not merging calls that were inlined,
and relax condition 2 by not merging functions that reference
non-static global variables.

We managed to get this version to bootstrap by commenting out some
assertions in the ipa-cp optimization pass (probably needed because
`node->local` is incorrectly set in some cases) and by calling
`remove_unreachable_nodes` after applying the partition to the
callgraph, marking nodes inside the partition as `force_output` before
entering that function so there is no risk of them being removed
incorrectly, and unsetting `force_output` afterwards for nodes where it
was not set beforehand.

This methodology yielded a speedup of 2x on a Quad Core Intel Core-i7
8565U CPU when compiling insn-emit.c
(24.509s parallel vs. 45.909s sequential), but no speedup at all when
compiling gimple-match.c. This file looks really problematic for the
current methodology because it has 3 entry points and is stuffed with
static functions.

We are also exploring the possibility of promoting non-inlined static
functions to globals in some cases where it might improve compilation
performance, thereby completely removing restriction 1, although it may
slow down the produced binary on some architectures. Our results
with that restriction relaxed on gimple-match.c showed a 2.12x
improvement over serial compilation (36.184s parallel vs. 1m17s
sequential).

## Conclusions

So far the work has shown to be promising. Compilation of insn-emit.c
is greatly improved with a 2x speedup, although no improvement to
gimple-match.c was observed without relaxing condition 1. Once this
new partitioner bootstraps, we will proceed to implementing support
for the GNU Jobserver.
```
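
As a side note on the report above: the near-linear
O(|E| \* log\*(|V|)) bound is what a union-find (disjoint-set) structure
with path compression would give when merging the endpoints of every
callgraph edge that falls under the restrictions listed in the report.
The following is only a minimal sketch of that idea with hypothetical
toy data, not the actual partitioner code:

```
/* Minimal union-find sketch (hypothetical toy data, not GCC code):
   callgraph nodes whose connecting edge falls under one of the merge
   restrictions end up with the same root, i.e. in the same partition.  */
#include <stdio.h>

#define N 8   /* number of callgraph nodes in this toy example */

static int parent[N];

static int
find (int x)
{
  while (parent[x] != x)
    {
      parent[x] = parent[parent[x]];    /* path halving */
      x = parent[x];
    }
  return x;
}

static void
merge (int a, int b)
{
  parent[find (a)] = find (b);
}

int
main (void)
{
  for (int i = 0; i < N; i++)
    parent[i] = i;

  /* Toy edges (caller, callee) that must share a partition, e.g.
     because both ends are static functions calling each other.  */
  int edges[][2] = { {0, 1}, {1, 2}, {4, 5} };

  for (unsigned i = 0; i < sizeof edges / sizeof edges[0]; i++)
    merge (edges[i][0], edges[i][1]);

  for (int i = 0; i < N; i++)
    printf ("node %d -> partition root %d\n", i, find (i));
  return 0;
}
```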


[GSoC] Automatic Detection of Parallel Compilation Viability, Report 2

2020-07-29 Thread Giuliano Belinassi via Gcc
Hi.

Here I present the second report of the
"Automatic Detection of Parallel Compilation Viability" in Markdown
format.

---

# Automatic Parallel Compilation Viability: Second Report

## Complete Tasks

For the second evaluation, we expected to have a working version of the
project that results in speedups when compiling large files, and a
working integration with the GNU Jobserver. Both of these tasks are
complete, and I will discuss some decisions made to achieve this.

## Compilation Unit Partitioning

In the first implementation, we had a bug where partitioning was done
before the LTRANS stage was executed, implying information loss and
undesired side effects such as bad inlining. This was fixed later
on.

After this, we decided to implement two modes of partitioning:

1. Partition without static promotion. This is the safer method to use,
as we do not modify symbols in the Compilation Unit. This may lead to
speedups in files that have multiple entry points with low
connectivity between them (such as insn-emit.c); however, this will not
provide speedups when this hypothesis is not true (gimple-match.c is an
example of this).

2. Partition with static promotion to global. This is a more aggressive
method, as we can decide to promote some functions to global to increase
the parallelism opportunity. This will also change the final assembler
name of the promoted function to avoid collisions with functions of
other Compilation Units. To use this mode, the user has to manually
specify --param=promote-statics=1, as they must be aware of this.

We also changed the conditions to decide whether a symbol will be in
some partition or not, as implemented in `lto_merge_comdats_map` in
lto-cgraph.c:

1. Find which symbols may have to bring COMDATs to their partition. We do
that by defining a COMDAT frontier and marking all these symbols to be
partitioned together.

2. Find symbols that may be part of other symbols, and mark them to be
partitioned together.

3. If static promotion is disabled, find symbols that reference and
call static functions, and mark them to be partitioned together.

Although condition 1 required some more complicated analysis, this
reduced the code complexity compared to the previous partitioner.


## Jobserver Integration

We implemented an interface to communicate with GNU Make's Jobserver
that is able to detect when the Jobserver is active, thanks to
Nathan Sidwell. This works as follows:

When -fparallel-jobs=jobserver is provided, GCC will try to detect
whether there is a running Jobserver that we can communicate with. If
so, we return the token that Make originally gave to us, then we wait
for Make to provide a new token which, when received, will launch a
forked child process with the partition information; if no Jobserver is
detected, we fall back to the default compilation with a warning. Now,
if -fparallel-jobs=auto is used, this warning will not be emitted and
the default number of jobs will be set to 2, unless a single-core CPU is
detected. For more information about the implementation, check the
gcc/jobserver.cc file in the autopar_devel branch.
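
To make the token handling concrete, here is a minimal sketch of the
classic pipe-based Jobserver handshake. It is a simplified illustration
with almost no error handling, not the code in gcc/jobserver.cc, and it
does not cover the FIFO-based variant offered by newer GNU Make
releases:

```
/* Sketch of acquiring and releasing one GNU Make jobserver token:
   find the read/write descriptors advertised in MAKEFLAGS, block until
   one token (a single byte) can be read, do the work, then write the
   byte back to release the job slot.  */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main (void)
{
  const char *makeflags = getenv ("MAKEFLAGS");
  if (!makeflags)
    return 1;                   /* not running under make */

  /* Newer makes advertise --jobserver-auth=R,W; older --jobserver-fds=R,W.  */
  const char *p = strstr (makeflags, "--jobserver-auth=");
  if (!p)
    p = strstr (makeflags, "--jobserver-fds=");
  if (!p)
    return 1;                   /* jobserver not active */

  int rfd, wfd;
  if (sscanf (strchr (p, '=') + 1, "%d,%d", &rfd, &wfd) != 2)
    return 1;

  char token;
  if (read (rfd, &token, 1) != 1)   /* blocks until a job slot is free */
    return 1;

  /* ... compile one partition here ... */

  write (wfd, &token, 1);           /* give the job slot back to make */
  return 0;
}
```

Note that the recipe invoking the compiler has to be marked with a `+`
in the Makefile; otherwise Make does not pass these descriptors down to
the child process.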

## Current Status

Currently, we have the following scenario:

1. Bootstrap works with static promotion disabled; when it is enabled,
object comparison fails on stage3 because promoted symbol names are
randomized for now. This impacts reproducible builds, and safer methods
must be employed to fix this, such as maybe using the file's SHA hash
as a way to avoid name collisions.

2. Jobserver integration works, but modifications to the Makefile are
necessary. So far, the current branch is not able to bootstrap with
-fparallel-jobs=jobserver because such modifications are required in
GCC's Makefiles. However, we managed to compile Git with this option
enabled, therefore supporting the claim that it works to a certain
level.

3. Speedups ranged from 0.95x to 1.9x on a Quad-Core Intel Core-i7 8565U
when testing with two files in GCC, as shown in the following table.
The result is from a single execution after a previous warm-up
execution. The compiled GCC had checking enabled, and therefore a
release version might have better timings in both sequential and
parallel, but the speedup should remain about the same.

| File           | Sequential | Without Static Promotion | With Static Promotion | Speedup |
|----------------|------------|--------------------------|-----------------------|---------|
| gimple-match.c | 60s        | 63s                      | 34s                   | 1.7x    |
| insn-emit.c    | 37s        | 19s                      | 20s                   | 1.9x    |

4. Supported frontends: C, C++, Fortran. Other frontends can also be
easily supported if they do not require the GCC driver to spawn multiple
processes to compile a single file to assembler.

## How to use this

1. Clone the autopar_devel branch:
```
git clone --single-branch --branch devel/autopar_devel \
  git://gcc.gnu.org/git/gcc.git

[GSoC] Automatic Parallel Compilation Viability -- Final Report

2020-08-28 Thread Giuliano Belinassi via Gcc
Hi,

This is the final report of the "Automatic Parallel Compilation
Viability" project.  Please notice that this report is pretty
similar to the one delivered for the 2nd evaluation, as this phase
consisted mostly of rebasing and bug fixing.

Please reply to this message with any questions or suggestions.

Thank you,
Giuliano.

--- 8< ---

# Automatic Parallel Compilation Viability: Final Report

## Complete Tasks

For the third evaluation, we expected to deliver the product as a
series of patches for trunk.  The patch series was in fact delivered
[1], but several items must be fixed before merging.


Overall, the project works, and speedups range from 0.95x to 3.3x.
Bootstrap is working, and therefore this can be used in an experimental
state.

## How to use

1. Clone the autopar_devel branch:
```
git clone --single-branch --branch devel/autopar_devel \
  git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
```
2. Follow the standard compilation options provided in the Compiling
GCC page, and install it in some directory. For instance:

```
cd gcc_autopar_devel
mkdir build && cd build
../configure --disable-bootstrap --enable-languages=c,c++
make -j 8
make DESTDIR=/tmp/gcc11_autopar install
```

3. If you want to test whether your version is working, just launch
GCC with `-fparallel-jobs=2` when compiling a file with -c.

4. If you want to compile a project that uses GNU Makefiles with this
version, you must modify the compilation rule command and prepend a
`+` token to it. For example, in Git's Makefile, change:
```
$(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) $(EXTRA_CPPFLAGS) $<
```
to:
```
$(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
+$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) $(EXTRA_CPPFLAGS) $<
```
as well as point the CC variable to the installed gcc, and
append `-fparallel-jobs=jobserver` to your CFLAGS variable. (The `+`
token tells Make to treat the command as a recursive invocation, so the
Jobserver file descriptors are passed down to the compiler.)

## How the parallelism works in this project

In LTO, the Whole Program Analysis decides how to partition the
callgraph for running the LTRANS stage in parallel.  This project
works very similarly, however with some changes.

The first was to modify the LTO structure so that it accepts
compilation without streaming the IR to files.  This avoids IO
overhead when compiling in parallel.

The second was to use a custom partitioner to find which nodes
should be in the same partition.  This was mainly done to bring COMDATs
together, as well as symbols that are part of other symbols, and even
private symbols, so that we do not output hidden global symbols.

However, experiments showed that bringing private symbols together did
not yield an interesting speedup on some large files, and therefore
we implemented two modes of partitioning:

1. Partition without static promotion. This is the safer method to use,
as we do not modify symbols in the Compilation Unit. This may lead to
speedups in files that have multiple entry points with low
connectivity between them (such as insn-emit.c); however, this will not
provide speedups when this hypothesis is not true (gimple-match.c is an
example of this). This is the default mode.

2. Partition with static promotion to global. This is a more aggressive
method, as we can decide to promote some functions to global to increase
the parallelism opportunity. This will also change the final assembler
name of the promoted function to avoid collisions with functions of
other Compilation Units. To use this mode, the user has to manually
specify --param=promote-statics=1, as they must be aware of this.

Currently, partitioning mode 2 does not account for the number of nodes
to be promoted.  Implementing this will certainly reduce the impact on
the produced code.

## Jobserver Integration

We implemented an interface to communicate with GNU Make's Jobserver
that is able to detect when the Jobserver is active, thanks to
Nathan Sidwell. This works as follows:

When -fparallel-jobs=jobserver is provided, GCC will try to detect
whether there is a running Jobserver that we can communicate with. If
so, we return the token that Make originally gave to us, then we wait
for Make to provide a new token which, when received, will launch a
forked child process with the partition information; if no Jobserver is
detected, we fall back to the default compilation with a warning. Now,
if -fparallel-jobs=auto is used, this warning will not be emitted and
the default number of jobs will be set to 2, unless a single-core CPU is
detected. For more information about the implementation, check the
gcc/jobserver.cc file in the autopar_devel branch.

## Speedups

Speedups ranged from 0.95x to 1.9x on a Quad-Core Intel Core-i7 8565U,
and from 0.99x to 3.3x when testing on a 32-core AMD Opteron 6376.
As a comparison, the theoretical maximum speedup from parallelizing this
part of the compilation is 4x, so there are things that can be improved
here.
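
The 4x figure follows from Amdahl's law, assuming (as measured earlier
in ParallelGcc) that the parallelized intra procedural optimizations
account for roughly $p = 0.75$ of the compilation time:

$$S_{\max} = \lim_{n \to \infty} \frac{1}{(1 - p) + p/n} = \frac{1}{1 - p} = \frac{1}{1 - 0.75} = 4.$$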

Results are shown in the following table. There are three columns:

* Sequential, which means runn

Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

2020-08-31 Thread Giuliano Belinassi via Gcc
Hi, Richi.

On 08/31, Richard Biener wrote:
> On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi
>  wrote:
> >
> > Hi,
> >
> > This is the final report of the "Automatic Parallel Compilation
> > Viability" project.  Please notice that this report is pretty
> > similar to the delivered from the 2nd evaluation, as this phase
> > consisted of mostly rebasing and bug fixing.
> >
> > Please, reply this message for any question or suggestion.
> 
> Thank you for your great work Giuliano!

Thank you :)

> 
> It's odd that LTO emulated parallelism is winning here,
> I'd have expected it to be slower.  One factor might
> be different partitioning choices and the other might
> be that the I/O required is faster than the GC induced
> COW overhead after forking.  Note you can optimize
> one COW operation by re-using the main process for
> compiling the last partition.  I suppose you tested
> this on a system with a fast SSD so I/O overhead is
> small?

The Core-i7 machine runs on a fast NVMe SSD.  The Opteron machine seems
to run on a RAID setup with conventional HDDs, but I am not sure how it
is configured, as I have no physical access to that machine.

Thank you,
Giuliano

> 
> Thanks again,
> Richard.
> 
> > Thank you,
> > Giuliano.
> >
> > --- 8< ---
> >
> > # Automatic Parallel Compilation Viability: Final Report
> >
> > ## Complete Tasks
> >
> > For the third evaluation, we expected to deliver the product as a
> > series of patches for trunk.  The patch series were in fact delivered
> > [1], but several items must be fixed before merge.
> >
> >
> > Overall, the project works and speedups ranges from 0.95x to 3.3x.
> > Bootstrap is working, and therefore this can be used in an experimental
> > state.
> >
> > ## How to use
> >
> > 1. Clone the autopar_devel branch:
> > ```
> > git clone --single-branch --branch devel/autopar_devel \
> >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> > ```
> > 2. Follow the standard compilation options provided in the Compiling
> > GCC page, and install it on some directory. For instance:
> >
> > ```
> > cd gcc_autopar_devel
> > mkdir build && cd build
> > ../configure --disable-bootstrap --enable-languages=c,c++
> > make -j 8
> > make DESTDIR=/tmp/gcc11_autopar install
> > ```
> >
> > 3. If you want to test whether your version is working, just launch
> > Gcc with `-fparallel-jobs=2` when compiling a file with -c.
> >
> > 5. If you want to compile a project with this version it uses GNU
> > Makefiles, you must modify the compilation rule command and prepend a
> > `+` token to it. For example, in Git's Makefile, Change:
> > ```
> > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > $(EXTRA_CPPFLAGS) $<
> > ```
> > to:
> > ```
> > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > $(EXTRA_CPPFLAGS) $<
> > ```
> > as well as point the CC variable to the installed gcc, and
> > append a `-fparallel-jobs=jobserver` on your CFLAGS variable.
> >
> > # How the parallelism works in this project
> >
> > In LTO, the Whole Program Analysis decides how to partition the
> > callgraph for running the LTRANS stage in parallel.  This project
> > works very similar to this, however with some changes.
> >
> > The first was to modify the LTO structure so that it accepts
> > the compilation without IR streaming to files.  This avoid an IO
> > overhead when compiling in parallel.
> >
> > The second was to use a custom partitioner to find which nodes
> > should be in the same partition.  This was mainly done to bring COMDAT
> > together, as well as symbols that are part of other symbols, and even
> > private symbols so that we do not output hidden global symbols.
> >
> > However, experiment showed that bringing private symbols together did
> > not yield a interesting speedup on some large files, and therefore
> > we implemented two modes of partitioning:
> >
> > 1. Partition without static promotion. This is the safer method to use,
> > as we do not modify symbols in the Compilation Unit. This may lead to
> > speedups in files that have multiple entries points with low
> > connectivity between then (such as insn-emit.c), however this will not
> > provide speedups when this hypothesis is not true (gimple-match.c is an
> > example of this). This is the default mode.
> >
> > 2. Partition with static promotion to global. This is a more aggressive
> > method, as we can decide to promote some functions to global to increase
> > parallelism opportunity. This also will change the final assembler name
> > of the promoted function to avoid collision with functions of others
> > Compilation Units. To use this mode, the user has to manually specify
> > --param=promote-statics=1, as they must be aware of this.
> >
> > Currently, partitioner 2. do not account the number of nodes to be
> > promoted.  Implementing this certainly will reduce impact on produced
> > code.

Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

2020-08-31 Thread Giuliano Belinassi via Gcc
Hi, Richi.

On 08/31, Richard Biener wrote:
> On Mon, Aug 31, 2020 at 1:15 PM Jan Hubicka  wrote:
> >
> > > On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > This is the final report of the "Automatic Parallel Compilation
> > > > Viability" project.  Please notice that this report is pretty
> > > > similar to the one delivered for the 2nd evaluation, as this phase
> > > > consisted mostly of rebasing and bug fixing.
> > > >
> > > > Please reply to this message with any questions or suggestions.
> > >
> > > Thank you for your great work Giuliano!
> >
> > Indeed, it is quite amazing work :)
> > >
> > > It's odd that LTO emulated parallelism is winning here,
> > > I'd have expected it to be slower.  One factor might
> > > be different partitioning choices and the other might
> > > be that the I/O required is faster than the GC induced
> > > COW overhead after forking.  Note you can optimize
> > > one COW operation by re-using the main process for
> > > compiling the last partition.  I suppose you tested
> > > this on a system with a fast SSD so I/O overhead is
> > > small?
> >
> > At the time I implemented fork-based parallelism for WPA (which I think
> > we could recover by generalizing Giuliano's patches a bit), I had the
> > same outcome: forked ltranses were simply running slower than those
> > after streaming.  This was, however, tested on Firefox, I estimate
> > sometime around 2013.  I never tried it on units comparable to
> > insn-emit (which would be different at that time anyway).  I was mostly
> > aiming to get it fully transparent with streaming but never quite
> > finished it since, at that time, I thought time was better spent on
> > optimizing the LTO data layout.
> >
> > I suppose we want to keep both mechanisms in both WPA and normal
> > compilation and let the compiler choose the fitting one.
> 
> I repeated Giuliano's experiment on gimple-match.ii and
> producing LTO bytecode takes 5.3s and the link step
> 9.5s with two jobs, 6.6s with three and 5.0s with four
> and 2.4s with 32.
> 
> With -fparallel-jobs=N and --param promote-statics=1 I see
> 14.8s, 13.9s and 13.5s here (for two, three and four jobs).
> With 8 jobs the reduction is to 11s.
> 
> It looks like LTO much more aggressively partitions
> this - I see 36 partitions generated for gimple-match.c
> while -fparallel-jobs creates "only" 27.  -fparallel-jobs
> doesn't seem to honor the various lto-partition
> --params btw?  The relevant ones would be
> --param lto-partitions (the max. number of partitions
> to generate) and --param lto-min-partition
> (the minimum size for a partition).  I always thought
> that lto-min-partition should be higher for
> -fparallel-jobs (which I envisioned to be enabled by
> default).

There is a partition balancing mechanism that can be disabled
with --param=balance-partitions=0.

Assuming that you used -fparallel-jobs=32, it may be possible
that it merged small partitions until it reached the average
size of max_size / 33, which resulted in 27 partitions.

The only LTO parameter that I use is --param=lto-min-partition,
which controls the minimum size at which it will run
in parallel.
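
As an illustration only (the parameter values below are placeholders,
not measured recommendations), these knobs combine on the command line
like this:
```
gcc -O2 -c gimple-match.c -fparallel-jobs=32 \
    --param=promote-statics=1 \
    --param=balance-partitions=0 \
    --param=lto-min-partition=10000
```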

> 
> I guess before investigating the current state in detail
> it might be worth exploring Honza's wish of sharing
> the actual partitioning code between LTO and -fparallel-jobs.
> 
> Note that larger objects take a bigger hit from the GC COW
> issue so at some point that becomes dominant because the
> first GC walk for each partition is the same as doing a GC
> walk for the whole object.  Eventually it makes sense to
> turn off GC completely for smaller partitions.

Just a side note: I added a ggc_collect () call before starting to fork
and it did not improve things.

Thank you,
Giuliano.

> 
> Richard.
> 
> > Honza
> > >
> > > Thanks again,
> > > Richard.
> > >
> > > > Thank you,
> > > > Giuliano.
> > > >
> > > > --- 8< ---
> > > >
> > > > # Automatic Parallel Compilation Viability: Final Report
> > > >
> > > > ## Complete Tasks
> > > >
> > > > For the third evaluation, we expected to deliver the product as a
> > > > series of patches for trunk.  The patch series was in fact delivered
> > > > [1], but several items must be fixed before it can be merged.
> > > >
> > > >
> > > > Overall, the project works and speedups range from 0.95x to 3.3x.
> > > > Bootstrap is working, and therefore this can already be used in an
> > > > experimental state.
> > > >
> > > > ## How to use
> > > >
> > > > 1. Clone the autopar_devel branch:
> > > > ```
> > > > git clone --single-branch --branch devel/autopar_devel \
> > > >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> > > > ```
> > > > 2. Follow the standard build instructions provided on the Compiling
> > > > GCC page, and install it into some directory. For instance:
> > > >
> > > > ```
> > > > cd gcc_autopar_devel
> > > > mkdir build && cd build
> > > > ../configure --disable-bootstrap --enable-languages=c,c++
> > > > make -j 8
> > > > make DESTDIR=/tmp/gcc11_autopar install
> > > > ```
> > > >
> >

Re: [PATCH] Try LTO partial linking. (Was: Speed of compiling gimple-match.c)

2021-06-12 Thread Giuliano Belinassi via Gcc
Hi, all.

Please CC me when I am mentioned in a mail.

On Thu, 2021-05-20 at 15:16 +0200, Richard Biener via Gcc wrote:
> On Thu, May 20, 2021 at 3:06 PM Martin Liška  wrote:
> > 
> > On 5/20/21 2:54 PM, Richard Biener wrote:
> > > So why did you go from applying this per-file to multiple files?
> > 
> > When I did per-file for {gimple,generic}-match.c I hit the
> > following issue with lto.priv symbols:
> > 
> > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: error: libbackend.a(generic-match.o): multiple definition of 'wi::to_wide(tree_node const*) [clone .part.0] [clone .lto_priv.0]'
> > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: libbackend.a(gimple-match.o): previous definition here
> > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: error: libbackend.a(generic-match.o): multiple definition of 'TYPE_VECTOR_SUBPARTS(tree_node const*) [clone .part.0] [clone .lto_priv.0]'
> > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: libbackend.a(gimple-match.o): previous definition here
> > /usr/lib64/gcc/x86_64-suse-linux/10/../../../../x86_64-suse-linux/bin/ld: error: libbackend.a(generic-match.o): multiple definition of 'vec > vl_embed>::operator[](unsigned int) [clone .part.0] [clone .lto_priv.0]'
> > 
> > Any idea what I was doing wrong?
> 
> Nothing in particular I think - you're just hitting the issue that LTO
> produces new symbols and that those can obviously clash.  Giuliano hit
> the very same issue.  When not doing partial links those internal
> symbols pose no problem, but with -r -flinker-output=nolto-rel and
> re-linking the produced objects they obviously do.  ELF has no solution
> for this though, but I think we could strip those from the partially
> linked object - if WPA would give us a list of objects the link step
> could postprocess the object with objcopy, or maybe a custom linker
> script could do the trick as well.

I've "fixed" this issue in my branch by mangling any promoted to public
symbol. I've also disabled the "ipa-split" pass in the paper branch
because of some created symbols which I was not able to fix in time.
Perhaps this goes away if you disable it.

Perhaps we should work on getting the autopar branch merged into trunk.
There are several issues which must be fixed, and I don't think it will
be ready for the next release. The main ones that I remember off the
top of my head:

1- Fix the driver to use the SPEC language for the multiple required
calls to `as`, instead of injecting code for that directly into the
`void execute()` function.

2- Replace my custom partitioner with the default LTO partitioner. The
default LTO partitioner was hitting assertions about COMDAT groups
being split into multiple partitions.

3- Fix the issue with the ipa-split pass.

Perhaps we should further explore avoiding partial linking altogether
and concatenating the assembler files.

Thank you,
Giuliano.

> 
> So your workaround is to only ever have a single LTO produced object
> file participating in the
> final links ;)
> 
> Richard.
> 
> > 
> > Martin