Parallelize GCC with Threads -- First Evaluation

Hi everyone,

I am attaching the first evaluation report here publicly to gather
feedback. The file is in Markdown format and can easily be converted to
PDF for better visualization.

I am also open to suggestions and ideas in order to improve the current project 
:-)

My branch can be seen here: https://gitlab.com/flusp/gcc/tree/giulianob_parallel

Giuliano
# Parallelize GCC with Threads: First Evaluation Report


## Tasks accomplished until the First Evaluation

These are the tasks I have accomplished so far:

  - Refactored `cgraph_node::expand ()` and `expand_all_functions`, splitting
the IPA, GIMPLE, and RTL passes.

  - Documented the global states of the GIMPLE passes.

This accomplishes all items planned for the first evaluation.

Here I explain some technical challenges I faced and the architectural
decisions I made.


## Chapter 1 -- Splitting `node::expand`

This chapter discusses the technical issues found when splitting
`node::expand` into `node::expand_ipa`, `node::expand`, and `node::expand_rtl`.

The first step was to split `all_passes` into `all_passes` and
`all_rtl_passes`. This was straightforward: I had to modify the `passes.def`
file, add a new pass list there, and call the constructor in `passes.c`.
Unfortunately, I could not reuse `pass_rest_of_compilation`, as suggested by
Richi, so I created a new pass.

The next step was to split `node::expand` into `node::expand_ipa`,
`node::expand`, and `node::expand_rtl`. These names follow this distinction:

  - `expand_ipa`: runs the interprocedural analysis (IPA) passes.
  - `expand`: runs the intraprocedural GIMPLE passes (`all_passes`).
  - `expand_rtl`: runs the intraprocedural RTL passes (`all_rtl_passes`).

This was somewhat straightforward, since I just had to store the
`location_t input_location` in the `struct function`.

Unfortunately, executing `node::expand_ipa` on every function before
`node::expand`, and `node::expand_rtl` on every function after `node::expand`,
was much harder, since I had to modify many things, such as:

   1. The `default_obstack`: there is a pass called `pass_fixup_cfg` which is
      run both in `pass_local_optimization_passes` and in `all_passes`. This
      pass uses a globally shared object called `default_obstack`, which is
      modified every time it runs on a new function. The solution was to
      create a local obstack for each function, storing it in the `struct
      function`. Unfortunately, this broke several tests related to IPA, which
      led me to define a new compiler state called `EXPAND_IPA`, then select
      the default obstack when the compiler is in this state, or the local
      obstack when it is in the `EXPAND` state, fixing the issue.

   2. Adjusting the cfun stack: push the cfun before executing the passes and
      pop it when they finish. I did this for `node::expand_ipa`,
      `node::expand`, and `node::expand_rtl`.

   3. Setting `gimple_register_cfg_hooks` when `node::expand` is called.

   4. Storing the `input_location` in the `struct function`.

This allowed me to refactor the loop in `expand_all_functions`:
```
    FOR_EACH_CGRAPH_NODE (node) {
        node->expand_ipa ();
        node->expand ();
        node->expand_rtl ();
    }
```
into:
```
    FOR_EACH_CGRAPH_NODE(node)
        node->expand_ipa();
    FOR_EACH_CGRAPH_NODE(node)
        node->expand();
    FOR_EACH_CGRAPH_NODE(node)
        node->expand_rtl();
```
This enables me to proceed with the parallelization.


## Chapter 2 -- Parallelization Decisions

After the work described in Chapter 1, I started working on how to parallelize
the compiler. Richi suggested a pipelined architecture: split the passes
across threads and pass each cfun along the pipeline. When one thread finishes
executing its pass on a cfun, it sends the cfun to the next thread, which
holds the next pass, until all passes are done. This has the advantage of not
requiring a per-pass analysis of global states, but it has the disadvantage of
being slower (just imagine some passes being slower than all the others,
creating a bottleneck). What I propose instead is to compile functions in
parallel. Since at this point we are executing intraprocedural optimizations,
this is theoretically possible. The issue is that this is much harder to
implement, since it requires looking into the global states of every local
pass.

With this in mind, I created a `worker_set` structure that implements a
producer-consumer thread-safe queue: one thread (the master) inserts all cfuns
into the queue, and the other threads (the workers) remove cfuns from the
queue and process them. When a thread finishes its work, it automatically
removes the next function from the queue, until it finds a NULL function,
which indicates that there is no more work in the queue. Regarding
implementation details, I am using pthreads and POSIX semaphores.

After this structure was finished, I measured the time spent inserting and
removing 2000 functions (more than the ~1700 functions in gimple-match.c):
about 0.1 seconds. I also ran a stress test to ensure correctness.

Then I started mapping the global variables. To find them, I am using the
following methodology:

   1. Set the number of threads to 2.
   2. Enable a barrier in passes.c, trying to force one pass to run
      simultaneously in all threads.
   3. Wait for the compiler to crash.
   4. Look at the crash backtrace for references to static variables.
   5. Add `__thread` to the variable.
   6. Set the number of threads back to 1.
   7. Run the tests. If they pass, go to 1; else, look more carefully
      into why it crashed.

With this methodology, I have mapped the following global variables:

|       file name            |                 variable                      |
|----------------------------|-----------------------------------------------|
|    function.c              |     cfg_hooks                                 |
|    function.c              |     cfun                                      |
|    function.c              |     cfun_stack                                |
|    function.c              |     parse_alignment_opts () // ignored        |
|    function.c              |     targetm // backend dependency, ignored    |
|                            |                                               |
|    input.c                 |     input_location                            |
|                            |                                               |
|    toplev.c                |     current_function_decl                     |
|    toplev.c                |     current_function_func_begin_label         |
|                            |                                               |
|    tree-cfgcleanup.c       |     cfgcleanup_altered_bbs                    |
|                            |                                               |
|    fold-const.c            |     fold_deferring_overflow_warnings          |
|    fold-const.c            |     fold_deferred_overflow_warning            |
|    fold-const.c            |     warn_strict_overflow_code                 |
|                            |                                               |
|    tree-ssa-operands.c     |     build_uses                                |
|    tree-ssa-operands.c     |     build_vdef                                |
|    tree-ssa-operands.c     |     build_vuse                                |
|    tree-ssa-operands.c     |     operands_bitmap_obstack                   |
|    tree-ssa-operands.c     |     n_initialized                             |
|                            |                                               |
|    tree-ssa-propagate.c    |     cfg_blocks                                |
|    tree-ssa-propagate.c    |     cfg_blocks_back                           |
|    tree-ssa-propagate.c    |     bb_to_cfg_order                           |
|    tree-ssa-propagate.c    |     cfg_order_to_bb                           |
|    tree-ssa-propagate.c    |     ssa_edge_worklist                         |
|    tree-ssa-propagate.c    |     ssa_edge_worklist_back                    |
|    tree-ssa-propagate.c    |     uid_to_stmt                               |
|    tree-ssa-propagate.c    |     curr_order                                |
|                            |                                               |
|    tree-cfg.c              |     edge_to_cases                             |
|    tree-cfg.c              |     touched_switch_bbs                        |
|    tree-cfg.c              |     cfg_stats                                 |
|    tree-cfg.c              |     discriminator_per_locus                   |
|                            |                                               |
|    tree-ssa-ccp.c          |     const_val                                 |
|    tree-ssa-ccp.c          |     n_const_val                               |
|                            |                                               |
|    tree-ssa-forwprop.c     |     cfg_changed                               |
|    tree-ssa-forwprop.c     |     to_purge                                  |
|    tree-ssa-forwprop.c     |     lattice                                   |
|                            |                                               |
|    et-forest.c             |     et_nodes                                  |
|    et-forest.c             |     et_occurrences                            |
|                            |                                               |
|    optabs-query.c          |     this_fn_optabs                            |
|    optabs-query.c          |     this_target_optabs                        |
|                            |                                               |
|    tree-ssa-structalias.c  |     use_field_sensitive                       |
|    tree-ssa-structalias.c  |     predbitmap_obstack                        |
|    tree-ssa-structalias.c  |     oldpta_obstack                            |
|    tree-ssa-structalias.c  |     iteration_obstack                         |
|    tree-ssa-structalias.c  |     constraint_stats                          |
|    tree-ssa-structalias.c  |     variable_pool_info                        |
|    tree-ssa-structalias.c  |     final_solutions                           |
|    tree-ssa-structalias.c  |     final_solutions_obstack                   |
|    tree-ssa-structalias.c  |     varmap                                    |
|    tree-ssa-structalias.c  |     call_stmt_vars                            |
|    tree-ssa-structalias.c  |     constraints                               |
|    tree-ssa-structalias.c  |     constraint_pool                           |
|    tree-ssa-structalias.c  |     graph                                     |
|    tree-ssa-structalias.c  |     pointer_equiv_class_table                 |
|    tree-ssa-structalias.c  |     location_equiv_class_table                |
|    tree-ssa-structalias.c  |     pointer_equiv_class                       |
|    tree-ssa-structalias.c  |     vi_for_tree                               |
|    tree-ssa-structalias.c  |     shared_bitmap_table                       |


Of course, this methodology will eventually fail. When that happens, I will
use Helgrind to check for race conditions on specific variables. The point is
that Helgrind is slow, and therefore I am currently trying to avoid it to
save time.

## Chapter 3: Current Issues

Here I report the current issues I am facing with the project. There are
mainly two:

1. The GCC `object_pool_allocator`

    The GCC `object_pool_allocator` is used to allocate certain objects. Since
    these objects may be used later in the compilation by other threads, we
    cannot simply make them private to each thread. Therefore I added a
    `threadsafe_object_pool_allocator` object that currently uses locks to
    guarantee safety; however, I have not yet been able to verify its
    correctness. This approach is also not efficient and might require a
    better solution later.

2. The `default_obstack` in `bitmap.c`

    As mentioned before, there is a default obstack which seems to be used in
    many parts of the compiler. When trying to make it local to each thread, I
    am facing crashes in the LTO infrastructure, with several tests failing. I
    could not find the reason for the failure yet. This is the main issue
    preventing me from proceeding.

