Parallelize GCC with Threads -- First Evaluation
Hi everyone,

I am attaching the first evaluation report here publicly to gather feedback. The file is in markdown format and can easily be converted to PDF for better visualization. I am also open to suggestions and ideas to improve the current project :-)

My branch can be seen here: https://gitlab.com/flusp/gcc/tree/giulianob_parallel

Giuliano
# Parallelize GCC with Threads: First Evaluation Report

## Tasks accomplished until the First Evaluation

These are the tasks which I have accomplished so far:

- Refactored `cgraph_node::expand()` and `expand_all_functions`, splitting the IPA, GIMPLE, and RTL passes
- Wrote some documentation about the global states of the GIMPLE passes

This accomplishes all items planned for the first evaluation. Below I explain some technical challenges I ran into and the architectural decisions I made.

## Chapter 1 -- Splitting `node::expand`

This chapter discusses the technical issues found when splitting `node::expand` into `node::expand_ipa`, `node::expand`, and `node::expand_rtl`.

The first step was to split `all_passes` into `all_passes` and `all_rtl_passes`. This was straightforward: I had to modify the `passes.def` file, add a new pass list there, and call the constructor in `passes.c`. Unfortunately I could not reuse `pass_rest_of_compilation`, as suggested by Richi, so I created a new pass.

The next step was to split `node::expand` into `node::expand_ipa`, `node::expand`, and `node::expand_rtl`. These names follow this distinction:

- `expand_ipa`: the Interprocedural Analysis (IPA) passes.
- `expand`: the intraprocedural GIMPLE passes (`all_passes`).
- `expand_rtl`: the intraprocedural RTL passes (`all_rtl_passes`).

This was somewhat straightforward, since I just had to store the `location_t input_location` in the `struct function`. Unfortunately, executing `node::expand_ipa` on every function before `node::expand`, and `node::expand_rtl` on every function after `node::expand`, was much harder, since I had to modify lots of things, such as:

1. The `default_obstack`: there is a pass called `pass_fixup_cfg` which is called in both `pass_local_optimization_passes` and `all_passes`. This pass uses a global shared object called `default_obstack`, which is modified every time it runs into a new function.
   The solution was to create a local obstack for each function, storing it in the `struct function`. Unfortunately, this broke several tests related to IPA, which led me to define a new compiler state called `EXPAND_IPA`, then select the default obstack when the compiler is in this state, or the local obstack when it is in the `EXPAND` state, fixing the issue.

2. Adjusting the cfun stack to push the cfun before executing the passes and pop it when they finish. I did this for `node::expand_ipa`, `node::expand`, and `node::expand_rtl`.

3. Setting the `gimple_register_cfg_hooks` when `node::expand` is called.

4. Storing the `input_location` in the `struct function`.

This allowed me to refactor the loop in `expand_all_functions`:

```
FOR_EACH_CGRAPH_NODE (node)
  {
    node->expand_ipa ();
    node->expand ();
    node->expand_rtl ();
  }
```

into:

```
FOR_EACH_CGRAPH_NODE (node)
  node->expand_ipa ();
FOR_EACH_CGRAPH_NODE (node)
  node->expand ();
FOR_EACH_CGRAPH_NODE (node)
  node->expand_rtl ();
```

enabling me to proceed with the parallelization.

## Chapter 2 -- Parallelization Decisions

After the work done in Chapter 1, I started working on how to parallelize the compiler. Richi suggested a pipelined architecture: split the passes into threads, and pass the cfun across them. When one thread finishes executing its pass on a cfun, it sends the cfun to the next thread, which holds the next pass, until all passes are done. This has the advantage of not requiring a per-pass analysis of global states, but has the disadvantage of being slower (just imagine some passes being slower than all the others: the pipeline's throughput is bounded by its slowest stage).

What I propose instead is to compile functions in parallel. Since at this point we are executing intraprocedural optimizations, this is theoretically possible. The issue is that this is much harder to implement, since it requires looking into the global state of every local pass.
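Compiling functions in parallel requires safely distributing cfuns among worker threads. Below is a minimal self-contained sketch of such a producer-consumer queue using pthreads and POSIX semaphores; all names and the layout are illustrative, not the actual patch (the work items stand in for cfuns as plain integers):

```cpp
#include <pthread.h>
#include <semaphore.h>
#include <queue>
#include <cstdint>

// Illustrative sketch of a worker_set-style threadsafe queue:
// a semaphore counts available items, a mutex guards the queue.
struct worker_set
{
  std::queue<void *> q;
  pthread_mutex_t lock;
  sem_t items;                       // counts entries in the queue

  worker_set ()
  {
    pthread_mutex_init (&lock, NULL);
    sem_init (&items, 0, 0);
  }

  void push (void *fn)
  {
    pthread_mutex_lock (&lock);
    q.push (fn);
    pthread_mutex_unlock (&lock);
    sem_post (&items);               // wake one waiting worker
  }

  void *pop ()                       // blocks until an entry is available
  {
    sem_wait (&items);
    pthread_mutex_lock (&lock);
    void *fn = q.front ();
    q.pop ();
    pthread_mutex_unlock (&lock);
    return fn;
  }
};

static worker_set ws;
static long total;
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

// Worker loop: process items until the NULL sentinel arrives.
static void *worker (void *)
{
  for (;;)
    {
      void *item = ws.pop ();
      if (!item)
        return NULL;
      pthread_mutex_lock (&total_lock);
      total += (intptr_t) item;      // stand-in for running the passes
      pthread_mutex_unlock (&total_lock);
    }
}

// Master: enqueue all work, then one NULL per worker so each exits.
static long demo (int nworkers)
{
  total = 0;
  pthread_t tids[8];
  for (int i = 0; i < nworkers; i++)
    pthread_create (&tids[i], NULL, worker, NULL);
  for (long v = 1; v <= 4; v++)
    ws.push ((void *) (intptr_t) v);
  for (int i = 0; i < nworkers; i++)
    ws.push (NULL);
  for (int i = 0; i < nworkers; i++)
    pthread_join (tids[i], NULL);
  return total;                      // 1 + 2 + 3 + 4
}
```

Pushing one NULL per worker guarantees that every blocked thread wakes up exactly once to observe the sentinel and exit.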
With this in mind, I created a `worker_set` structure that implements a producer-consumer threadsafe queue: one thread (the master) inserts all cfuns into the queue, and the other threads (the workers) are responsible for removing cfuns from the queue and processing them. When a thread finishes its work, it automatically removes the next function from the queue, until it finds a NULL function, which indicates that there is no more work in the queue. Regarding implementation details, I am using pthreads and Unix semaphores.

After this structure was finished, I measured the time spent inserting and removing 2000 functions (which is more than the ~1700 functions in gimple-match.c), obtaining 0.1 seconds, and ran a stress test to ensure correctness.

Then I started mapping the global variables. To find them, I am using the following methodology:

1. Set the number of threads to 2.
2. Enable a barrier in `passes.c`, trying to force one pass to run simultaneously in all threads.
3. Wait for the compiler to crash.
4. Look at the crash backtrace for references to static variables.
5. Add `__thread` to the variable.
6. Set the number of threads to 1.
7. Run the tests. If they pass, GOTO 1; else take a more careful look at why it crashed.
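Step 5 of the methodology relies on thread-local storage: marking a formerly shared global with `__thread` gives each thread its own copy, making the data race disappear. A minimal self-contained illustration (the variable name is a hypothetical stand-in, not a real GCC global):

```cpp
#include <pthread.h>
#include <cstdint>

// Hypothetical stand-in for a formerly shared compiler global.
// With __thread, each thread gets a private copy of the variable.
static __thread long fake_global;

static void *set_and_read (void *arg)
{
  fake_global = (long) (intptr_t) arg;
  // No lock needed: this thread's write is invisible to other threads.
  return (void *) (intptr_t) fake_global;
}

// Returns true iff both threads kept their own private value.
static bool tls_demo ()
{
  pthread_t a, b;
  void *ra, *rb;
  pthread_create (&a, NULL, set_and_read, (void *) 1);
  pthread_create (&b, NULL, set_and_read, (void *) 2);
  pthread_join (a, &ra);
  pthread_join (b, &rb);
  return (intptr_t) ra == 1 && (intptr_t) rb == 2;
}
```

Note that this only works for state that really is per-function/per-thread; state that must be shared across the compilation (such as pool-allocated objects, see Chapter 3) needs locking instead.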
With this methodology, I have mapped the following global variables:

| file name | variable |
|----------------------------|-----------------------------------------------|
| function.c | cfg_hooks |
| function.c | cfun |
| function.c | cfun_stack |
| function.c | parse_alignment_opts () // ignored |
| function.c | targetm // backend dependency, ignored |
| input.c | input_location |
| toplev.c | current_function_decl |
| toplev.c | current_function_func_begin_label |
| tree-cfgcleanup.c | cfgcleanup_altered_bbs |
| fold-const.c | fold_deferring_overflow_warnings |
| fold-const.c | fold_deferred_overflow_warning |
| fold-const.c | warn_strict_overflow_code |
| tree-ssa-operands.c | build_uses |
| tree-ssa-operands.c | build_vdef |
| tree-ssa-operands.c | build_vuse |
| tree-ssa-operands.c | operands_bitmap_obstack |
| tree-ssa-operands.c | n_initialized |
| tree-ssa-propagate.c | cfg_blocks |
| tree-ssa-propagate.c | cfg_blocks_back |
| tree-ssa-propagate.c | bb_to_cfg_order |
| tree-ssa-propagate.c | cfg_order_to_bb |
| tree-ssa-propagate.c | ssa_edge_worklist |
| tree-ssa-propagate.c | ssa_edge_worklist_back |
| tree-ssa-propagate.c | uid_to_stmt |
| tree-ssa-propagate.c | curr_order |
| tree-cfg.c | edge_to_cases |
| tree-cfg.c | touched_switch_bbs |
| tree-cfg.c | cfg_stats |
| tree-cfg.c | discriminator_per_locus |
| tree-ssa-ccp.c | const_val |
| tree-ssa-ccp.c | n_const_val |
| tree-ssa-forwprop.c | cfg_changed |
| tree-ssa-forwprop.c | to_purge |
| tree-ssa-forwprop.c | lattice |
| et-forest.c | et_nodes |
| et-forest.c | et_occurrences |
| optabs-query.c | this_fn_optabs |
| optabs-query.c | this_target_optabs |
| tree-ssa-structalias.c | use_field_sensitive |
| tree-ssa-structalias.c | predbitmap_obstack |
| tree-ssa-structalias.c | oldpta_obstack |
| tree-ssa-structalias.c | iteration_obstack |
| tree-ssa-structalias.c | constraint_stats |
| tree-ssa-structalias.c | variable_pool_info |
| tree-ssa-structalias.c | final_solutions |
| tree-ssa-structalias.c | final_solutions_obstack |
| tree-ssa-structalias.c | varmap |
| tree-ssa-structalias.c | call_stmt_vars |
| tree-ssa-structalias.c | constraints |
| tree-ssa-structalias.c | constraint_pool |
| tree-ssa-structalias.c | graph |
| tree-ssa-structalias.c | pointer_equiv_class_table |
| tree-ssa-structalias.c | location_equiv_class_table |
| tree-ssa-structalias.c | pointer_equiv_class |
| tree-ssa-structalias.c | vi_for_tree |
| tree-ssa-structalias.c | shared_bitmap_table |

Of course, this algorithm will eventually fail. When that happens, I will use helgrind to check for race conditions on specific variables. The point is that helgrind is slow, so I am currently trying to avoid it to save time.

## Chapter 3 -- Current Issues

Here I report the current issues I am having with the project. There are mainly two:

1. The GCC `object_pool_allocator`

GCC's `object_pool_allocator` is used to allocate some objects. Since these objects may be used later in the compilation by other threads, we can't simply make them private to each thread. Therefore I added a `threadsafe_object_pool_allocator` object that currently uses locks to guarantee safety; however, I am not yet able to verify its correctness. This is also not efficient and might require a better approach later.

2. The `default_obstack` in `bitmap.c`

As mentioned before, there is a default obstack which seems to be used in many parts of the compiler. When trying to make it local to each thread, I am facing crashes in the LTO infrastructure, with several tests failing. I have not found the reason for the failure yet. This is the main issue preventing me from proceeding.
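To make issue 1 concrete, here is a rough self-contained sketch of the lock-around-the-pool idea. The inner free-list is a stand-in for GCC's pool allocator, and all names are illustrative, not the actual patch:

```cpp
#include <pthread.h>
#include <cstddef>
#include <vector>

// Illustrative sketch: wrap a simple free-list pool with a mutex so
// several threads can allocate/return objects safely. GCC's real pool
// allocator carves objects out of larger blocks; a std::vector-based
// free list is used here only to keep the example short.
template <typename T>
class locked_pool
{
  pthread_mutex_t m_lock;
  std::vector<T *> m_free;           // returned objects, ready for reuse

public:
  locked_pool ()
  {
    pthread_mutex_init (&m_lock, NULL);
  }

  ~locked_pool ()
  {
    for (size_t i = 0; i < m_free.size (); i++)
      delete m_free[i];
    pthread_mutex_destroy (&m_lock);
  }

  T *allocate ()
  {
    pthread_mutex_lock (&m_lock);
    T *obj;
    if (!m_free.empty ())
      {
        obj = m_free.back ();        // reuse a returned object
        m_free.pop_back ();
      }
    else
      obj = new T ();                // pool empty: grab fresh storage
    pthread_mutex_unlock (&m_lock);
    return obj;
  }

  void remove (T *obj)               // give the object back to the pool
  {
    pthread_mutex_lock (&m_lock);
    m_free.push_back (obj);
    pthread_mutex_unlock (&m_lock);
  }
};
```

As noted above, a single global lock serializes every allocation, which is why this is a correctness-first placeholder; per-thread caches or lock-free free lists would be the obvious next step.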