On Tue, 25 Jun 2019, Giuliano Belinassi wrote:

> Hi, Richard
>
> On 06/25, Richard Biener wrote:
> > On Mon, 24 Jun 2019, Giuliano Belinassi wrote:
> >
> > > Hi,
> > >
> > > Parallelize GCC with Threads -- First Evaluation
> > >
> > > Hi everyone,
> > >
> > > I am attaching the first evaluation report here publicly for
> > > gathering feedback.  The file is in markdown format and it can
> > > easily be converted to PDF for better visualization.
> > >
> > > I am also open to suggestions and ideas in order to improve the
> > > current project :-)
> > >
> > > My branch can be seen here:
> > > https://gitlab.com/flusp/gcc/tree/giulianob_parallel
> >
> > Thanks for your work on this!
> >
> > I didn't think of the default_obstack and input_location issues,
> > so it's good we get to know these.  The bitmap default obstack is
> > merely a convenience and fortunately (content!) its lifetime is
> > constrained to be per-pass, though that's not fully enforced.  Your
> > solution is fine I think, and besides, when parallelizing the IPA
> > phase it should be no worse than making the default obstack a
> > per-thread context, given how many times we zap it.
>
> This is something that is bothering me.  In the current state, when I
> declare the default_obstack as TLS, several LTO tests fail in
> ssa_verify and I am not sure why yet.  I will investigate this further.
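Just to make sure we mean the same change: I read "declare the
default_obstack as TLS" as essentially the following (a sketch only,
using plain __thread for illustration; your branch may spell the TLS
qualifier differently):

  /* bitmap.c: give each thread its own default bitmap obstack.  */
  __thread bitmap_obstack bitmap_default_obstack;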
If you tell me how to reproduce on your branch I can have a look as
well.

> As for the IPA phase, do you mean expand_ipa or the IPA itself?  If
> expand_ipa falls into the inter-procedural optimization scheme, the
> current strategy will work, of course.

I meant IPA itself.

> As for IPA itself, I am still not sure whether it is parallelizable
> yet.  However, I've seen some academic works on parallelizing dataflow
> analysis, which can give interesting ideas about how to handle the
> call graphs (?).  But this may be something interesting, as the LTO
> WHOPR is entirely sequential AFAIK.
>
> > input_location shouldn't really be used... but oh well.
> >
> > As for per-pass global state, I expect that eventually all
> > global variables in the individual passes are a problem,
> > and in nearly all cases they are global out of laziness
> > to pass down state across functions.
> >
> > You identified some global state in infrastructure code,
> > which are the more interesting cases; most relevant for
> > GIMPLE are the ones in tree-ssa-operands.c, tree-cfg.c
> > and et-forest.c, I guess.
> >
> > For individual passes a first step would be to wrap
> > all globals into a struct we can allocate at ::execute
> > time and pass down as a pointer.  That's slightly less
> > intrusive than wrapping all of the pass in a class
> > but functionally equivalent.
>
> Sounds really good, but this will also require a lot of code change.

Yeah, it's really a monkey's job ;)  One reason I originally proposed
the pipeline approach was to avoid the need to fix all of these.

> Hopefully most of it can be solved without any unpleasant surprises.

I would guess so; it's also a generally desirable cleanup.

> > <<<
> > 1. The GCC `object_pool_allocator`
> >
> > There is also the GCC object_pool_allocator, which is used to
> > allocate some objects.  Since these objects may be used later in
> > the compilation by other threads, we can't simply make them private
> > to each thread.  Therefore I added a threadsafe_object_pool_allocator
> > object that currently uses locks to guarantee safety; however, I am
> > not yet able to check its correctness.  This is also not efficient
> > and might require a better approach later.
> > >>>
> >
> > I guess the same applies to the GC allocator - to make these
> > more efficient we'd have a per-thread freelist we can allocate
> > from without locking and which we'd, once empty, fill from the
> > main pool in larger chunks with locking.  At thread finalization
> > we have to return the freelist to the main allocator, and for
> > the GC allocator possibly at garbage collection time as well.
>
> This also sounds good.  If I understand correctly, the cost of the
> sequential allocation will be amortized, as we could ask for a chunk
> 2x bigger than what each thread currently has.  (I am assuming
> allocation is O(1).)

Yeah, it's mostly trading (freelist) memory for speed.  (A rough
sketch of the scheme is at the end of this mail.)

> > This scheme may also work for the bitmap default obstack
> > and its freelist.  I would also suggest, when you run into
> > a specific issue with the default obstack, to use a separate
> > obstack in the respective area.  You mention issues with LTO;
> > are they reproducible on the branch?  I suppose you are
> > currently testing GCC with num_threads = 1 to make sure you
> > are not introducing non-threading-related issues?
>
> Yes.  The issue with LTO is with that obstack, as I mentioned earlier.
> It is reproducible on my branch: just compile it without any changes,
> then declare the obstack TLS, recompile with the change and run the
> gcc.dg tests.
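Coming back to the per-pass struct-wrapping idea further above, the
kind of mechanical change I have in mind is roughly the following.
This is only a sketch around a made-up pass (pass_frobnicate,
frobnicate_state and frobnicate_bb are invented names, and the usual
pass_data/opt_pass boilerplate is omitted), not code from trunk or
from your branch:

  /* Before: file-scope state shared by all helper functions.
       static bitmap visited;
       static int n_transformed;  */

  /* After: the same state lives in a struct owned by ::execute.  */
  struct frobnicate_state
  {
    bitmap visited;
    int n_transformed;
  };

  static void
  frobnicate_bb (basic_block bb, frobnicate_state *st)
  {
    /* Visit each block once; bitmap_set_bit returns true if the bit
       was previously clear.  */
    if (!bitmap_set_bit (st->visited, bb->index))
      return;
    /* ... the actual transformation ... */
    st->n_transformed++;
  }

  unsigned int
  pass_frobnicate::execute (function *fun)
  {
    frobnicate_state st;
    st.visited = BITMAP_ALLOC (NULL);
    st.n_transformed = 0;

    basic_block bb;
    FOR_EACH_BB_FN (bb, fun)
      frobnicate_bb (bb, &st);

    BITMAP_FREE (st.visited);
    return 0;
  }

The pass logic stays the same; only a pointer is threaded through the
helpers, and it can be done one pass at a time.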
Trying that obstack recipe, but I cannot reproduce any failure so far.
Did you adjust bitmap_default_obstack_depth to be TLS as well?

Richard.
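P.S. On the allocator side, the per-thread freelist scheme sketched
above would look roughly like this.  Again this is only an
illustration in plain C++ (shared_pool, thread_alloc and friends are
made-up names, not the object_pool_allocator or ggc interfaces): the
fast path allocates from a thread-local cache without locking, refills
happen in chunks under a mutex, and a thread hands its leftovers back
at finalization.

  #include <mutex>
  #include <vector>

  struct obj { char payload[32]; };   /* Stand-in for a pool object.  */

  class shared_pool
  {
    std::mutex mtx;
    std::vector<obj *> free_objs;     /* Global freelist, guarded by mtx.  */

  public:
    /* Hand out up to N free objects in a single locked operation,
       allocating fresh ones if the global freelist runs out.  */
    void refill (std::vector<obj *> &local, size_t n)
    {
      std::lock_guard<std::mutex> g (mtx);
      while (n--)
        {
          if (!free_objs.empty ())
            {
              local.push_back (free_objs.back ());
              free_objs.pop_back ();
            }
          else
            local.push_back (new obj);
        }
    }

    /* Return a thread's leftover objects, e.g. at thread finalization
       or before a garbage collection.  */
    void give_back (std::vector<obj *> &local)
    {
      std::lock_guard<std::mutex> g (mtx);
      free_objs.insert (free_objs.end (), local.begin (), local.end ());
      local.clear ();
    }
  };

  static shared_pool pool;

  /* Per-thread cache; a pointer so that __thread stays trivial.  */
  static __thread std::vector<obj *> *local_cache;

  obj *
  thread_alloc ()
  {
    if (!local_cache)
      local_cache = new std::vector<obj *> ();
    if (local_cache->empty ())
      pool.refill (*local_cache, 64);   /* One lock per 64 allocations.  */
    obj *o = local_cache->back ();
    local_cache->pop_back ();
    return o;
  }

  void
  thread_free (obj *o)
  {
    if (!local_cache)
      local_cache = new std::vector<obj *> ();
    local_cache->push_back (o);         /* Lock-free on the common path.  */
  }

The per-thread caches of course pin some memory, which is the
memory-for-speed trade mentioned above; for the GC allocator the
give_back step would also have to run before a collection.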