On Tue, 25 Jun 2019, Giuliano Belinassi wrote:

> Hi, Richard
>
> On 06/25, Richard Biener wrote:
> > On Mon, 24 Jun 2019, Giuliano Belinassi wrote:
> >
> > > Hi,
> > >
> > > Parallelize GCC with Threads -- First Evaluation
> > >
> > > Hi everyone,
> > >
> > > I am attaching the first evaluation report here publicly for
> > > gathering feedback.  The file is in markdown format and it can
> > > easily be converted to PDF for better visualization.
> > >
> > > I am also open to suggestions and ideas in order to improve the
> > > current project :-)
> > >
> > > My branch can be seen here:
> > > https://gitlab.com/flusp/gcc/tree/giulianob_parallel
> >
> > Thanks for your work on this!
> >
> > I didn't think of the default_obstack and input_location issues,
> > so it's good we get to know these.  The bitmap default obstack is
> > merely a convenience and fortunately (content!) its lifetime is
> > constrained to be per-pass, though that's not fully enforced.  Your
> > solution is fine I think, and besides, when parallelizing the IPA
> > phase it should be no worse than making the default obstack a
> > per-thread context, given how many times we zap it.
>
> This is something that is bothering me.  In the current state, when I
> declare the default_obstack as TLS, several LTO tests fail in
> ssa_verify and I am not sure why yet.  I will investigate this further.
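Just to make sure we mean the same change: I read "declare the
default_obstack as TLS" as essentially the following (a sketch only,
using plain __thread for illustration; your branch may spell the TLS
qualifier differently):

  /* bitmap.c: give each thread its own default bitmap obstack.  */
  __thread bitmap_obstack bitmap_default_obstack;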
If you tell me how to reproduce on your branch I can have a look as
well.

> As for the IPA phase, do you mean expand_ipa or the IPA itself?  If
> expand_ipa falls into the inter-procedural optimization scheme, the
> current strategy will work, of course.

I meant IPA itself.

> As for IPA itself, I am still not sure whether it is parallelizable
> yet.  However, I've seen some academic works on parallelizing dataflow
> analysis, which can give interesting ideas about how to handle the
> call graphs (?).  But this may be something interesting, as the LTO
> WHOPR is entirely sequential AFAIK.
>
> > input_location shouldn't really be used... but oh well.
> >
> > As for per-pass global state, I expect that eventually all
> > global variables in the individual passes are a problem,
> > and in nearly all cases they are global out of laziness
> > to pass down state across functions.
> >
> > You identified some global state in infrastructure code,
> > which are the more interesting cases; most relevant for
> > GIMPLE are the ones in tree-ssa-operands.c, tree-cfg.c
> > and et-forest.c, I guess.
> >
> > For individual passes a first step would be to wrap
> > all globals into a struct we can allocate at ::execute
> > time and pass down as a pointer.  That's slightly less
> > intrusive than wrapping all of the pass in a class
> > but functionally equivalent.
>
> Sounds really good, but this will also require a lot of code change.

Yeah, it's really a monkey's job ;)  One reason I originally proposed
the pipeline approach was to avoid the need to fix all of these.

> Hopefully most of it can be solved without any unpleasant surprises.

I would guess so; it's also a generally desirable cleanup.

> > <<<
> > 1. The GCC `object_pool_allocator`
> >
> > There is also the GCC object_pool_allocator, which is used to
> > allocate some objects.  Since these objects may be used later in
> > the compilation by other threads, we can't simply make them private
> > to each thread.  Therefore I added a threadsafe_object_pool_allocator
> > object that currently uses locks to guarantee safety; however, I am
> > not yet able to check its correctness.  This is also not efficient
> > and might require a better approach later.
> > >>>
> >
> > I guess the same applies to the GC allocator - to make these
> > more efficient we'd have a per-thread freelist we can allocate
> > from without locking and which we'd, once empty, fill from the
> > main pool in larger chunks with locking.  At thread finalization
> > we have to return the freelist to the main allocator, and for
> > the GC allocator possibly at garbage collection time as well.
>
> This also sounds good.  If I understand correctly, the cost of the
> sequential allocation will be amortized, as we could ask for a chunk
> 2x bigger than what each thread currently has.  (I am assuming
> allocation is O(1).)

Yeah, it's mostly trading (freelist) memory for speed.  (A rough
sketch of the scheme is at the end of this mail.)

> > This scheme may also work for the bitmap default obstack
> > and its freelist.  I would also suggest, when you run into
> > a specific issue with the default obstack, to use a separate
> > obstack in the respective area.  You mention issues with LTO;
> > are they reproducible on the branch?  I suppose you are
> > currently testing GCC with num_threads = 1 to make sure you
> > are not introducing non-threading-related issues?
>
> Yes.  The issue with LTO is with that obstack, as I mentioned earlier.
> It is reproducible on my branch: just compile it without any changes,
> then declare the obstack TLS, recompile with the change and run the
> gcc.dg tests.
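Coming back to the per-pass struct-wrapping idea further above, the
kind of mechanical change I have in mind is roughly the following.
This is only a sketch around a made-up pass (pass_frobnicate,
frobnicate_state and frobnicate_bb are invented names, and the usual
pass_data/opt_pass boilerplate is omitted), not code from trunk or
from your branch:

  /* Before: file-scope state shared by all helper functions.
       static bitmap visited;
       static int n_transformed;  */

  /* After: the same state lives in a struct owned by ::execute.  */
  struct frobnicate_state
  {
    bitmap visited;
    int n_transformed;
  };

  static void
  frobnicate_bb (basic_block bb, frobnicate_state *st)
  {
    /* Visit each block once; bitmap_set_bit returns true if the bit
       was previously clear.  */
    if (!bitmap_set_bit (st->visited, bb->index))
      return;
    /* ... the actual transformation ... */
    st->n_transformed++;
  }

  unsigned int
  pass_frobnicate::execute (function *fun)
  {
    frobnicate_state st;
    st.visited = BITMAP_ALLOC (NULL);
    st.n_transformed = 0;

    basic_block bb;
    FOR_EACH_BB_FN (bb, fun)
      frobnicate_bb (bb, &st);

    BITMAP_FREE (st.visited);
    return 0;
  }

The pass logic stays the same; only a pointer is threaded through the
helpers, and it can be done one pass at a time.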
Trying that obstack recipe, but I cannot reproduce any failure so far.
Did you adjust bitmap_default_obstack_depth to be TLS as well?

Richard.
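P.S. On the allocator side, the per-thread freelist scheme sketched
above would look roughly like this.  Again this is only an
illustration in plain C++ (shared_pool, thread_alloc and friends are
made-up names, not the object_pool_allocator or ggc interfaces): the
fast path allocates from a thread-local cache without locking, refills
happen in chunks under a mutex, and a thread hands its leftovers back
at finalization.

  #include <mutex>
  #include <vector>

  struct obj { char payload[32]; };   /* Stand-in for a pool object.  */

  class shared_pool
  {
    std::mutex mtx;
    std::vector<obj *> free_objs;     /* Global freelist, guarded by mtx.  */

  public:
    /* Hand out up to N free objects in a single locked operation,
       allocating fresh ones if the global freelist runs out.  */
    void refill (std::vector<obj *> &local, size_t n)
    {
      std::lock_guard<std::mutex> g (mtx);
      while (n--)
        {
          if (!free_objs.empty ())
            {
              local.push_back (free_objs.back ());
              free_objs.pop_back ();
            }
          else
            local.push_back (new obj);
        }
    }

    /* Return a thread's leftover objects, e.g. at thread finalization
       or before a garbage collection.  */
    void give_back (std::vector<obj *> &local)
    {
      std::lock_guard<std::mutex> g (mtx);
      free_objs.insert (free_objs.end (), local.begin (), local.end ());
      local.clear ();
    }
  };

  static shared_pool pool;

  /* Per-thread cache; a pointer so that __thread stays trivial.  */
  static __thread std::vector<obj *> *local_cache;

  obj *
  thread_alloc ()
  {
    if (!local_cache)
      local_cache = new std::vector<obj *> ();
    if (local_cache->empty ())
      pool.refill (*local_cache, 64);   /* One lock per 64 allocations.  */
    obj *o = local_cache->back ();
    local_cache->pop_back ();
    return o;
  }

  void
  thread_free (obj *o)
  {
    if (!local_cache)
      local_cache = new std::vector<obj *> ();
    local_cache->push_back (o);         /* Lock-free on the common path.  */
  }

The per-thread caches of course pin some memory, which is the
memory-for-speed trade mentioned above; for the GC allocator the
give_back step would also have to run before a collection.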