On 06/25, Richard Biener wrote:
> On Tue, 25 Jun 2019, Giuliano Belinassi wrote:
> 
> > Hi, Richard
> > 
> > On 06/25, Richard Biener wrote:
> > > On Mon, 24 Jun 2019, Giuliano Belinassi wrote:
> > > 
> > > > Hi,
> > > > 
> > > > Parallelize GCC with Threads -- First Evaluation
> > > > 
> > > > Hi everyone,
> > > > 
> > > > I am attaching the first evaluation report here publicly for gathering
> > > > feedback. The file is in markdown format and it can be easily be 
> > > > converted to
> > > > PDF for better visualization.
> > > > 
> > > > I am also open to suggestions and ideas in order to improve the current 
> > > > project :-)
> > > > 
> > > > My branch can be seen here: 
> > > > https://gitlab.com/flusp/gcc/tree/giulianob_parallel
> > > 
> > > Thanks for your work on this!
> > > 
> > > I didn't think of the default_obstack and input_location issues
> > > so it's good we get to know these.  The bitmap default obstack
> > > is merely a convenience and fortunately (content!) lifetime is constrained
> > > to be per-pass though that's not fully enforced.  Your solution is fine
> > > I think and besides when parallelizing the IPA phase should be not
> > > worse than making the default-obstack per thread context given how
> > > many times we zap it.
> > 
> > This is something that is bothering me. In this current state, when I
> > declare the default_obstack as TLS, several LTO tests fails in
> > ssa_verify  and I am not sure why yet. I will investigate futher in
> > this.
> 
> If you tell me how to reproduce on your branch I can have a look as well.

To reproduce it, clone the branch with:

git clone https://gitlab.com/flusp/gcc.git

Then change branch with

git checkout giulianob_parallel

create a new folder (build_parallel, for instance) and compile it with

../configure --disable-bootstrap --enable-languages=c
make -j<NUM_JOBS>

Then 

```
diff --git a/gcc/bitmap.c b/gcc/bitmap.c
index 894aefa13de..87e6c7ac6fc 100644
--- a/gcc/bitmap.c
+++ b/gcc/bitmap.c
@@ -65,7 +65,7 @@ release_overhead (bitmap b, size_t amount, bool 
remove_from_map)
 
 /* Global data */
 bitmap_element bitmap_zero_bits;  /* An element of all zero bits.  */
-bitmap_obstack bitmap_default_obstack;    /* The default bitmap obstack.  */
+__thread bitmap_obstack bitmap_default_obstack;    /* The default bitmap 
obstack.  */
 static int bitmap_default_obstack_depth;
 static GTY((deletable)) bitmap_element *bitmap_ggc_free; /* Freelist of
            GC'd elements.  */
diff --git a/gcc/bitmap.h b/gcc/bitmap.h
index 39f509db611..b31e99ffcd0 100644
--- a/gcc/bitmap.h
+++ b/gcc/bitmap.h
@@ -359,7 +359,7 @@ struct GTY(()) bitmap_head {
 
 /* Global data */
 extern bitmap_element bitmap_zero_bits; /* Zero bitmap element */
-extern bitmap_obstack bitmap_default_obstack;   /* Default bitmap obstack */
+extern __thread bitmap_obstack bitmap_default_obstack;   /* Default bitmap 
obstack */
 
 /* Change the view of the bitmap to list, or tree.  */
 void bitmap_list_view (bitmap);
```

then recompile.

Just a side note, while I was writting this, Richi found the issue. The
obstack_depth also needs to be marked as TLS. Why it only crashed LTO
is a mistery, through.


> 
> > As for IPA phase, you mean the expand_ipa or the IPA itself? If
> > expand_ipa falls in Inter Process otimization scheme the current
> > strategy will work, of course.
> 
> I meant IPA itself.
> 
> > As for IPA itself, I am still not sure if it is parallelizable yet,
> > however, I've seen some academic works parallelizing Dataflow
> > Analysis, which can give interesting ideas about how to handle the call
> > graphs (?). But this may be something interesting as the LTO WHOPR is
> > entirely sequential AFIK.
> > 
> > > 
> > > input_location shouldn't really be used...  but oh well.
> > > 
> > > As of per-pass global state I expect that eventually all
> > > global variables in the individual passes are a problem
> > > and in nearly all cases they are global out of laziness
> > > to pass down state across functions.
> > > 
> > > You identified some global state in infrastructure code
> > > which are the more interesting cases, most relevant for
> > > GIMPLE are the ones in tree-ssa-operands.c, tree-cfg.c
> > > and et-forest.c I guess.
> > > 
> > > For individual passes a first step would be to wrap
> > > all globals into a struct we can allocate at ::execute
> > > time and pass down as pointer.  That's slightly less
> > > intrusive than wrapping all of the pass in a class
> > > but functionally equivalent.
> > 
> > Sounds really good, but this will also require a lot of code change.
> 
> Yeah, it's really a monkeys job ;)  One reason I originally proposed
> the pipeline approach to avoid the need to fix all of these.
> 
> > Hopefully most of it can be solved without any unpleasant surprises.
> 
> I would guess so, it's also a generally desirable cleanup.
> 
> > >
> > > <<<
> > > 1. The GCC `object_pool_allocator`
> > > 
> > >     There is also the GCC object_pool_allocator, which is used to 
> > > allocate 
> > > some
> > >     objects. Since these objects may be used later in the compilation by 
> > > other
> > >     threads, we can't simply make them private to each thread. Therefore 
> > > I 
> > > added a
> > >     threadsafe_object_pool_allocator object that currently uses locks 
> > > guarantee
> > >     safety, however I am not able to check its correctness. This is also
> > >     not efficient and might require a better approach later.
> > > >>>
> > > 
> > > I guess the same applies to the GC allocator - to make these
> > > more efficient we'd have a per-thread freelist we can allocate
> > > from without locking and which we'd, once empty, fill from the
> > > main pool in larger chunks with locking.  At thread finalization
> > > we have to return the freelist to the main allocator then
> > > and for the GC allocator possibly at garbage collection time.
> > 
> > This also sounds good. If I understeand correctly, the sequential
> > allocation state will be amortized, as we could ask for an object which
> > is 2x bigger than each thread currently has. (I am assuming allocation
> > is O(1))
> 
> Yeah, it's mostly trading (freelist) memory for speed.
> 
> > > 
> > > This scheme may also work for the bitmap default obstack
> > > and its freelist.  I would also suggest when you run into
> > > a specific issue with the default obstack to use a separate
> > > obstack in the respective area.  You mention issues with LTO,
> > > are the reproducible on the branch?  I suppose you are
> > > currently testing GCC with num_threads = 1 to make sure you
> > > are not introducing non-threading related issues?
> > 
> > Yes. The issue with LTO is that obstack, as I mentioned earlier. It is
> > reproducible in my branch, just compile it without any changes, then
> > declare the obstack TLS, recompile the changes and run gcc.dg tests.
> 
> Trying that but cannot reproduce any failure sofar.  Did you adjust
> bitmap_default_obstack_depth to be TLS as well?

No. You found the issue :P

>
> Richard.

Reply via email to