Re: GSOC Question about the parallelization project

Sebastiaan Peters Wed, 21 Mar 2018 12:05:03 -0700

>On Wed, Mar 21, 2018 at 9:50 AM, Sebastiaan Peters
><sebasp...@hotmail.com> wrote:
>>>On Tue, Mar 20, 2018 at 3:49 PM, David Malcolm <dmalc...@redhat.com> wrote:
>>>> On Tue, 2018-03-20 at 14:02 +0100, Richard Biener wrote:
>>>>> On Mon, Mar 19, 2018 at 9:55 PM, Richard Biener
>>>>> <richard.guent...@gmail.com> wrote:
>>>>> > On March 19, 2018 8:09:32 PM GMT+01:00, Sebastiaan Peters <sebaspe9
>>>>> > 7...@hotmail.com> wrote:
>>>>> > > > The goal should be to extend TU wise parallelism via make to
>>>>> > > > function
>>>>> > >
>>>>> > > wise parallelism within GCC.
>>>>> > >
>>>>> > > Could you please elaborate more on this?
>>>>> >
>>>>> > In the abstract sense you'd view the compilation process separated
>>>>> > into N stages, each function being processed by each. You'd assign
>>>>> > a thread to each stage and move the work items (the functions)
>>>>> > across the set of threads honoring constraints such as an IPA stage
>>>>> > needing all functions completed the previous stage. That allows you
>>>>> > to easier model the constraints due to shared state (like no pass
>>>>> > operating on two functions at the same time) compared to a model
>>>>> > where you assign a thread to each function.
>>>>> >
>>>>> > You'll figure that the easiest point in the pipeline to try this
>>>>> > 'pipelining' is after IPA has completed and until RTL is generated.
>>>>> >
>>>>> > Ideally the pipelining would start as early as the front ends
>>>>> > finished parsing a function and ideally we'd have multiple
>>>>> > functions in the RTL pipeline.
>>>>> >
>>>>> > The main obstacles will be the global state in the compiler of
>>>>> > which there is the least during the GIMPLE passes (mostly cfun and
>>>>> > current_function_decl plus globals in the individual passes which
>>>>> > is easiest dealt with by not allowing a single pass to run at the
>>>>> > same time in multiple threads). TLS can be used for some of the
>>>>> > global state plus of course some global data structures need
>>>>> > locking.
>>
>> This would mean that all the passes have to be individually analyzed for
>> which global state
>> they use, and lock/schedule them accordingly?
>
>Their private global state would be excempt by assuring that a pass never
>runs twice at the same time.
>
>The global state that remains should be the same for all passes we are talking
>about (during the late GIMPLE optimization phase which I'd tackle).
>
>> If this is the case, is there any documentation that describes the pre-reqs
>> for each pass?
>> I have looked at the internal documentation, however it seems that all of
>> this still has to be created?
>
>The prereqs are actually the same and not very well documented (if at all).
>There's the global GC memory pool where we allocate statements and
>stuff like that from (and luckyly statements themselves are function private).
>Then there's global trees like types ('int') where modification needs to be
>appropriately guarded.  Note that "modification" means for example
>building a new type for the address of 'int' given that all different
>pointer types
>to 'int' are chained in a list rooted in the tree for 'int'.  That
>means (a few?)
>tree building helpers need to be guarded with locks.  I don't have a great
>idea how to identify those apart from knowing them in advance or running
>into races ... my gut feeling is that there's not a lot to guard but I may
>be wrong ;)


What does it mean to be a node of a type tree?
Does it describe information about that type,
or does it keep a reference to where something
of that type has been declared?

>> As to how this would be implemented, it seems the easiest way would be to
>> extend the macros to
>> accept a pre-req pass. This would encourage more documentation since the
>> dependencies
>> become explicit instead of the current implicit ordering.
>
>Actually the order is quite explicit.  Maybe I now understand your
>question - no,
>passes do not "communicate" between each other via global state, all such
>state is per function and the execution order of passes on a given function
>is hard-coded in passes.def.
>
>> Assuming the dependencies for the all the tree-ssa passes have to be
>> individually analyzed.
>> Currently I have this as my timeline:
>>     - Parallelize the execution of the post-IPA pre-RTL, and a few tree-ssa
>> passes (mid-may - early june)
>>     - Test for possible reproducibility issues for the binary/debug info
>> (early june - late june)
>>     - Parallelize the rest of tree-ssa (late june - late july)
>>     - Update documentation and test again for reproducibility issues (late
>> july - early august)
>>
>> Would this be acceptable?
>
>Sounds ambitious ;)  But yes, it sounds reasonable.  I don't exactly
>understand what "Parallelize the rest of tree-ssa" means though.  If
>you want to tackle a tiny bit more I'd rather include "RTL" by treating
>the whole RTL part as a single "pass" (as said only one function can
>be in RTL right now).
>

I was under the assumption that passes had to be indivdually analysed
when I wrote that. The timeline is now updated to include some extra
time in the beginning to familiarize myself a bit more with the project,
and added the RTL as an optional deliverable:)

I added a draft of the proposal[0], any feedback would be highly appreciated.

>>>>> Oh, and just to mention - there are a few things that may block
>>>>> adoption in the end
>>>>> like whether builds are still reproducible (we allocate things like
>>>>> DECL_UID from
>>>>> global pools and doing that somewhat randomly because of threading
>>>>> might - but not
>>>>> must - change code generation).  Or that some diagnostics will appear
>>>>> in
>>>>> non-deterministic order, or that dump files are messed up (both
>>>>> issues could be
>>>>> solved by code dealing with the issue, like buffering and doing a re-
>>>>> play in
>>>>> program order).  I guess reproducability is important when it comes
>>>>> down to
>>>>> debugging code-generation issues - I'd prefer to debug gcc when it
>>>>> doesn't run
>>>>> threaded but if that doesn't reproduce an issue that's bad.
>>>>>
>>>>> So the most important "milestone" of this project is to identify such
>>>>> issues and
>>>>> document them somewhere.
>>>>
>>>> One issue would be the garbage-collector: there are plenty of places in
>>>> GCC that have hidden assumptions that "a collection can't happen here"
>>>> (where we have temporaries that reference GC-managed objects, but which
>>>> aren't tracked by GC-roots).
>>>>
>>>> I had some patches for that back in 2014 that I think I managed to drop
>>>> on the floor (sorry):
>>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01300.html
>>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01340.html
>>>>   https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01510.html
>>
>> Would there be a way to easily create a static analyzer to find these
>> untracked temporaries?
>
>I don't think so.
>
>> A quick look at registered passes makes me count ~135 tree-ssa passes,
>> So your code on analyzing what globals are referenced where might come in
>> handy while analyzing if passes are easily parallelized.
>
>I think the "solution" to the garbage collecting issue is to simply keep
>that serialized.  It's currently done on-demand anyway at certain
>safe collection points so the work scheduler can simply hold off
>scheduling more work when the collector would want to run, waiting for
>running jobs to complete.
>
>Richard.
>
>>>> The GC's allocator is used almost everywhere, and is probably not
>>>> thread-safe yet.
>>>Yes.  There's also global tree modification like chaining new
>>>pointer types into TYPE_POINTER_TO and friends so some
>>>helpers in tree.c need to be guarded as well.
>>>> FWIW I gave a talk at Cauldron 2013 about global state in GCC.  Beware:
>>>> it's five years out-of-date, but maybe is still relevant in places?
>>>>   https://dmalcolm.fedorapeople.org/gcc/global-state/
>>>>   https://gcc.gnu.org/ml/gcc/2013-05/msg00015.html
>>>> (I tackled this for libgccjit by instead introducing a mutex, a "big
>>>> compiler lock", jit_mutex in gcc/jit/jit-playback.c, held by whichever
>>>> thread is calling into the rest of the compiler sources).
>>>>
>>>> Hope this is helpful
>>>> Dave
>>>>
>>>> 

[0] 
https://docs.google.com/document/d/1YkKYI3J-pKFfHLdxljWVoJ89l2oVGibxslIW3ikluz8/edit?usp=sharing

Re: GSOC Question about the parallelization project

Reply via email to