https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957
--- Comment #25 from Anthony <prop_design at protonmail dot com> ---
(In reply to Anthony from comment #24)
> (In reply to rguent...@suse.de from comment #23)
> > On Sun, 28 Jun 2020, prop_design at protonmail dot com wrote:
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957
> > >
> > > --- Comment #22 from Anthony <prop_design at protonmail dot com> ---
> > > (In reply to Thomas Koenig from comment #21)
> > > > Another question: Is there anything left to be done with the
> > > > vectorizer, or could we remove that dependency?
> > >
> > > thanks for looking into this again for me. i'm surprised it worked the
> > > same on Linux, but knowing that, at least helps debug this issue some
> > > more. i'm not sure about the vectorizer question, maybe that question
> > > was intended for someone else. the runtimes seem good as is though. i
> > > doubt the auto-parallelization will add much speed. but it's an
> > > interesting feature that i've always hoped would work. i've never got
> > > it to work though. the only code that did actually implement something
> > > was Intel Fortran. it implemented one trivial loop, but it slowed the
> > > code down instead of speeding it up. the output from gfortran shows
> > > more loops it wants to run in parallel. they aren't important ones.
> > > but something would be better than nothing. if it slowed the code
> > > down, i would just not use it.
> >
> > GCC adds runtime checks for a minimal number of iterations before
> > dispatching to the parallelized code - I guess we simply never hit
> > the threshold. This is configurable via --param parloops-min-per-thread,
> > the default is 100, the default number of threads is determined the same
> > as for OpenMP so you can probably tune that via OMP_NUM_THREADS.
>
> thanks for that tip. i tried changing the parloops parameters but no luck.
> the only difference was the max thread use went from 2 to 3. core use was
> the same.
>
> i added the following and some variations of these:
>
> --param parloops-min-per-thread=2 (the default was 100 like you said)
> --param parloops-chunk-size=1 (the default was zero so i removed this
> parameter later)
> --param parloops-schedule=auto (tried all options except guided, the
> default is static)
>
> i was able to check that they were set via:
>
> --help=param -Q
>
> some other things i tried was adding -mthreads and removing -static. but
> so far no luck. i also tried using -mthreads instead of -pthread.
>
> i should make clear i'm testing PROP_DESIGN_MAPS, not MP_PROP_DESIGN.
> MP_PROP_DESIGN is ancient and the added benchmarking loops were messing
> with the ability of the optimizer to auto-parallelize (in the past at
> least).

I did more testing and the added options actually slow the code way down. however, it still only uses one core. from what i can tell, if i set OMP_PLACES it doesn't seem to work. i saw a thread from years ago where someone had the same problem. i think OMP_PLACES might be working on linux but not on windows; that's what the thread i found was saying. i don't really know, but i've exhausted all the possibilities at this point. the only thing i know for sure is i can't get it to use anything more than one core.

--- Comment #27 from Anthony <prop_design at protonmail dot com> ---
so after trying a bunch of things, i think the final problem may be this. i get the following result when i try to set thread affinity:

set GOMP_CPU_AFFINITY="0 1"

gives the following feedback at run time:

libgomp: Affinity not supported on this configuration

i have to close the command prompt window to stop the program. the program doesn't run properly if i try to set thread affinity. so this still makes me think it might work on linux and not windows 10, but i have no way to test that. the extra threads that auto-parallelization creates will only go to one core, on my machine at least.
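Collecting the flags and environment variables tried across these comments into one place, a minimal sketch of the build and run settings under discussion. The flags and variables are the ones named in the thread; the source file name `prop_design_maps.f90` is an assumption, so the build step is guarded and simply skipped when gfortran or that file is not present.

```shell
#!/bin/sh
# Sketch: auto-parallelized build of PROP_DESIGN_MAPS (file name assumed).
SRC=prop_design_maps.f90
FLAGS="-O3 -ftree-parallelize-loops=4 \
  --param parloops-min-per-thread=2 \
  --param parloops-schedule=auto"

# Only attempt the build where gfortran and the source file are available.
if command -v gfortran >/dev/null 2>&1 && [ -f "$SRC" ]; then
  gfortran $FLAGS "$SRC" -o prop_design_maps
fi

# libgomp reads these at run time; on the Windows toolchain discussed
# here, the affinity variable is rejected with
# "libgomp: Affinity not supported on this configuration".
export OMP_NUM_THREADS=4
export GOMP_CPU_AFFINITY="0 1 2 3"
echo "env: OMP_NUM_THREADS=$OMP_NUM_THREADS, GOMP_CPU_AFFINITY=$GOMP_CPU_AFFINITY"
```

Whether libgomp honours GOMP_CPU_AFFINITY depends on how the toolchain itself was built, not on compile flags for the program, which would match the behaviour reported above: the same settings working on Linux while the Windows build reports affinity as unsupported.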