[Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler

2020-06-27 Thread prop_design at protonmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

prop_design at protonmail dot com changed:

   What|Removed |Added

 CC||prop_design at protonmail dot com

--- Comment #19 from prop_design at protonmail dot com ---
Hi everyone,

I'm not sure if this is the right place to ask, but it relates to the topic. I can't find the other thread about Graphite auto-parallelization that I made a long time ago.

I tried gfortran 10.1.0 via MSYS2. It seems to work very well on the latest
version of PROP_DESIGN. MP_PROP_DESIGN had some extra loops added for
benchmarking; I found they made things harder for the optimizer, so I deleted
that code and now use the 'real' program it was based on, called
PROP_DESIGN_MAPS. That is the actual propeller design code, with no additional
looping for benchmarking purposes.

I've found no Fortran compiler that does auto-parallelization the way I would
like. The only one that implemented any at run time actually slowed the code
way down instead of speeding it up.

I still have my original problem with gfortran. That is, at run time no actual
parallelization occurs. The code runs exactly the same as if the options were
not present. Oddly, though, gfortran does report that it auto-parallelized many
loops. They aren't the loops that would really help, but at least it shows it's
doing something. That's an improvement from when I started these threads.

The problem is that if I compile with the following:

gfortran PROP_DESIGN_MAPS.f -o PROP_DESIGN_MAPS.exe -O3 -ffixed-form -static
-march=x86-64 -mtune=generic -mfpmath=sse -mieee-fp -pthread
-ftree-parallelize-loops=2 -floop-parallelize-all -fopt-info-loop


it runs exactly the same way as when I compile with:

gfortran PROP_DESIGN_MAPS.f -o PROP_DESIGN_MAPS.exe -O3 -ffixed-form -static
-march=x86-64 -mtune=generic -mfpmath=sse -mieee-fp


Again, gfortran does say it auto-parallelizes some loops, so it's very odd. I
have searched the net and can't find anything that helps.
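
To separate this from PROP_DESIGN itself, a minimal stand-alone test of thread
use (just a sketch, not code from PROP_DESIGN) would be something like:

      PROGRAM PARTEST
C     MINIMAL LOOPS THAT -FTREE-PARALLELIZE-LOOPS=2 SHOULD BE ABLE
C     TO SPLIT ACROSS TWO THREADS (A SKETCH, NOT FROM PROP_DESIGN)
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 50000000
      DOUBLE PRECISION, ALLOCATABLE :: A(:)
      DOUBLE PRECISION S
      INTEGER I
      ALLOCATE(A(N))
C     INDEPENDENT ITERATIONS, SO THE LOOP IS SAFE TO PARALLELIZE
      DO I = 1, N
         A(I) = DBLE(I)*1.0D-6
      END DO
C     SUM REDUCTION, ALSO A CANDIDATE FOR PARLOOPS
      S = 0.0D0
      DO I = 1, N
         S = S + A(I)
      END DO
      PRINT *, S
      DEALLOCATE(A)
      END PROGRAM PARTEST

compiled with, for example:

gfortran partest.f -o partest.exe -O3 -ffixed-form -pthread
-ftree-parallelize-loops=2 -fopt-info-loop

If that test keeps two cores busy, the problem is specific to PROP_DESIGN_MAPS;
if it still uses only one core, the problem is in the toolchain or run time on
Windows.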

I'm wondering whether, for Linux users, the code actually runs in parallel.
That would at least narrow the problem down some. I'm using Windows 10, and the
code will only run on one core. With either build, it shows 2 threads in use
for a while and then drops to 1 thread.

The good news, compared to when this bug was filed, is that gfortran ran the
code at the same speed as the PGI Community Edition compiler. Since PGI just
stopped developing that, I switched back to gfortran. I no longer have Intel
Fortran to test. That was the compiler that actually did run the code in
parallel, but it ran twice as slow instead of twice as fast. That was a year or
two ago; I don't know if it's any better now.

I'm wondering if there is some sort of issue with -pthread not being able to
use more than one core on Windows 10.

You can download PROP_DESIGN at https://propdesign.jimdofree.com

Inside the download are all the *.f files. I also have c.bat files in there
with the compiler options I used. The auto-parallelization options are not
present, since they still don't seem to be working, at least on Windows 10.

The code now runs much faster than it used to, due to many bug fixes and
improvements I've made over the years. However, you can make it run really
slowly for testing purposes. In the program's settings file, change the
defaults like this:

1   ALLOW VORTEX INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM, DEFAULT = 2)
2   ALLOW BLADE-TO-BLADE INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM, DEFAULT = 2)

or like this:

1   ALLOW VORTEX INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM, DEFAULT = 2)
1   ALLOW BLADE-TO-BLADE INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM, DEFAULT = 2)

The first setting runs very slowly, the second incredibly slowly. I just close
the command window once I've seen whether the code is running in parallel or
not. With the defaults of 2 for each of those values, the code runs so fast
that you can't really get a sense of what's going on.

Thanks for any help,

Anthony

[Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler

2020-06-28 Thread prop_design at protonmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #22 from Anthony  ---
(In reply to Thomas Koenig from comment #21)
> Another question: Is there anything left to be done with the
> vectorizer, or could we remove that dependency?

Thanks for looking into this again for me. I'm surprised it worked the same on
Linux, but knowing that at least helps debug this issue some more. I'm not sure
about the vectorizer question; maybe that was intended for someone else. The
runtimes seem good as is, though. I doubt the auto-parallelization will add
much speed, but it's an interesting feature that I've always hoped would work.
I've never gotten it to work, though. The only compiler that actually
implemented something was Intel Fortran. It parallelized one trivial loop, but
it slowed the code down instead of speeding it up. The output from gfortran
shows more loops it wants to run in parallel. They aren't important ones, but
something would be better than nothing. If it slowed the code down, I would
just not use it.

There is something different in gfortran in that it mentions a lot of 16-bit
vectorization. I don't recall that from the past, but whatever it's doing seems
fine from a speed perspective.

Some compliments to the developers: the code compiles very fast compared to
other compilers. I'm really glad it doesn't rely on Microsoft Visual Studio;
that's a huge, time-consuming install, and I was very happy I could finally
uninstall it. Also, gfortran handles all my STOP statements properly. PGI
Community Edition was adding a bunch of nonsense output any time a STOP command
was issued, so it's nice to have the code work as intended again.

[Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler

2020-06-29 Thread prop_design at protonmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #24 from Anthony  ---
(In reply to rguent...@suse.de from comment #23)
> GCC adds runtime checks for a minimal number of iterations before
> dispatching to the parallelized code - I guess we simply never hit
> the threshold.  This is configurable via --param parloops-min-per-thread,
> the default is 100, the default number of threads is determined the same
> as for OpenMP so you can probably tune that via OMP_NUM_THREADS.

Thanks for that tip. I tried changing the parloops parameters, but no luck. The
only difference was that the maximum thread use went from 2 to 3; core use was
the same.

I added the following, and some variations of these:

--param parloops-min-per-thread=2 (the default was 100, like you said)
--param parloops-chunk-size=1 (the default was zero, so I removed this parameter later)
--param parloops-schedule=auto (tried all options except guided; the default is static)

I was able to check that they were set via:

--help=param -Q
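
For concreteness, a full build-and-run sequence combining those knobs would
look roughly like this (a sketch; the exact parameter values varied between my
runs, and setting OMP_NUM_THREADS follows the earlier tip rather than something
I normally do):

gfortran PROP_DESIGN_MAPS.f -o PROP_DESIGN_MAPS.exe -O3 -ffixed-form -static
-march=x86-64 -mtune=generic -mfpmath=sse -mieee-fp -pthread
-ftree-parallelize-loops=2 --param parloops-min-per-thread=2 -fopt-info-loop

set OMP_NUM_THREADS=2
PROP_DESIGN_MAPS.exe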

Some other things I tried were adding -mthreads and removing -static, but so
far no luck. I also tried using -mthreads instead of -pthread.

I should make clear that I'm testing PROP_DESIGN_MAPS, not MP_PROP_DESIGN.
MP_PROP_DESIGN is ancient, and its added benchmarking loops were interfering
with the optimizer's ability to auto-parallelize (in the past, at least).

[Bug fortran/53957] Polyhedron 11 benchmark: MP_PROP_DESIGN twice as long as other compiler

2020-07-29 Thread prop_design at protonmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

--- Comment #25 from Anthony  ---
(In reply to Anthony from comment #24)

I did more testing, and the added options actually slow the code way down;
however, it is still only using one core. From what I can tell, setting
OMP_PLACES doesn't seem to work. I saw a thread from years ago where someone
had the same problem. I think OMP_PLACES might work on Linux but not on
Windows; that's what the thread I found was saying, but I don't really know.
I've exhausted all the possibilities at this point. The only thing I know for
sure is that I can't get it to use more than one core.
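
For reference, the sort of thing I mean by setting OMP_PLACES is roughly this
(a sketch of the session, not an exact transcript):

set OMP_PLACES=cores
set OMP_NUM_THREADS=2
PROP_DESIGN_MAPS.exe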

[Bug libgomp/97155] New: set GOMP_CPU_AFFINITY="0 1"

2020-09-21 Thread prop_design at protonmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97155

Bug ID: 97155
   Summary: set GOMP_CPU_AFFINITY="0 1"
   Product: gcc
   Version: 9.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libgomp
  Assignee: unassigned at gcc dot gnu.org
  Reporter: prop_design at protonmail dot com
CC: jakub at gcc dot gnu.org
  Target Milestone: ---

I've tried for years to use the auto-parallelization feature of GCC; however,
it has never worked. Recently, it shows signs of working better, but I still
can't use it properly on Windows: the extra threads it creates all run on the
same physical core. As best I can tell, the problem is that if I try to set the
CPU affinity, it comes back as unsupported.

Moreover, using this command:

set GOMP_CPU_AFFINITY="0 1"

yields this output:

libgomp: Affinity not supported on this configuration
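
One extra check that might show what the run time actually picks up (a sketch
using the standard OMP_DISPLAY_ENV variable; I have not verified its output on
this setup, and I'm using PROP_DESIGN_MAPS.exe just as the example executable)
would be:

set GOMP_CPU_AFFINITY="0 1"
set OMP_DISPLAY_ENV=true
PROP_DESIGN_MAPS.exe

With OMP_DISPLAY_ENV set, libgomp should print its OpenMP environment settings,
including the bind and places values, when the program starts.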

I have a dual-core AMD processor and I'm using Windows 10 Home. Currently I am
using TDM-GCC 9.2. I had also tried MSYS2 and, before that, MinGW-Builds. MSYS2
all of a sudden stopped allowing static builds, so I had to switch to TDM-GCC.
Building static is a must for what I'm doing.

Any help would be appreciated. The bug thread where I was trying to get help
with this has gone cold. That thread is at:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53957

If you need the source code and the commands I'm trying to compile with, I can
provide them. However, it seems to be something more basic than what I was
looking into before. I've tried a number of different compiler options before
tracking it down to this.

Thanks,

Anthony