Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Charles A Taylor via users
This looks a lot like a problem I had with OpenMPI 3.1.2.  I thought the fix 
landed in 4.0.0, but you might want to check the code to be sure there wasn’t 
a regression in 4.1.x.  Most of our codes are still running 3.1.2, so I 
haven’t built anything beyond 4.0.0, which definitely included the fix.

See:

- Apply patch for memory leak associated with UCX PML.
  - https://github.com/openucx/ucx/issues/2921
  - https://github.com/open-mpi/ompi/pull/5878

Charles Taylor
UF Research Computing


> On Jun 19, 2019, at 2:26 PM, Noam Bernstein via users wrote:
> 
>> On Jun 19, 2019, at 2:00 PM, John Hearns via users wrote:
>> 
>> Noam, it may be a stupid question. Could you try running slabtop as the 
>> program executes?
> 
> The top SIZE usage is this line:
>    OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 5937540 5937540 100%    0.09K 141370       42    565480K kmalloc-96
> which seems to be growing continuously. However, it’s much smaller than the 
> drop in free memory.  It gets to around 1 GB after tens of seconds (500 MB 
> here), but the overall free memory is dropping by about 1 GB / second, so 
> tens of GB over the same time.
> 
>> 
>> Also, 'watch cat /proc/meminfo' is a good diagnostic.
> 
> Other than MemFree dropping, I don’t see much. Here’s a diff, 10 seconds 
> apart:
> 2,3c2,3
> < MemFree:54229400 kB
> < MemAvailable:   54271804 kB
> ---
> > MemFree:45010772 kB
> > MemAvailable:   45054200 kB
> 19c19
> < AnonPages:  22063260 kB
> ---
> > AnonPages:  22526300 kB
> 22,24c22,24
> < Slab: 851380 kB
> < SReclaimable:  87100 kB
> < SUnreclaim:   764280 kB
> ---
> > Slab:1068208 kB
> > SReclaimable:  89148 kB
> > SUnreclaim:   979060 kB
> 31c31
> < Committed_AS:   34976896 kB
> ---
> > Committed_AS:   34977680 kB
> 
> MemFree has dropped by 9 GB, but as far as I can tell nothing else has 
> increased by anything near as much, so I don’t know where the memory is going.
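[Editor's note: the diff above can be produced directly; a minimal sketch, with the snapshot paths and the 10-second interval as arbitrary choices:]

```shell
# Snapshot /proc/meminfo twice, 10 seconds apart, and diff the snapshots;
# the counters that move between snapshots show where memory is going.
cat /proc/meminfo > /tmp/meminfo.before
sleep 10
cat /proc/meminfo > /tmp/meminfo.after
diff /tmp/meminfo.before /tmp/meminfo.after || true   # diff exits 1 when the files differ
```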
> 
>   Noam
> 
> 
> 
> 
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil 
> 
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users


> On Jun 20, 2019, at 4:44 AM, Charles A Taylor wrote:
> 
> This looks a lot like a problem I had with OpenMPI 3.1.2.  I thought the fix 
> landed in 4.0.0, but you might want to check the code to be sure there wasn’t 
> a regression in 4.1.x.  Most of our codes are still running 3.1.2, so I 
> haven’t built anything beyond 4.0.0, which definitely included the fix.

Unfortunately, 4.0.0 behaves the same.  

One thing that I’m wondering if anyone familiar with the internals can explain 
is how you get a memory leak that isn’t freed when the program ends.  Doesn’t 
that suggest that it’s something lower level, like maybe a kernel issue?

Noam



Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Jeff Squyres (jsquyres) via users
On Jun 20, 2019, at 9:31 AM, Noam Bernstein via users wrote:
> 
> One thing that I’m wondering if anyone familiar with the internals can 
> explain is how you get a memory leak that isn’t freed when the program ends? 
> Doesn’t that suggest that it’s something lower level, like maybe a kernel 
> issue?

If "top" doesn't show processes eating up the memory, and killing processes 
(e.g., MPI processes) doesn't give you memory back, then it's likely that 
something in the kernel is leaking memory.

Have you tried the latest version of UCX -- including their kernel drivers -- 
from Mellanox (vs. inbox/CentOS)?

-- 
Jeff Squyres
jsquy...@cisco.com


Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread John Hearns via users
The kernel using memory is why I suggested running slabtop, to see the
kernel slab allocations.
Clearly I was barking up the wrong tree there...

On Thu, 20 Jun 2019 at 14:41, Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> On Jun 20, 2019, at 9:31 AM, Noam Bernstein via users <
> users@lists.open-mpi.org> wrote:
> >
> > One thing that I’m wondering if anyone familiar with the internals can
> explain is how you get a memory leak that isn’t freed when the program
> ends?  Doesn’t that suggest that it’s something lower level, like maybe a
> kernel issue?
>
> If "top" doesn't show processes eating up the memory, and killing
> processes (e.g., MPI processes) doesn't give you memory back, then it's
> likely that something in the kernel is leaking memory.
>
> Have you tried the latest version of UCX -- including their kernel drivers
> -- from Mellanox (vs. inbox/CentOS)?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 9:40 AM, Jeff Squyres (jsquyres) wrote:
> 
> On Jun 20, 2019, at 9:31 AM, Noam Bernstein via users wrote:
>> 
>> One thing that I’m wondering if anyone familiar with the internals can 
>> explain is how you get a memory leak that isn’t freed when the program 
>> ends?  Doesn’t that suggest that it’s something lower level, like maybe a 
>> kernel issue?
> 
> If "top" doesn't show processes eating up the memory, and killing processes 
> (e.g., MPI processes) doesn't give you memory back, then it's likely that 
> something in the kernel is leaking memory.

That’s definitely what’s happening.  “free” is reporting a lot of memory used, 
but summing the per-process values from ps gives much less.
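[Editor's note: one way to see that gap directly; a sketch, since free and ps output formats vary slightly between distributions:]

```shell
# Compare kernel-reported usage with the sum of per-process resident set
# sizes; a large difference is memory held outside any process (kernel
# slab, shared memory, page cache).
free -m
ps -eo rss= | awk '{ total += $1 } END { printf "total process RSS: %d MB\n", total / 1024 }'
```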

> 
> Have you tried the latest version of UCX -- including their kernel drivers -- 
> from Mellanox (vs. inbox/CentOS)?
> 

I’ve tried the latest ucx from the ucx web site, 1.5.1, which doesn’t change 
the behavior.

I haven’t yet tried the latest OFED or Mellanox low level stuff.  That’s next 
on my list, but slightly more involved to do, so I’ve been avoiding it.

thanks,
Noam

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Yann Jobic via users

Hi,

On 6/20/2019 3:31 PM, Noam Bernstein via users wrote:



On Jun 20, 2019, at 4:44 AM, Charles A Taylor wrote:


This looks a lot like a problem I had with OpenMPI 3.1.2.  I thought 
the fix landed in 4.0.0, but you might want to check the code to be sure 
there wasn’t a regression in 4.1.x.  Most of our codes are still running 
3.1.2, so I haven’t built anything beyond 4.0.0, which definitely 
included the fix.


Unfortunately, 4.0.0 behaves the same.

One thing that I’m wondering if anyone familiar with the internals can 
explain is how you get a memory leak that isn’t freed when the program 
ends?  Doesn’t that suggest that it’s something lower level, like maybe 
a kernel issue?


Maybe it's only some data in cache memory, which is tagged as "used" 
but which the kernel can reclaim if needed. Have you tried to use the whole 
memory again with your code? It should work.
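[Editor's note: one crude way to test that suggestion; a sketch, with 1 GB here for illustration, though on a real node you would size it near the apparent shortfall:]

```shell
# Allocate and touch a large block; if this succeeds without the OOM
# killer firing, the "missing" memory was reclaimable after all.
python3 -c "buf = bytearray(1024**3); print('allocated', len(buf) // 2**20, 'MB')"
# prints "allocated 1024 MB"
```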


Yann



Noam



Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread John Hearns via users
Errr...  have you dropped caches?   echo 3 > /proc/sys/vm/drop_caches
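[Editor's note: spelled out as a sketch, guarded so it is a no-op for non-root users, since writing to drop_caches needs root:]

```shell
# Writing 3 to drop_caches frees clean page cache, dentries and inodes;
# sync first so dirty pages are written back. Compare "free" before and
# after to see how much of the "used" memory was really just cache.
free -m
if [ "$(id -u)" -eq 0 ]; then
    sync
    echo 3 > /proc/sys/vm/drop_caches
fi
free -m
```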


On Thu, 20 Jun 2019 at 15:59, Yann Jobic via users wrote:

> Hi,
>
> On 6/20/2019 3:31 PM, Noam Bernstein via users wrote:
> >
> >
> >> On Jun 20, 2019, at 4:44 AM, Charles A Taylor wrote:
> >>
> >> This looks a lot like a problem I had with OpenMPI 3.1.2.  I thought
> >> the fix was landed in 4.0.0 but you might
> >> want to check the code to be sure there wasn’t a regression in 4.1.x.
> >>  Most of our codes are still running
> >> 3.1.2 so I haven’t built anything beyond 4.0.0 which definitely
> >> included the fix.
> >
> > Unfortunately, 4.0.0 behaves the same.
> >
> > One thing that I’m wondering if anyone familiar with the internals can
> > explain is how you get a memory leak that isn’t freed when the program
> > ends?  Doesn’t that suggest that it’s something lower level, like maybe
> > a kernel issue?
>
> Maybe it's only some data in cache memory, which is tagged as "used"
> but which the kernel can reclaim if needed. Have you tried to use the whole
> memory again with your code? It should work.
>
> Yann
>

Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Jeff Squyres (jsquyres) via users
On Jun 14, 2019, at 2:02 PM, Noam Bernstein via users wrote:
> 
> Hi Jeff - do you remember this issue from a couple of months ago?  

Noam: I'm sorry, I totally missed this email.  My INBOX is a continual 
disaster.  :-(

> Unfortunately, the failure to find pmi.h is still happening.  I just tried 
> with 4.0.1 (not rc), and I still run into the same error (failing to find 
> #include <pmi.h> when compiling opal/mca/pmix/s1/mca_pmix_s1_la-pmix_s1.lo):
> make[2]: Entering directory 
> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal/mca/pmix/s1'
>   CC   mca_pmix_s1_la-pmix_s1.lo
> pmix_s1.c:29:17: fatal error: pmi.h: No such file or directory
>  #include <pmi.h>
>                 ^
> compilation terminated.
> make[2]: *** [mca_pmix_s1_la-pmix_s1.lo] Error 1
> make[2]: Leaving directory 
> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal/mca/pmix/s1'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory 
> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal'
> make: *** [all-recursive] Error 1

I looked back earlier in this thread, and I don't see the version of SLURM 
that you're using.  What version is it?

Is there a pmi2.h in the SLURM installation (i.e., not pmi.h)?

Or is the problem that -I/usr/include/slurm is not passed to the compile line 
(per your output, below)?

> When I dig into what libtool is trying to do, I get (once I remove the 
> —silent flag):

(FWIW, you can also "make V=1" to have it show you all this detail)

-- 
Jeff Squyres
jsquy...@cisco.com


[OMPI users] Intel Compilers

2019-06-20 Thread Charles A Taylor via users
Open MPI probably has one of the largest and most complete configure+build 
systems I’ve ever seen.

I’m surprised, however, that it doesn’t pick up the use of the Intel compilers 
and adjust the command-line parameters as needed.

ifort: command line warning #10006: ignoring unknown option '-pipe'
ifort: command line warning #10157: ignoring option '-W'; argument is of wrong 
type
ifort: command line warning #10006: ignoring unknown option 
'-fparam=ssp-buffer-size=4'
ifort: command line warning #10006: ignoring unknown option '-pipe'
ifort: command line warning #10157: ignoring option '-W'; argument is of wrong 
type
ifort: command line warning #10006: ignoring unknown option 
'-fparam=ssp-buffer-size=4'
ifort: command line warning #10006: ignoring unknown option '-pipe'
ifort: command line warning #10157: ignoring option '-W'; argument is of wrong 
type
ifort: command line warning #10006: ignoring unknown option 
'-fparam=ssp-buffer-size=4'

Maybe I’m missing something.
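[Editor's note: the -pipe and ssp-buffer-size options in the warnings above are GCC-style flags, often inherited from an RPM spec's optflags. A sketch of a configure invocation that makes the Intel toolchain explicit; the flags shown are illustrative assumptions, not a tested recipe:]

```shell
# Hypothetical configure line naming the Intel compilers explicitly and
# supplying plain optimization flags instead of inherited GCC-style ones.
./configure CC=icc CXX=icpc FC=ifort \
    CFLAGS="-O2 -g" CXXFLAGS="-O2 -g" FCFLAGS="-O2 -g" \
    --prefix=/opt/openmpi-4.0.1
```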

Regards,

Charlie Taylor
UF Research Computing

Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Noam Bernstein via users


> On Jun 20, 2019, at 11:54 AM, Jeff Squyres (jsquyres) wrote:
> 
> On Jun 14, 2019, at 2:02 PM, Noam Bernstein via users wrote:
>> 
>> Hi Jeff - do you remember this issue from a couple of months ago?  
> 
> Noam: I'm sorry, I totally missed this email.  My INBOX is a continual 
> disaster.  :-(

No problem.  We’re running with mpirun for now.

> 
>> Unfortunately, the failure to find pmi.h is still happening.  I just tried 
>> with 4.0.1 (not rc), and I still run into the same error (failing to find 
>> #include <pmi.h> when compiling opal/mca/pmix/s1/mca_pmix_s1_la-pmix_s1.lo):
>> make[2]: Entering directory 
>> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal/mca/pmix/s1'
>>  CC   mca_pmix_s1_la-pmix_s1.lo
>> pmix_s1.c:29:17: fatal error: pmi.h: No such file or directory
>> #include <pmi.h>
>>                ^
>> compilation terminated.
>> make[2]: *** [mca_pmix_s1_la-pmix_s1.lo] Error 1
>> make[2]: Leaving directory 
>> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal/mca/pmix/s1'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory 
>> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal'
>> make: *** [all-recursive] Error 1
> 
> I looked back earlier in this thread, and I don't see the version of SLURM 
> that you're using.  What version is it?

18.08, provided for our CentOS 7.6-based Rocks through the slurm roll, so not 
compiled by me.

> 
> Is there a pmi2.h in the SLURM installation (i.e., not pmi.h)?
> 
> Or is the problem that -I/usr/include/slurm is not passed to the compile line 
> (per your output, below)?

/usr/include/slurm has both pmi.h and pmi2.h, but (from what I could tell when 
trying to manually reproduce what make is doing)
-I/usr/include/slurm 
is not being passed when compiling those files.

> 
>> When I dig into what libtool is trying to do, I get (once I remove the 
>> —silent flag):
> 
> (FWIW, you can also "make V=1" to have it show you all this detail)

I’ll check that, to confirm that I’m correct about it not being passed.

Noam



Re: [OMPI users] Intel Compilers

2019-06-20 Thread Charles A Taylor via users


> On Jun 20, 2019, at 12:10 PM, Carlson, Timothy S wrote:
> 
> I’ve never seen that error and have built some flavor of this combination 
> dozens of times.  What version of Intel Compiler and what version of OpenMPI 
> are you trying to build?

[chasman@login4 gizmo-mufasa]$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, 
Version 19.0.1.144 Build 20181018
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.

OpenMPI 4.0.1 

It is probably something I/we are doing that is throwing the configure script 
and macros off.  We include some version (7.3.0 in this case) of gcc in our 
command and library paths because icpc needs the GNU headers for certain 
things.  Perhaps the configure script is picking that up and thinks we are 
using GNU compilers.

I’ll have to look more closely now that I know I’m the only one seeing it.  :(

Charlie Taylor
UF Research Computing


>  
> Tim
>  
> From: users On Behalf Of Charles A Taylor via users
> Sent: Thursday, June 20, 2019 8:55 AM
> To: Open MPI Users
> Cc: Charles A Taylor
> Subject: [OMPI users] Intel Compilers
>  
> OpenMPI probably has one of the largest and most complete configure+build 
> systems I’ve ever seen.  
>  
> I’m surprised however that it doesn’t pick up the use of the intel compilers 
> and modify the command line
> parameters as needed.
>  
> ifort: command line warning #10006: ignoring unknown option '-pipe'
> ifort: command line warning #10157: ignoring option '-W'; argument is of 
> wrong type
> ifort: command line warning #10006: ignoring unknown option 
> '-fparam=ssp-buffer-size=4'
> ifort: command line warning #10006: ignoring unknown option '-pipe'
> ifort: command line warning #10157: ignoring option '-W'; argument is of 
> wrong type
> ifort: command line warning #10006: ignoring unknown option 
> '-fparam=ssp-buffer-size=4'
> ifort: command line warning #10006: ignoring unknown option '-pipe'
> ifort: command line warning #10157: ignoring option '-W'; argument is of 
> wrong type
> ifort: command line warning #10006: ignoring unknown option 
> '-fparam=ssp-buffer-size=4'
>  
> Maybe I’m missing something.
>  
> Regards,
>  
> Charlie Taylor
> UF Research Computing


Re: [OMPI users] Intel Compilers

2019-06-20 Thread Jeff Squyres (jsquyres) via users
Can you send the exact ./configure line you are using to configure Open MPI?


> On Jun 20, 2019, at 12:32 PM, Charles A Taylor via users wrote:
> 
> 
> 
>> On Jun 20, 2019, at 12:10 PM, Carlson, Timothy S wrote:
>> 
>> I’ve never seen that error and have built some flavor of this combination 
>> dozens of times.  What version of Intel Compiler and what version of OpenMPI 
>> are you trying to build?
> 
> [chasman@login4 gizmo-mufasa]$ ifort -V
> Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 
> 64, Version 19.0.1.144 Build 20181018
> Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.
> 
> OpenMPI 4.0.1 
> 
> It is probably something I/we are doing that is throwing the configure script 
> and macros off.  We include some version (7.3.0 in this case) of gcc in our 
> command and library paths because icpc needs the gnu headers for certain 
> things.  Perhaps the configure script is picking that up and thinks we are 
> using gnu.   
> 
> I’ll have to look more closely now that I know I’m the only one seeing it.  :(
> 
> Charlie Taylor
> UF Research Computing
> 
> 
>>  
>> Tim
>>  
>> From: users  On Behalf Of Charles A Taylor 
>> via users
>> Sent: Thursday, June 20, 2019 8:55 AM
>> To: Open MPI Users 
>> Cc: Charles A Taylor 
>> Subject: [OMPI users] Intel Compilers
>>  
>> OpenMPI probably has one of the largest and most complete configure+build 
>> systems I’ve ever seen.  
>>  
>> I’m surprised however that it doesn’t pick up the use of the intel compilers 
>> and modify the command line
>> parameters as needed.
>>  
>> ifort: command line warning #10006: ignoring unknown option '-pipe'
>> ifort: command line warning #10157: ignoring option '-W'; argument is of 
>> wrong type
>> ifort: command line warning #10006: ignoring unknown option 
>> '-fparam=ssp-buffer-size=4'
>> ifort: command line warning #10006: ignoring unknown option '-pipe'
>> ifort: command line warning #10157: ignoring option '-W'; argument is of 
>> wrong type
>> ifort: command line warning #10006: ignoring unknown option 
>> '-fparam=ssp-buffer-size=4'
>> ifort: command line warning #10006: ignoring unknown option '-pipe'
>> ifort: command line warning #10157: ignoring option '-W'; argument is of 
>> wrong type
>> ifort: command line warning #10006: ignoring unknown option 
>> '-fparam=ssp-buffer-size=4'
>>  
>> Maybe I’m missing something.
>>  
>> Regards,
>>  
>> Charlie Taylor
>> UF Research Computing
> 


-- 
Jeff Squyres
jsquy...@cisco.com


Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Jeff Squyres (jsquyres) via users
Ok.

Perhaps we still missed something in the configury.

Worst case, you can:

$ ./configure CPPFLAGS=-I/usr/include/slurm ...rest of your configure params...

That will add the -I to CPPFLAGS, and it will preserve that you set that value 
in the top few lines of config.log.



On Jun 20, 2019, at 12:25 PM, Carlson, Timothy S wrote:
> 
> As of recent you needed to use --with-slurm and --with-pmi2
>  
> While the configure line indicates it picks up pmi2 as part of slurm that is 
> not in fact true and you need to specifically tell it about pmi2
>  
> From: users  On Behalf Of Noam Bernstein 
> via users
> Sent: Thursday, June 20, 2019 9:16 AM
> To: Jeff Squyres (jsquyres) 
> Cc: Noam Bernstein ; Open MPI User's List 
> 
> Subject: Re: [OMPI users] OpenMPI 4 and pmi2 support
>  
>  
> 
> 
> On Jun 20, 2019, at 11:54 AM, Jeff Squyres (jsquyres) wrote:
>  
> On Jun 14, 2019, at 2:02 PM, Noam Bernstein via users wrote:
> 
> 
> Hi Jeff - do you remember this issue from a couple of months ago?  
> 
> Noam: I'm sorry, I totally missed this email.  My INBOX is a continual 
> disaster.  :-(
>  
> No problem.  We’re running with mpirun for now.
>  
> 
> 
> Unfortunately, the failure to find pmi.h is still happening.  I just tried 
> with 4.0.1 (not rc), and I still run into the same error (failing to find 
> #include <pmi.h> when compiling opal/mca/pmix/s1/mca_pmix_s1_la-pmix_s1.lo):
> make[2]: Entering directory 
> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal/mca/pmix/s1'
>  CC   mca_pmix_s1_la-pmix_s1.lo
> pmix_s1.c:29:17: fatal error: pmi.h: No such file or directory
> #include <pmi.h>
>                ^
> compilation terminated.
> make[2]: *** [mca_pmix_s1_la-pmix_s1.lo] Error 1
> make[2]: Leaving directory 
> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal/mca/pmix/s1'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory 
> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal'
> make: *** [all-recursive] Error 1
> 
> I looked back earlier in this thread, and I don't see the version of SLURM 
> that you're using.  What version is it?
>  
> 18.08, provided for our CentOS 7.6-based Rocks through the slurm roll, so not 
> compiled by me.
> 
> 
> 
> Is there a pmi2.h in the SLURM installation (i.e., not pmi.h)?
> 
> Or is the problem that -I/usr/include/slurm is not passed to the compile line 
> (per your output, below)?
>  
> /usr/include/slurm has both pmi.h and pmi2.h, but (from what I could tell 
> when trying to manually reproduce what make is doing)
> -I/usr/include/slurm 
> is not being passed when compiling those files.
>  
> 
> 
> When I dig into what libtool is trying to do, I get (once I remove the 
> —silent flag):
> 
> (FWIW, you can also "make V=1" to have it show you all this detail)
>  
> I’ll check that, to confirm that I’m correct about it not being passed.
>  
>   
>  Noam
>  
> 


-- 
Jeff Squyres
jsquy...@cisco.com


Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 12:25 PM, Carlson, Timothy S wrote:
> 
> As of recent you needed to use --with-slurm and --with-pmi2
>  
> While the configure line indicates it picks up pmi2 as part of slurm that is 
> not in fact true and you need to specifically tell it about pmi2

When I do “./configure --help” there’s no mention of any option related to pmi2 
except the description of --with-pmi-libdir, which I set to /usr/lib64, which 
contains both libpmi and libpmi2.  I tried to pass --with-pmi2 anyway, and got
configure: WARNING: unrecognized options: --with-pmi2

I tried “make V=1”, and this is the command that fails. Note that it doesn’t 
have any reference to the /usr/include/slurm directory I passed to configure in 
--with-pmi:

libtool: compile:  gcc -std=gnu99 -std=gnu99 -DHAVE_CONFIG_H -I. 
-I../../../../opal/include -I../../../../ompi/include 
-I../../../../oshmem/include 
-I../../../../opal/mca/hwloc/hwloc201/hwloc/include/private/autogen 
-I../../../../opal/mca/hwloc/hwloc201/hwloc/include/hwloc/autogen 
-I../../../../ompi/mpiext/cuda/c -I../../../.. -I../../../../orte/include 
-I/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal/mca/event/libevent2022/libevent
 
-I/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal/mca/event/libevent2022/libevent/include
 
-I/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal/mca/hwloc/hwloc201/hwloc/include
 -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -mcx16 -pthread -MT 
mca_pmix_s1_la-pmix_s1.lo -MD -MP -MF .deps/mca_pmix_s1_la-pmix_s1.Tpo -c 
pmix_s1.c  -fPIC -DPIC -o .libs/mca_pmix_s1_la-pmix_s1.o
pmix_s1.c:29:17: fatal error: pmi.h: No such file or directory
 #include <pmi.h>
          ^
compilation terminated.
make[2]: *** [mca_pmix_s1_la-pmix_s1.lo] Error 1


Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Joseph Schuchart via users

Noam,

Another idea: check for stale files in /dev/shm/ (or a subdirectory that 
looks like it belongs to UCX/OpenMPI) and SysV shared memory using `ipcs 
-m`.
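[Editor's note: a sketch of both checks; the awk filter keys on the nattch column of ipcs, counting segments no process is attached to:]

```shell
# Files left in /dev/shm and SysV shared-memory segments with zero
# attached processes are candidates for stale MPI-job leftovers.
ls -l /dev/shm/
ipcs -m | awk '$1 ~ /^0x/ && $6 == 0 { print "stale segment, shmid", $2 }'
```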


Joseph

On 6/20/19 3:31 PM, Noam Bernstein via users wrote:



On Jun 20, 2019, at 4:44 AM, Charles A Taylor wrote:


This looks a lot like a problem I had with OpenMPI 3.1.2.  I thought 
the fix was landed in 4.0.0 but you might
want to check the code to be sure there wasn’t a regression in 4.1.x. 
 Most of our codes are still running
3.1.2 so I haven’t built anything beyond 4.0.0 which definitely 
included the fix.


Unfortunately, 4.0.0 behaves the same.

One thing that I’m wondering if anyone familiar with the internals can 
explain is how you get a memory leak that isn’t freed when the program 
ends?  Doesn’t that suggest that it’s something lower level, like maybe 
a kernel issue?


Noam



Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Charles A Taylor via users
Sure…

+ ./configure 
  --build=x86_64-redhat-linux-gnu \
  --host=x86_64-redhat-linux-gnu \
  --program-prefix= \
  --disable-dependency-tracking \
  --prefix=/apps/mpi/intel/2019.1.144/openmpi/4.0.1 \
  --exec-prefix=/apps/mpi/intel/2019.1.144/openmpi/4.0.1 \
  --bindir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/bin \
  --sbindir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/sbin \
  --sysconfdir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/etc \
  --datadir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/share \
  --includedir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/include \
  --libdir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/lib64 \
  --libexecdir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/libexec \
  --localstatedir=/var \
  --sharedstatedir=/var/lib \
  --mandir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/share/man \
  --infodir=/apps/mpi/intel/2019.1.144/openmpi/4.0.1/share/info \
  C=icc CXX=icpc FC=ifort 'FFLAGS=-O2 -g -warn -m64' LDFLAGS= \
  --enable-static \
  --enable-orterun-prefix-by-default \
  --with-slurm=/opt/slurm \
  --with-pmix=/opt/pmix/3.1.2 \
  --with-pmi=/opt/slurm \
  --with-libevent=external \
  --with-hwloc=external \
  --without-verbs \
  --with-libfabric \
  --with-ucx \
  --with-mxm=no \
  --with-cuda=no \
  --enable-openib-udcm \
  --enable-openib-rdmacm


> On Jun 20, 2019, at 12:49 PM, Jeff Squyres (jsquyres) via users wrote:
> 
> Ok.
> 
> Perhaps we still missed something in the configury.
> 
> Worst case, you can:
> 
> $ ./configure CPPFLAGS=-I/usr/include/slurm ...rest of your configure 
> params...
> 
> That will add the -I to CPPFLAGS, and it will preserve that you set that 
> value in the top few lines of config.log.
> 
> 
> 
> On Jun 20, 2019, at 12:25 PM, Carlson, Timothy S wrote:
>> 
>> As of recent you needed to use --with-slurm and --with-pmi2
>> 
>> While the configure line indicates it picks up pmi2 as part of slurm that is 
>> not in fact true and you need to specifically tell it about pmi2
>> 
>> From: users  On Behalf Of Noam Bernstein 
>> via users
>> Sent: Thursday, June 20, 2019 9:16 AM
>> To: Jeff Squyres (jsquyres) 
>> Cc: Noam Bernstein ; Open MPI User's List 
>> 
>> Subject: Re: [OMPI users] OpenMPI 4 and pmi2 support
>> 
>> 
>> 
>> 
>> On Jun 20, 2019, at 11:54 AM, Jeff Squyres (jsquyres) wrote:
>> 
>> On Jun 14, 2019, at 2:02 PM, Noam Bernstein via users wrote:
>> 
>> 
>> Hi Jeff - do you remember this issue from a couple of months ago?  
>> 
>> Noam: I'm sorry, I totally missed this email.  My INBOX is a continual 
>> disaster.  :-(
>> 
>> No problem.  We’re running with mpirun for now.
>> 
>> 
>> 
>> Unfortunately, the failure to find pmi.h is still happening.  I just tried 
>> with 4.0.1 (not rc), and I still run into the same error (failing to find 
>> #include <pmi.h> when compiling opal/mca/pmix/s1/mca_pmix_s1_la-pmix_s1.lo):
>> make[2]: Entering directory 
>> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal/mca/pmix/s1'
>> CC   mca_pmix_s1_la-pmix_s1.lo
>> pmix_s1.c:29:17: fatal error: pmi.h: No such file or directory
>> #include <pmi.h>
>>         ^
>> compilation terminated.
>> make[2]: *** [mca_pmix_s1_la-pmix_s1.lo] Error 1
>> make[2]: Leaving directory 
>> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal/mca/pmix/s1'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory 
>> `/home_tin/bernadm/configuration/110_compile_mpi/OpenMPI/openmpi-4.0.1/opal'
>> make: *** [all-recursive] Error 1
>> 
>> I looked back earlier in this thread, and I don't see the version of SLURM 
>> that you're using.  What version is it?
>> 
>> 18.08, provided for our CentOS 7.6-based Rocks through the slurm roll, so 
>> not compiled by me.
>> 
>> 
>> 
>> Is there a pmi2.h in the SLURM installation (i.e., not pmi.h)?
>> 
>> Or is the problem that -I/usr/include/slurm is not passed to the compile 
>> line (per your output, below)?
>> 
>> /usr/include/slurm has both pmi.h and pmi2.h, but (from what I could tell 
>> when trying to manually reproduce what make is doing)
>> -I/usr/include/slurm 
>> is not being passed when compiling those files.
>> 
>> 
>> 
>> When I dig into what libtool is trying to do, I get (once I remove the 
>> --silent flag):
>> 
>> (FWIW, you can also "make V=1" to have it show you all this detail)
>> 
>> I’ll check that, to confirm that I’m correct about it not being passed.
>> 
>>  
>>  Noam
>> 
>> 
>> |
>> |
>> |
>> U.S. NAVAL
>> |
>> |
>> _RESEARCH_
>> |
>> LABORATORY
>> Noam Bernstein, Ph.D.
>> Center for Materials Physics and Technology
>> U.S. Naval Research Laboratory
>> T +1 202 404 8628  F +1 202 404 7546
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.nrl.navy.mil&d=DwIGaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=NpYP1iUbEbTx87BW8Gx5ow&m=u1fQ9HzG1l1CRApve71dA4BBKPDM3lRS__c1Ev4h4bM&s=UevOrdYXRuu7JeDg4GBR5Y6tF0ZlSLkb-updK57HYTU&e=
>>  
> 
> 
> -- 
> Jeff

Re: [OMPI users] OpenMPI 4 and pmi2 support

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 12:55 PM, Carlson, Timothy S  wrote:
>  
> Just pass /usr to configure instead of /usr/include/slurm


This seems to have done it (as did passing CPPFLAGS, but this feels cleaner).  
Thank you all for the suggestions.
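For the archives, the working invocation looks roughly like this (a sketch; the exact prefix and any other configure flags depend on your site's SLURM install):

```shell
# Point --with-pmi at the SLURM installation *prefix*, not the include dir;
# configure then locates $prefix/include/slurm/pmi.h and pmi2.h on its own.
./configure --with-pmi=/usr

# The CPPFLAGS workaround mentioned above would instead look like:
#   ./configure CPPFLAGS=-I/usr/include/slurm
```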

Noam



U.S. NAVAL RESEARCH LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil 
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 10:42 AM, Noam Bernstein via users 
>  wrote:
> 
> I haven’t yet tried the latest OFED or Mellanox low level stuff.  That’s next 
> on my list, but slightly more involved to do, so I’ve been avoiding it.
> 

Aha - using Mellanox’s OFED packaging seems to have essentially (if not 100%) fixed 
the issue.  There still appears to be some small leak, but it’s on the order of 1 GB, 
not 10s of GB, and it doesn’t grow continuously.   And on later runs of the 
same code it doesn’t grow any further, so whatever the kernel memory is being 
used for and not released, it can at least be reused for the same purpose.

Thanks for the nudge to check out this option.  Do you happen to know how the 
installer handles kernel updates?  Is it at all automated, or do I need to 
rerun the installer to build new kernel modules each time?

thanks,
Noam


Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Nathan Hjelm via users

THAT is a good idea. When using Omnipath we see an issue with stale files in 
/dev/shm if the application exits abnormally. I don't know if UCX uses that 
space as well.


-Nathan

On June 20, 2019 at 11:05 AM, Joseph Schuchart via users 
 wrote:


Noam,

Another idea: check for stale files in /dev/shm/ (or a subdirectory that
looks like it belongs to UCX/OpenMPI) and SysV shared memory using `ipcs
-m`.

Joseph
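A minimal version of that check might look like the following (the file-name patterns are guesses — UCX and Open MPI vary their /dev/shm naming across versions):

```shell
# Count leftover shared-memory files that look like they came from MPI/UCX.
stale=$(ls /dev/shm 2>/dev/null | grep -Eci 'ucx|ompi|psm|vader' || true)
echo "suspect files in /dev/shm: ${stale:-0}"

# SysV segments: large segments with nattch == 0 are the usual suspects.
command -v ipcs >/dev/null 2>&1 && ipcs -m || true
```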

On 6/20/19 3:31 PM, Noam Bernstein via users wrote:





On Jun 20, 2019, at 4:44 AM, Charles A Taylor <chas...@ufl.edu> wrote:


This looks a lot like a problem I had with OpenMPI 3.1.2.  I thought
the fix was landed in 4.0.0 but you might
want to check the code to be sure there wasn’t a regression in 4.1.x.
 Most of our codes are still running
3.1.2 so I haven’t built anything beyond 4.0.0 which definitely
included the fix.


Unfortunately, 4.0.0 behaves the same.


One thing that I’m wondering if anyone familiar with the internals can
explain is how you get a memory leak that isn’t freed when the program
ends.  Doesn’t that suggest that it’s something lower level, like maybe
a kernel issue?


Noam



U.S. NAVAL RESEARCH LABORATORY


Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil





Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Jeff Squyres (jsquyres) via users
On Jun 20, 2019, at 1:34 PM, Noam Bernstein  wrote:
> 
> Aha - using Mellanox’s OFED packaging seems to have essentially (if not 100%) 
> fixed the issue.  There still appears to be some small leak, but it’s on the 
> order of 1 GB, not 10s of GB, and it doesn’t grow continuously.   And on later 
> runs of the same code it doesn’t grow any further, so whatever the kernel 
> memory is being used for and not released, it can at least be reused for 
> the same purpose.
> 
> Thanks for the nudge to check out this option.  Do you happen to know how the 
> installer handles kernel updates?  Is it at all automated, or do I need to 
> rerun the installer to build new kernel modules each time?

I'm afraid I don't know anything about the Mellanox OFED installer; you'll need 
to check their documentation / check with them.

-- 
Jeff Squyres
jsquy...@cisco.com


Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Noam Bernstein via users
> On Jun 20, 2019, at 1:38 PM, Nathan Hjelm via users 
>  wrote:
> 
> THAT is a good idea. When using Omnipath we see an issue with stale files in 
> /dev/shm if the application exits abnormally. I don't know if UCX uses that 
> space as well.

No stale shm files.  echo 3 > /proc/sys/vm/drop_caches  doesn't do anything 
either.  But waiting a couple of minutes does cause the output of "free" to 
drop down to the normal idle level.  So not really a leak.


Anyway, thanks to everyone who gave suggestions.  For the moment I'm going to 
hope that the Mellanox OFED package will continue to work.  I've combined it 
with the SDSC Mellanox OFED roll, and it's a reasonably clean process, although 
I will have to redo it (perhaps automated some day) for each kernel version.

Noam
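On the kernel-update question raised earlier in the thread: if the install was done with Mellanox's mlnxofedinstall script, rebuilding the modules after a kernel update is typically a re-run with kernel-support options — a hedged sketch; check the MLNX_OFED release notes for your version, as flags and paths differ between releases:

```shell
# Rebuild MLNX_OFED kernel modules against the currently running kernel.
# --add-kernel-support rebuilds the kernel packages from source; it is not
# automatic on kernel updates, so it must be re-run (or scripted) each time.
./mlnxofedinstall --add-kernel-support
```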
