Re: [OMPI users] opal_path_nfs freeze

2020-04-23 Thread Patrick Bégou via users
Hi Jeff

As we say in French, "dans le mille!" (bull's-eye): you were right.
I'm not the admin of these servers, and an "mpirun not found" seemed
sufficient evidence in my mind that no Open MPI was installed. It wasn't.

Since I had already deployed OpenMPI 4.0.2, I launched a new build after
setting my LD_LIBRARY_PATH so that the installed OpenMPI 4.0.2 libs come
before all other locations, and all tests were successful.

I think this should be handled in the test script, since we usually run
"make check" before "make install". Setting LD_LIBRARY_PATH so that the
temporary directory where the libs are built comes first, before launching
the tests, would be enough to avoid this wrong behavior.
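
For illustration, a sketch of what worked here; the install prefix and the
build-tree directory names below are placeholders, not the actual paths on
these servers:

# Make the already-installed OpenMPI 4.0.2 libs win over any stale
# system-wide Open MPI before building and running the test suite
# (the /opt/openmpi-4.0.2 prefix is hypothetical).
export LD_LIBRARY_PATH=/opt/openmpi-4.0.2/lib:$LD_LIBRARY_PATH
./configure --prefix=/opt/openmpi-4.0.2
make -j 8
make check        # opal_path_nfs now completes instead of hanging

# Variant of the suggestion above: prepend the just-built libs from the
# build tree instead (directory layout assumed, not verified):
# export LD_LIBRARY_PATH=$PWD/opal/.libs:$PWD/ompi/.libs:$LD_LIBRARY_PATH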

I didn't actually wait for an hour in front of my keyboard :-D; it was lunch
time, and I was suspecting some timeout problem, since NFS means... network!

Thanks a lot for providing the solution so quickly.

Patrick

On 22/04/2020 at 20:17, Jeff Squyres (jsquyres) wrote:
> The test should only take a few moments; no need to let it sit for a
> full hour.
>
> I have seen this kind of behavior before if you have an Open MPI
> installation in your PATH / LD_LIBRARY_PATH already, and then you
> invoke "make check".
>
> Because the libraries may be the same name and/or .so version numbers,
> there may be confusion in the tests setup scripts about exactly which
> libraries to use (the installed versions or the ones you just built /
> are trying to test).
>
> This is a long way of saying: make sure that you have no other Open
> MPI installation findable in your PATH / LD_LIBRARY_PATH and then try
> running `make check` again.
>
>
>> On Apr 21, 2020, at 2:37 PM, Patrick Bégou via users
>> <users@lists.open-mpi.org> wrote:
>>
>> Hi OpenMPI maintainers,
>>
>>
>> I have temporary access to servers with AMD Epyc processors running
>> RHEL7.
>>
>> I'm trying to deploy OpenMPI with several setup but each time "make
>> check" fails on *opal_path_nfs*. This test freeze for ever consuming
>> no cpu resources.
>>
>> After nearly one hour I have killed the process.
>>
>> *_In test-suite.log I have:_*
>>
>> 
>>    Open MPI v3.1.x-201810100324-c8e9819: test/util/test-suite.log
>> 
>>
>> # TOTAL: 3
>> # PASS:  2
>> # SKIP:  0
>> # XFAIL: 0
>> # FAIL:  1
>> # XPASS: 0
>> # ERROR: 0
>>
>> .. contents:: :depth: 2
>>
>> FAIL: opal_path_nfs
>> ===
>>
>> FAIL opal_path_nfs (exit status: 137)
>>
>>
>> _*In opal_path_nfs.out I have a list of path:*_
>>
>> /proc proc
>> /sys sysfs
>> /dev devtmpfs
>> /run tmpfs
>> / xfs
>> /sys/kernel/security securityfs
>> /dev/shm tmpfs
>> /dev/pts devpts
>> /sys/fs/cgroup tmpfs
>> /sys/fs/cgroup/systemd cgroup
>> /sys/fs/pstore pstore
>> /sys/firmware/efi/efivars efivarfs
>> /sys/fs/cgroup/hugetlb cgroup
>> /sys/fs/cgroup/pids cgroup
>> /sys/fs/cgroup/net_cls,net_prio cgroup
>> /sys/fs/cgroup/devices cgroup
>> /sys/fs/cgroup/cpu,cpuacct cgroup
>> /sys/fs/cgroup/freezer cgroup
>> /sys/fs/cgroup/perf_event cgroup
>> /sys/fs/cgroup/cpuset cgroup
>> /sys/fs/cgroup/memory cgroup
>> /sys/fs/cgroup/blkio cgroup
>> /proc/sys/fs/binfmt_misc autofs
>> /sys/kernel/debug debugfs
>> /dev/hugepages hugetlbfs
>> /dev/mqueue mqueue
>> /sys/kernel/config configfs
>> /proc/sys/fs/binfmt_misc binfmt_misc
>> /boot/efi vfat
>> /local xfs
>> /var xfs
>> /tmp xfs
>> /var/lib/nfs/rpc_pipefs rpc_pipefs
>> /home nfs
>> /cm/shared nfs
>> /scratch nfs
>> /run/user/1013 tmpfs
>> /run/user/1010 tmpfs
>> /run/user/1046 tmpfs
>> /run/user/1015 tmpfs
>> /run/user/1121 tmpfs
>> /run/user/1113 tmpfs
>> /run/user/1126 tmpfs
>> /run/user/1002 tmpfs
>> /run/user/1130 tmpfs
>> /run/user/1004 tmpfs
>>
>> _*In opal_path_nfs.log:*_
>>
>> FAIL opal_path_nfs (exit status: 137)
>>
>>
>> The compiler is GCC 9.2.
>>
>> I've also tested openmpi-4.0.3 built with gcc 8.2. Same problem.
>>
>> Thanks for your help.
>>
>> Patrick
>>
>>
>
>
> -- 
> Jeff Squyres
> jsquy...@cisco.com 
>



[OMPI users] Can't start jobs with srun.

2020-04-23 Thread Prentice Bisbal via users
I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the software with a
very simple "hello world" MPI program that I've used reliably for years. When
I submit the job through Slurm and use srun to launch it, I get these errors:


*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort before MPI_INIT completed 
completed successfully, but am not able to aggregate error messages, and 
not able to guarantee that all other processes were killed!

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26076] Local abort before MPI_INIT completed 
completed successfully, but am not able to aggregate error messages, and 
not able to guarantee that all other processes were killed!


If I run the same job but use mpiexec or mpirun instead of srun, the jobs
run just fine. I checked ompi_info to make sure OpenMPI was compiled with
Slurm support:


$ ompi_info | grep slurm
  Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3'
                          '--disable-silent-rules' '--enable-shared'
                          '--with-pmix=internal' '--with-slurm' '--with-psm'
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
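
(Purely for illustration, the workflow being described is an sbatch script
that launches the program with srun; everything below is a hypothetical
stand-in, not the actual job script or program name.)

#!/bin/bash
#SBATCH --nodes=2             # example values only
#SBATCH --ntasks-per-node=4
srun ./hello_mpi              # fails with the MPI_Init errors above;
                              # mpirun ./hello_mpi with the same binary works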

Any ideas what could be wrong? Do you need any additional information?

Prentice



Re: [OMPI users] Can't start jobs with srun.

2020-04-23 Thread Ralph Castain via users
Is Slurm built with PMIx support? Did you tell srun to use it?
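
A quick way to check both (plugin naming is an assumption: depending on how
Slurm was built it may appear as pmix, pmix_v2, or pmix_v3, and ./hello_mpi
is just a placeholder for the test program):

srun --mpi=list                     # "pmix" should be listed if Slurm has PMIx support
srun --mpi=pmix -n 4 ./hello_mpi    # explicitly select the PMIx plugin at launch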


> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
>  wrote:
> [...]




Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-23 Thread Prentice Bisbal via users

It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:

Is Slurm built with PMIx support? Did you tell srun to use it?



[...]





Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-23 Thread Ralph Castain via users
No, but you do have to explicitly build OMPI with non-PMIx support if that is 
what you are going to use. In this case, you need to configure OMPI with 
--with-pmi2=<path>

You can leave off the path (i.e., just "--with-pmi2") if Slurm was installed in 
a standard location, as we should find it there.
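
Concretely, a sketch of the reconfigure being suggested; this just appends
the flag named above to the configure line shown in the ompi_info output
(add =<path> if Slurm's PMI-2 files are not in a standard location):

./configure --prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3 \
    --disable-silent-rules --enable-shared --with-pmix=internal \
    --with-slurm --with-psm --with-pmi2
make -j 8 && make install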


> On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
>  wrote:
> [...]




Re: [OMPI users] opal_path_nfs freeze

2020-04-23 Thread Jeff Squyres (jsquyres) via users
On Apr 23, 2020, at 8:50 AM, Patrick Bégou  
wrote:
> 
> As we say in French, "dans le mille!" (bull's-eye): you were right.
> I'm not the admin of these servers, and an "mpirun not found" seemed
> sufficient evidence in my mind that no Open MPI was installed. It wasn't.
> 
> Since I had already deployed OpenMPI 4.0.2, I launched a new build after
> setting my LD_LIBRARY_PATH so that the installed OpenMPI 4.0.2 libs come
> before all other locations, and all tests were successful.

Sweet!  Glad you got it worked out.

> I think this should be handled in the test script, since we usually run
> "make check" before "make install". Setting LD_LIBRARY_PATH so that the
> temporary directory where the libs are built comes first, before launching
> the tests, would be enough to avoid this wrong behavior.

It's a little hard for us to do this, for a few reasons:

- we can't just set LD_LIBRARY_PATH to empty, because we can't know what else 
is needed from LD_LIBRARY_PATH to run the tests (e.g., network stack libraries, 
other support libraries / dependencies)
- the "make check" infrastructure is wholly provided by GNU Automake.  It takes 
care of setting LD_LIBRARY_PATH to the Right Places in the build tree to point 
to the just-built libraries.  I don't know how we'd hijack / patch that.
- it's even fairly tricky to write a negative test (e.g., check to see if the 
"wrong" libmpi/etc. will be found), because which library(ies) are used changes 
depending on which test is run, and re-creating the .so version number rules 
in a "look for a match" script could be kinda difficult (and OS-specific).

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-23 Thread Prentice Bisbal via users
--mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= 
to either of them, my job still fails. Why is that? Can I not trust the 
output of --mpi=list?


Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:

[...]





Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-23 Thread Ralph Castain via users
You can trust the --mpi=list output. The problem is likely that OMPI wasn't 
configured with --with-pmi2.
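
One way to confirm what the current build actually contains (the component
names here are from memory, so treat them as approximate):

ompi_info | grep -i pmi
# Look for the Slurm PMI components (e.g. "pmix: s1" / "pmix: s2" in the
# 4.0.x series); if only the internal pmix component shows up, then
# srun --mpi=pmi2 has nothing to talk to and the job aborts in MPI_Init,
# as seen above.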


> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
>  wrote:
> 
> [...]