Re: [OMPI users] silent failure for large allgather

2019-09-25 Thread Heinz, Michael William via users
Emmanuel Thomé,

Thanks for bringing this to our attention. It turns out this issue affects all 
OFI providers in open-mpi. We've applied a fix to the 3.0.x and later branches 
of open-mpi/ompi on github. However, you should be aware that this fix simply 
adds the appropriate error message; it does not allow OFI to support message
sizes larger than the OFI provider actually supports. That will require a more 
significant effort which we are evaluating now.
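
In the meantime, one application-side workaround is to split a very large
allgather into slices that each stay under the provider's message-size limit.
A minimal sketch (an illustration only, not an Open MPI-provided routine; the
64 MiB slice size is a placeholder you would tune to your provider):

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Emulate a large MPI_Allgather on MPI_BYTE by looping over slices
     * small enough for the underlying fabric provider. */
    static int chunked_allgather_bytes(const void *sendbuf, void *recvbuf,
                                       size_t count, MPI_Comm comm)
    {
        const size_t chunk = 64UL * 1024 * 1024;  /* placeholder slice size */
        int nranks, rc = MPI_SUCCESS;
        MPI_Comm_size(comm, &nranks);

        char *tmp = malloc((size_t)nranks * chunk);
        if (tmp == NULL) return MPI_ERR_NO_MEM;

        for (size_t off = 0; off < count; off += chunk) {
            size_t len = (count - off < chunk) ? (count - off) : chunk;

            /* Gather this slice from every rank into a scratch buffer,
             * then copy each rank's piece to its final offset. */
            rc = MPI_Allgather((const char *)sendbuf + off, (int)len, MPI_BYTE,
                               tmp, (int)len, MPI_BYTE, comm);
            if (rc != MPI_SUCCESS) break;

            for (int r = 0; r < nranks; r++)
                memcpy((char *)recvbuf + (size_t)r * count + off,
                       tmp + (size_t)r * len, len);
        }
        free(tmp);
        return rc;
    }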

---
Mike Heinz
Networking Fabric Software Engineer
Intel Corporation


Re: [OMPI users] Do not use UCX for shared memory

2019-09-25 Thread Adrian Reber via users
Thanks, that works.

I also opened a UCX bug report and I already got a fix for it:

https://github.com/openucx/ucx/issues/4224

With that patch, UCX also detects the user namespace correctly.

Adrian

On Tue, Sep 24, 2019 at 12:12:54PM -0600, Nathan Hjelm wrote:
> You can use the uct btl if you want to continue to use UCX but want vader for
> shared memory. Typical usage is --mca pml ob1 --mca osc ^ucx --mca btl
> self,vader,uct --mca btl_uct_memory_domains ib/mlx5_0
> 
> -Nathan
> 
> > On Sep 24, 2019, at 11:13 AM, Adrian Reber via users 
> >  wrote:
> > 
> > Now that my PR to autodetect user namespaces has been merged in Open MPI
> > (thanks everyone for the help!) I tried running containers on a
> > UCX-enabled installation. The whole UCX setup confuses me a lot.
> > 
> > Is it possible, with a UCX-enabled installation, to tell Open MPI to use
> > vader for shared memory and not UCX? UCX seems to make assumptions for its
> > shared-memory communication similar to vader's, namely that processes can
> > somehow talk to each other:
> > 
> > mm_posix.c:445  UCX  ERROR Error returned from open in attach. Permission 
> > denied. File name is: /proc/24149/fd/16
> >   mm_ep.c:75   UCX  ERROR failed to connect to remote peer with mm. remote 
> > mm_id: 103719165231238
> > pml_ucx.c:383  Error: ucp_ep_create(proc=6) failed: Shared memory error
> > 
> > If I disable UCX ('--mca pml ^ucx'), shared-memory communication works again,
> > but network-based communication then fails, also for unknown reasons.
> > 
> > I tried to configure UCX with environment variables based on
> > https://github.com/openucx/ucx/wiki/UCX-environment-parameters
> > but that did not work.
> > 
> > So my question is, how can I use Open MPI with UCX, but vader for
> > local communication?
> > 
> > Everything I am doing uses user namespace based containers for every
> > process.
> > 
> >Adrian

[OMPI users] Singleton and Spawn

2019-09-25 Thread Martín Morales via users
Hi all! This is my first post. I'm a newbie with Open MPI (and with MPI in
general!). I recently built the current version of this fabulous software
(v4.0.1) on two Ubuntu 18 machines (a small part of our Beowulf cluster). I
have already read (a lot of) the FAQ and the posts on the users mailing list,
but I can't figure out how to do this (if it can be done at all): I need to
run my parallel programs without the mpirun/mpiexec commands; I need just one
process (on my “master” machine) that dynamically spawns processes (on the
“slave” machines). I have already made some dummy test scripts and they work
fine with the mpirun/mpiexec commands. In the MPI_Info object I set the key
“add-hostfile” with a file containing those 2 machines, mentioned before,
with 4 slots each. Nevertheless, it doesn't work when I just run it as a
singleton program (e.g. ./spawnExample): it throws an error like this: “There
are not enough slots available in the system to satisfy the 7 slots that were
requested by the application:...”. Here I try to start 8 processes on the 2
machines. It seems that one process executes fine on the “master” and it
crashes when it tries to spawn the other 7.
We need this execution scheme because we already have our own software (used
for scientific research) and we need to “incorporate” or “embed” Open MPI
into it.
Thanks in advance, guys!
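
For reference, the pattern I'm describing looks roughly like this (a
simplified sketch; the file and program names are placeholders):

    #include <mpi.h>

    /* Started as a singleton, e.g.:  ./spawnExample
     * hosts.txt (placeholder name) would contain lines like:
     *   master-machine slots=4
     *   slave-machine  slots=4
     */
    int main(int argc, char **argv)
    {
        MPI_Comm intercomm;
        MPI_Info info;
        int errcodes[7];

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        /* Ask the runtime to add the hosts in the file to this job. */
        MPI_Info_set(info, "add-hostfile", "hosts.txt");

        /* Spawn 7 more processes of a worker program (placeholder name). */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 7, info, 0,
                       MPI_COMM_SELF, &intercomm, errcodes);

        MPI_Info_free(&info);
        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }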

Re: [OMPI users] Singleton and Spawn

2019-09-25 Thread Steven Varga via users
As far as I know you have to wire up the connections among the MPI clients,
allocate resources, etc. PMIx is a library that sets up all the processes,
and it is shipped with Open MPI.

The standard HPC method of launching tasks is through job schedulers such as
SLURM or GRID Engine. SLURM's srun is very similar to mpirun: it does the
resource allocation, then launches the jobs on the allocated nodes and cores,
etc. It does this through the PMIx library, or through mpiexec.

When running mpiexec without an integrated job manager, you are responsible
for allocating resources. See mpirun for details on passing host lists,
oversubscription, etc.
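
For example (a sketch only; host names, slot counts, and the application name
are placeholders):

    mpirun --host node1:4,node2:4 -np 8 ./yourApp
    mpirun --hostfile myhosts --oversubscribe -np 8 ./yourApp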

If you are looking for a different, non-MPI-based interconnect, try ZeroMQ or
other remote procedure call libraries -- it won't be simpler, though.

Hope it helps:
Steve

On Wed, Sep 25, 2019, 13:15 Martín Morales via users, <
users@lists.open-mpi.org> wrote:

> Hi all! This is my first post. I'm a newbie with Open MPI (and with MPI in
> general!). I recently built the current version of this fabulous software
> (v4.0.1) on two Ubuntu 18 machines (a small part of our Beowulf cluster). I
> have already read (a lot of) the FAQ and the posts on the users mailing list,
> but I can't figure out how to do this (if it can be done at all): I need to
> run my parallel programs without the mpirun/mpiexec commands; I need just one
> process (on my “master” machine) that dynamically spawns processes (on the
> “slave” machines). I have already made some dummy test scripts and they work
> fine with the mpirun/mpiexec commands. In the MPI_Info object I set the key
> “add-hostfile” with a file containing those 2 machines, mentioned before,
> with 4 slots each. Nevertheless, it doesn't work when I just run it as a
> singleton program (e.g. ./spawnExample): it throws an error like this: “There
> are not enough slots available in the system to satisfy the 7 slots that were
> requested by the application:...”. Here I try to start 8 processes on the 2
> machines. It seems that one process executes fine on the “master” and it
> crashes when it tries to spawn the other 7.
> We need this execution scheme because we already have our own software (used
> for scientific research) and we need to “incorporate” or “embed” Open MPI
> into it.
> Thanks in advance, guys!


[OMPI users] UCX errors after upgrade

2019-09-25 Thread Raymond Muno via users

We are primarily using OpenMPI 3.1.4 but also have 4.0.1 installed.

On our cluster, we were running CentOS 7.5 with updates, alongside 
MLNX_OFED 4.5.x.   OpenMPI was compiled with GCC, Intel, PGI and AOCC 
compilers. We could run with no issues.


To accommodate updates needed to get our IB gear all running at HDR100 
(EDR50 previously) we upgraded to CentOS 7.6.1810 and the current 
MLNX_OFED 4.6.x.


We can no longer reliably run on more than two nodes.

We see errors like:

[epyc-compute-3-2.local:42447] pml_ucx.c:380  Error: 
ucp_ep_create(proc=276) failed: Destination is unreachable
[epyc-compute-3-2.local:42447] pml_ucx.c:447  Error: Failed to resolve 
UCX endpoint for rank 276

[epyc-compute-3-2:42447] *** An error occurred in MPI_Allreduce
[epyc-compute-3-2:42447] *** reported by process 
[47894553493505,47893180318004]

[epyc-compute-3-2:42447] *** on communicator MPI_COMM_WORLD
[epyc-compute-3-2:42447] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-3-2:42447] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,

[epyc-compute-3-2:42447] ***    and potentially your MPI job)
[epyc-compute-3-17.local:36637] PMIX ERROR: UNREACHABLE in file 
server/pmix_server.c at line 2079
[epyc-compute-3-17.local:37008] pml_ucx.c:380  Error: 
ucp_ep_create(proc=147) failed: Destination is unreachable
[epyc-compute-3-17.local:37008] pml_ucx.c:447  Error: Failed to resolve 
UCX endpoint for rank 147
[epyc-compute-3-7.local:39776] 1 more process has sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
[epyc-compute-3-7.local:39776] Set MCA parameter 
"orte_base_help_aggregate" to 0 to see all help / error messages


UCX appears to be part of the MLNX_OFED release, and is version 1.6.0.

OpenMPI is built on the same OS and MLNX_OFED as we are running on
the compute nodes.


I have a case open with Mellanox but it is not clear where this error is 
coming from.


--
 
 Ray Muno

 IT Manager
 e-mail:m...@aem.umn.edu
 Phone:   (612) 625-9531

  University of Minnesota
 Aerospace Engineering and Mechanics Mechanical Engineering
 110 Union St. S.E.  111 Church Street SE
 Minneapolis, MN 55455   Minneapolis, MN 55455



Re: [OMPI users] UCX errors after upgrade

2019-09-25 Thread Jeff Squyres (jsquyres) via users
Can you try the latest 4.0.2rc tarball?  We're very, very close to releasing 
v4.0.2...

I don't know if there's a specific UCX fix in there, but there are a ton of 
other good bug fixes in there since v4.0.1.


On Sep 25, 2019, at 2:12 PM, Raymond Muno via users
<users@lists.open-mpi.org> wrote:


We are primarily using OpenMPI 3.1.4 but also have 4.0.1 installed.

On our cluster, we were running CentOS 7.5 with updates, alongside MLNX_OFED 
4.5.x.   OpenMPI was compiled with GCC, Intel, PGI and AOCC compilers. We could 
run with no issues.

To accommodate updates needed to get our IB gear all running at HDR100 (EDR50 
previously) we upgraded to CentOS 7.6.1810 and the current MLNX_OFED 4.6.x.

We can no longer reliably run on more than two nodes.

We see errors like:

[epyc-compute-3-2.local:42447] pml_ucx.c:380  Error: ucp_ep_create(proc=276) 
failed: Destination is unreachable
[epyc-compute-3-2.local:42447] pml_ucx.c:447  Error: Failed to resolve UCX 
endpoint for rank 276
[epyc-compute-3-2:42447] *** An error occurred in MPI_Allreduce
[epyc-compute-3-2:42447] *** reported by process [47894553493505,47893180318004]
[epyc-compute-3-2:42447] *** on communicator MPI_COMM_WORLD
[epyc-compute-3-2:42447] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-3-2:42447] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
[epyc-compute-3-2:42447] ***and potentially your MPI job)
[epyc-compute-3-17.local:36637] PMIX ERROR: UNREACHABLE in file 
server/pmix_server.c at line 2079
[epyc-compute-3-17.local:37008] pml_ucx.c:380  Error: ucp_ep_create(proc=147) 
failed: Destination is unreachable
[epyc-compute-3-17.local:37008] pml_ucx.c:447  Error: Failed to resolve UCX 
endpoint for rank 147
[epyc-compute-3-7.local:39776] 1 more process has sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
[epyc-compute-3-7.local:39776] Set MCA parameter "orte_base_help_aggregate" to 
0 to see all help / error messages

UCX appears to be part of the MLNX_OFED release, and is version 1.6.0.

OpenMPI is built on the same OS and MLNX_OFED as we are running on the
compute nodes.

I have a case open with Mellanox but it is not clear where this error is coming 
from.

--

 Ray Muno
 IT Manager
 e-mail:   m...@aem.umn.edu
 Phone:   (612) 625-9531

  University of Minnesota
 Aerospace Engineering and Mechanics Mechanical Engineering
 110 Union St. S.E.  111 Church Street SE
 Minneapolis, MN 55455   Minneapolis, MN 55455



--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Singleton and Spawn

2019-09-25 Thread Martín Morales via users
Thanks Steven. So it actually can’t spawn from a singleton?


From: users on behalf of Steven Varga via users
Sent: Wednesday, September 25, 2019 14:50
To: Open MPI Users
Cc: Steven Varga
Subject: Re: [OMPI users] Singleton and Spawn

As far as I know you have to wire up the connections among the MPI clients,
allocate resources, etc. PMIx is a library that sets up all the processes,
and it is shipped with Open MPI.

The standard HPC method of launching tasks is through job schedulers such as
SLURM or GRID Engine. SLURM's srun is very similar to mpirun: it does the
resource allocation, then launches the jobs on the allocated nodes and cores,
etc. It does this through the PMIx library, or through mpiexec.

When running mpiexec without an integrated job manager, you are responsible
for allocating resources. See mpirun for details on passing host lists,
oversubscription, etc.

If you are looking for a different, non-MPI-based interconnect, try ZeroMQ or
other remote procedure call libraries -- it won't be simpler, though.

Hope it helps:
Steve

On Wed, Sep 25, 2019, 13:15 Martín Morales via users,
<users@lists.open-mpi.org> wrote:
Hi all! This is my first post. I'm a newbie with Open MPI (and with MPI in
general!). I recently built the current version of this fabulous software
(v4.0.1) on two Ubuntu 18 machines (a small part of our Beowulf cluster). I
have already read (a lot of) the FAQ and the posts on the users mailing list,
but I can't figure out how to do this (if it can be done at all): I need to
run my parallel programs without the mpirun/mpiexec commands; I need just one
process (on my “master” machine) that dynamically spawns processes (on the
“slave” machines). I have already made some dummy test scripts and they work
fine with the mpirun/mpiexec commands. In the MPI_Info object I set the key
“add-hostfile” with a file containing those 2 machines, mentioned before,
with 4 slots each. Nevertheless, it doesn't work when I just run it as a
singleton program (e.g. ./spawnExample): it throws an error like this: “There
are not enough slots available in the system to satisfy the 7 slots that were
requested by the application:...”. Here I try to start 8 processes on the 2
machines. It seems that one process executes fine on the “master” and it
crashes when it tries to spawn the other 7.
We need this execution scheme because we already have our own software (used
for scientific research) and we need to “incorporate” or “embed” Open MPI
into it.
Thanks in advance, guys!


Re: [OMPI users] Singleton and Spawn

2019-09-25 Thread Ralph Castain via users
Yes, of course it can - however, I believe there is a bug in the add-hostfile 
code path. We can address that problem far easier than moving to a different 
interconnect.


On Sep 25, 2019, at 11:39 AM, Martín Morales via users
<users@lists.open-mpi.org> wrote:

Thanks Steven. So it actually can’t spawn from a singleton?


From: users <users-boun...@lists.open-mpi.org> on behalf of Steven Varga via
users <users@lists.open-mpi.org>
Sent: Wednesday, September 25, 2019 14:50
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Steven Varga <steven.va...@gmail.com>
Subject: Re: [OMPI users] Singleton and Spawn
As far as I know you have to wire up the connections among the MPI clients,
allocate resources, etc. PMIx is a library that sets up all the processes,
and it is shipped with Open MPI.

The standard HPC method of launching tasks is through job schedulers such as
SLURM or GRID Engine. SLURM's srun is very similar to mpirun: it does the
resource allocation, then launches the jobs on the allocated nodes and cores,
etc. It does this through the PMIx library, or through mpiexec.

When running mpiexec without an integrated job manager, you are responsible
for allocating resources. See mpirun for details on passing host lists,
oversubscription, etc.

If you are looking for a different, non-MPI-based interconnect, try ZeroMQ or
other remote procedure call libraries -- it won't be simpler, though.

Hope it helps:
Steve

On Wed, Sep 25, 2019, 13:15 Martín Morales via users, <users@lists.open-mpi.org> wrote:
Hi all! This is my first post. I'm a newbie with Open MPI (and with MPI in
general!). I recently built the current version of this fabulous software
(v4.0.1) on two Ubuntu 18 machines (a small part of our Beowulf cluster). I
have already read (a lot of) the FAQ and the posts on the users mailing list,
but I can't figure out how to do this (if it can be done at all): I need to
run my parallel programs without the mpirun/mpiexec commands; I need just one
process (on my “master” machine) that dynamically spawns processes (on the
“slave” machines). I have already made some dummy test scripts and they work
fine with the mpirun/mpiexec commands. In the MPI_Info object I set the key
“add-hostfile” with a file containing those 2 machines, mentioned before,
with 4 slots each. Nevertheless, it doesn't work when I just run it as a
singleton program (e.g. ./spawnExample): it throws an error like this: “There
are not enough slots available in the system to satisfy the 7 slots that were
requested by the application:...”. Here I try to start 8 processes on the 2
machines. It seems that one process executes fine on the “master” and it
crashes when it tries to spawn the other 7.
We need this execution scheme because we already have our own software (used
for scientific research) and we need to “incorporate” or “embed” Open MPI
into it.
Thanks in advance, guys!



Re: [OMPI users] UCX errors after upgrade

2019-09-25 Thread Raymond Muno via users
We are running against 4.0.2RC2 now. This is using the current Intel
compilers, version 2019 update 4. Still having issues.


[epyc-compute-1-3.local:17402] common_ucx.c:149  Warning: UCX is unable 
to handle VM_UNMAP event. This may cause performance degradation or data 
corruption.
[epyc-compute-1-3.local:17669] common_ucx.c:149  Warning: UCX is unable 
to handle VM_UNMAP event. This may cause performance degradation or data 
corruption.
[epyc-compute-1-3.local:17683] common_ucx.c:149  Warning: UCX is unable 
to handle VM_UNMAP event. This may cause performance degradation or data 
corruption.
[epyc-compute-1-3.local:16626] pml_ucx.c:385  Error: 
ucp_ep_create(proc=265) failed: Destination is unreachable
[epyc-compute-1-3.local:16626] pml_ucx.c:452  Error: Failed to resolve 
UCX endpoint for rank 265

[epyc-compute-1-3:16626] *** An error occurred in MPI_Allreduce
[epyc-compute-1-3:16626] *** reported by process 
[47001162088449,46999827120425]

[epyc-compute-1-3:16626] *** on communicator MPI_COMM_WORLD
[epyc-compute-1-3:16626] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-1-3:16626] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,

[epyc-compute-1-3:16626] *** and potentially your MPI job)


On 9/25/19 1:28 PM, Jeff Squyres (jsquyres) via users wrote:
Can you try the latest 4.0.2rc tarball?  We're very, very close to 
releasing v4.0.2...


I don't know if there's a specific UCX fix in there, but there are a 
ton of other good bug fixes in there since v4.0.1.



On Sep 25, 2019, at 2:12 PM, Raymond Muno via users
<users@lists.open-mpi.org> wrote:


We are primarily using OpenMPI 3.1.4 but also have 4.0.1 installed.

On our cluster, we were running CentOS 7.5 with updates, alongside 
MLNX_OFED 4.5.x.   OpenMPI was compiled with GCC, Intel, PGI and AOCC 
compilers. We could run with no issues.


To accommodate updates needed to get our IB gear all running at 
HDR100 (EDR50 previously) we upgraded to CentOS 7.6.1810 and the 
current MLNX_OFED 4.6.x.


We can no longer reliably run on more than two nodes.

We see errors like:

[epyc-compute-3-2.local:42447] pml_ucx.c:380  Error: 
ucp_ep_create(proc=276) failed: Destination is unreachable
[epyc-compute-3-2.local:42447] pml_ucx.c:447  Error: Failed to 
resolve UCX endpoint for rank 276

[epyc-compute-3-2:42447] *** An error occurred in MPI_Allreduce
[epyc-compute-3-2:42447] *** reported by process 
[47894553493505,47893180318004]

[epyc-compute-3-2:42447] *** on communicator MPI_COMM_WORLD
[epyc-compute-3-2:42447] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-3-2:42447] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,

[epyc-compute-3-2:42447] ***    and potentially your MPI job)
[epyc-compute-3-17.local:36637] PMIX ERROR: UNREACHABLE in file 
server/pmix_server.c at line 2079
[epyc-compute-3-17.local:37008] pml_ucx.c:380  Error: 
ucp_ep_create(proc=147) failed: Destination is unreachable
[epyc-compute-3-17.local:37008] pml_ucx.c:447  Error: Failed to 
resolve UCX endpoint for rank 147
[epyc-compute-3-7.local:39776] 1 more process has sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
[epyc-compute-3-7.local:39776] Set MCA parameter 
"orte_base_help_aggregate" to 0 to see all help / error messages


UCX appears to be part of the MLNX_OFED release, and is version 1.6.0.

OpenMPI is built on the same OS and MLNX_OFED as we are running
on the compute nodes.


I have a case open with Mellanox but it is not clear where this error 
is coming from.


--
Jeff Squyres
jsquy...@cisco.com 


--
 
 Ray Muno

 IT Manager
 University of Minnesota
 Aerospace Engineering and Mechanics Mechanical Engineering
 



Re: [OMPI users] UCX errors after upgrade

2019-09-25 Thread Raymond Muno via users
As a test, I rebooted a set of nodes. The user could then run on 480 cores
across 5 nodes. We could not run beyond two nodes prior to that.


We still get the VM_UNMAP warning, however.

On 9/25/19 2:09 PM, Raymond Muno via users wrote:


We are running against 4.0.2RC2 now. This is using the current Intel
compilers, version 2019 update 4. Still having issues.


[epyc-compute-1-3.local:17402] common_ucx.c:149  Warning: UCX is 
unable to handle VM_UNMAP event. This may cause performance 
degradation or data corruption.
[epyc-compute-1-3.local:17669] common_ucx.c:149  Warning: UCX is 
unable to handle VM_UNMAP event. This may cause performance 
degradation or data corruption.
[epyc-compute-1-3.local:17683] common_ucx.c:149  Warning: UCX is 
unable to handle VM_UNMAP event. This may cause performance 
degradation or data corruption.
[epyc-compute-1-3.local:16626] pml_ucx.c:385  Error: 
ucp_ep_create(proc=265) failed: Destination is unreachable
[epyc-compute-1-3.local:16626] pml_ucx.c:452  Error: Failed to resolve 
UCX endpoint for rank 265

[epyc-compute-1-3:16626] *** An error occurred in MPI_Allreduce
[epyc-compute-1-3:16626] *** reported by process 
[47001162088449,46999827120425]

[epyc-compute-1-3:16626] *** on communicator MPI_COMM_WORLD
[epyc-compute-1-3:16626] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-1-3:16626] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,

[epyc-compute-1-3:16626] ***    and potentially your MPI job)




--
 
 Ray Muno

 IT Manager
 University of Minnesota
 Aerospace Engineering and Mechanics Mechanical Engineering
 



Re: [OMPI users] UCX errors after upgrade

2019-09-25 Thread Jeff Squyres (jsquyres) via users
Thanks Raymond; I have filed an issue for this on Github and tagged the 
relevant Mellanox people:

https://github.com/open-mpi/ompi/issues/7009


On Sep 25, 2019, at 3:09 PM, Raymond Muno via users
<users@lists.open-mpi.org> wrote:


We are running against 4.0.2RC2 now. This is using the current Intel compilers,
version 2019 update 4. Still having issues.

[epyc-compute-1-3.local:17402] common_ucx.c:149  Warning: UCX is unable to 
handle VM_UNMAP event. This may cause performance degradation or data 
corruption.
[epyc-compute-1-3.local:17669] common_ucx.c:149  Warning: UCX is unable to 
handle VM_UNMAP event. This may cause performance degradation or data 
corruption.
[epyc-compute-1-3.local:17683] common_ucx.c:149  Warning: UCX is unable to 
handle VM_UNMAP event. This may cause performance degradation or data 
corruption.
[epyc-compute-1-3.local:16626] pml_ucx.c:385  Error: ucp_ep_create(proc=265) 
failed: Destination is unreachable
[epyc-compute-1-3.local:16626] pml_ucx.c:452  Error: Failed to resolve UCX 
endpoint for rank 265
[epyc-compute-1-3:16626] *** An error occurred in MPI_Allreduce
[epyc-compute-1-3:16626] *** reported by process [47001162088449,46999827120425]
[epyc-compute-1-3:16626] *** on communicator MPI_COMM_WORLD
[epyc-compute-1-3:16626] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-1-3:16626] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
[epyc-compute-1-3:16626] ***and potentially your MPI job)


On 9/25/19 1:28 PM, Jeff Squyres (jsquyres) via users wrote:
Can you try the latest 4.0.2rc tarball?  We're very, very close to releasing 
v4.0.2...

I don't know if there's a specific UCX fix in there, but there are a ton of 
other good bug fixes in there since v4.0.1.


On Sep 25, 2019, at 2:12 PM, Raymond Muno via users
<users@lists.open-mpi.org> wrote:


We are primarily using OpenMPI 3.1.4 but also have 4.0.1 installed.

On our cluster, we were running CentOS 7.5 with updates, alongside MLNX_OFED 
4.5.x.   OpenMPI was compiled with GCC, Intel, PGI and AOCC compilers. We could 
run with no issues.

To accommodate updates needed to get our IB gear all running at HDR100 (EDR50 
previously) we upgraded to CentOS 7.6.1810 and the current MLNX_OFED 4.6.x.

We can no longer reliably run on more than two nodes.

We see errors like:

[epyc-compute-3-2.local:42447] pml_ucx.c:380  Error: ucp_ep_create(proc=276) 
failed: Destination is unreachable
[epyc-compute-3-2.local:42447] pml_ucx.c:447  Error: Failed to resolve UCX 
endpoint for rank 276
[epyc-compute-3-2:42447] *** An error occurred in MPI_Allreduce
[epyc-compute-3-2:42447] *** reported by process [47894553493505,47893180318004]
[epyc-compute-3-2:42447] *** on communicator MPI_COMM_WORLD
[epyc-compute-3-2:42447] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-3-2:42447] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
[epyc-compute-3-2:42447] ***and potentially your MPI job)
[epyc-compute-3-17.local:36637] PMIX ERROR: UNREACHABLE in file 
server/pmix_server.c at line 2079
[epyc-compute-3-17.local:37008] pml_ucx.c:380  Error: ucp_ep_create(proc=147) 
failed: Destination is unreachable
[epyc-compute-3-17.local:37008] pml_ucx.c:447  Error: Failed to resolve UCX 
endpoint for rank 147
[epyc-compute-3-7.local:39776] 1 more process has sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
[epyc-compute-3-7.local:39776] Set MCA parameter "orte_base_help_aggregate" to 
0 to see all help / error messages

UCX appears to be part of the MLNX_OFED release, and is version 1.6.0.

OpenMPI is built on the same OS and MLNX_OFED as we are running on the
compute nodes.

I have a case open with Mellanox but it is not clear where this error is coming 
from.



--
Jeff Squyres
jsquy...@cisco.com


--

 Ray Muno
 IT Manager
 University of Minnesota
 Aerospace Engineering and Mechanics Mechanical Engineering




--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Singleton and Spawn

2019-09-25 Thread Martín Morales via users
Thanks Ralph, but if I have a wrong hostfile path in my MPI_Comm_spawn
call, why does it work if I run with mpirun (e.g. mpirun -np 1 ./spawnExample)?

From: Ralph Castain
Sent: Wednesday, September 25, 2019 15:42
To: Open MPI Users
Cc: steven.va...@gmail.com; Martín Morales
Subject: Re: [OMPI users] Singleton and Spawn

Yes, of course it can - however, I believe there is a bug in the add-hostfile 
code path. We can address that problem far easier than moving to a different 
interconnect.


On Sep 25, 2019, at 11:39 AM, Martín Morales via users
<users@lists.open-mpi.org> wrote:

Thanks Steven. So it actually can’t spawn from a singleton?


From: users <users-boun...@lists.open-mpi.org> on behalf of Steven Varga via
users <users@lists.open-mpi.org>
Sent: Wednesday, September 25, 2019 14:50
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Steven Varga <steven.va...@gmail.com>
Subject: Re: [OMPI users] Singleton and Spawn

As far as I know you have to wire up the connections among the MPI clients,
allocate resources, etc. PMIx is a library that sets up all the processes,
and it is shipped with Open MPI.

The standard HPC method of launching tasks is through job schedulers such as
SLURM or GRID Engine. SLURM's srun is very similar to mpirun: it does the
resource allocation, then launches the jobs on the allocated nodes and cores,
etc. It does this through the PMIx library, or through mpiexec.

When running mpiexec without an integrated job manager, you are responsible
for allocating resources. See mpirun for details on passing host lists,
oversubscription, etc.

If you are looking for a different, non-MPI-based interconnect, try ZeroMQ or
other remote procedure call libraries -- it won't be simpler, though.

Hope it helps:
Steve

On Wed, Sep 25, 2019, 13:15 Martín Morales via users,
<users@lists.open-mpi.org> wrote:
Hi all! This is my first post. I'm a newbie with Open MPI (and with MPI in
general!). I recently built the current version of this fabulous software
(v4.0.1) on two Ubuntu 18 machines (a small part of our Beowulf cluster). I
have already read (a lot of) the FAQ and the posts on the users mailing list,
but I can't figure out how to do this (if it can be done at all): I need to
run my parallel programs without the mpirun/mpiexec commands; I need just one
process (on my “master” machine) that dynamically spawns processes (on the
“slave” machines). I have already made some dummy test scripts and they work
fine with the mpirun/mpiexec commands. In the MPI_Info object I set the key
“add-hostfile” with a file containing those 2 machines, mentioned before,
with 4 slots each. Nevertheless, it doesn't work when I just run it as a
singleton program (e.g. ./spawnExample): it throws an error like this: “There
are not enough slots available in the system to satisfy the 7 slots that were
requested by the application:...”. Here I try to start 8 processes on the 2
machines. It seems that one process executes fine on the “master” and it
crashes when it tries to spawn the other 7.
We need this execution scheme because we already have our own software (used
for scientific research) and we need to “incorporate” or “embed” Open MPI
into it.
Thanks in advance, guys!



Re: [OMPI users] Singleton and Spawn

2019-09-25 Thread Ralph Castain via users
It's a different code path, that's all - just a question of what path gets 
traversed.

Would you mind posting a little more info on your two use-cases? For example, 
do you have a default hostfile telling mpirun what machines to use? 
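
(For clarity: a default hostfile is just a text file listing the machines and
slot counts that mpirun may use -- e.g., with placeholder host names:

    master-machine slots=4
    slave-machine  slots=4

-- typically pointed to with mpirun's --hostfile/--default-hostfile options or
the orte_default_hostfile MCA parameter.)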


On Sep 25, 2019, at 12:41 PM, Martín Morales <martineduardomora...@hotmail.com> wrote:

Thanks Ralph, but if I have a wrong hostfile path in my MPI_Comm_spawn
call, why does it work if I run with mpirun (e.g. mpirun -np 1 ./spawnExample)?

From: Ralph Castain <r...@open-mpi.org>
Sent: Wednesday, September 25, 2019 15:42
To: Open MPI Users <users@lists.open-mpi.org>
Cc: steven.va...@gmail.com; Martín Morales <martineduardomora...@hotmail.com>
Subject: Re: [OMPI users] Singleton and Spawn
 Yes, of course it can - however, I believe there is a bug in the add-hostfile 
code path. We can address that problem far easier than moving to a different 
interconnect.


On Sep 25, 2019, at 11:39 AM, Martín Morales via users
<users@lists.open-mpi.org> wrote:

Thanks Steven. So it actually can’t spawn from a singleton?


From: users <users-boun...@lists.open-mpi.org> on behalf of Steven Varga via
users <users@lists.open-mpi.org>
Sent: Wednesday, September 25, 2019 14:50
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Steven Varga <steven.va...@gmail.com>
Subject: Re: [OMPI users] Singleton and Spawn
As far as I know you have to wire up the connections among the MPI clients,
allocate resources, etc. PMIx is a library that sets up all the processes,
and it is shipped with Open MPI.

The standard HPC method of launching tasks is through job schedulers such as
SLURM or GRID Engine. SLURM's srun is very similar to mpirun: it does the
resource allocation, then launches the jobs on the allocated nodes and cores,
etc. It does this through the PMIx library, or through mpiexec.

When running mpiexec without an integrated job manager, you are responsible
for allocating resources. See mpirun for details on passing host lists,
oversubscription, etc.

If you are looking for a different, non-MPI-based interconnect, try ZeroMQ or
other remote procedure call libraries -- it won't be simpler, though.

Hope it helps:
Steve

On Wed, Sep 25, 2019, 13:15 Martín Morales via users, <users@lists.open-mpi.org> wrote:
Hi all! This is my first post. I'm a newbie with Open MPI (and with MPI in
general!). I recently built the current version of this fabulous software
(v4.0.1) on two Ubuntu 18 machines (a small part of our Beowulf cluster). I
have already read (a lot of) the FAQ and the posts on the users mailing list,
but I can't figure out how to do this (if it can be done at all): I need to
run my parallel programs without the mpirun/mpiexec commands; I need just one
process (on my “master” machine) that dynamically spawns processes (on the
“slave” machines). I have already made some dummy test scripts and they work
fine with the mpirun/mpiexec commands. In the MPI_Info object I set the key
“add-hostfile” with a file containing those 2 machines, mentioned before,
with 4 slots each. Nevertheless, it doesn't work when I just run it as a
singleton program (e.g. ./spawnExample): it throws an error like this: “There
are not enough slots available in the system to satisfy the 7 slots that were
requested by the application:...”. Here I try to start 8 processes on the 2
machines. It seems that one process executes fine on the “master” and it
crashes when it tries to spawn the other 7.
We need this execution scheme because we already have our own software (used
for scientific research) and we need to “incorporate” or “embed” Open MPI
into it.
Thanks in advance, guys!