Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Wonderful! I am happy to confirm that this resolves the issue! Many thanks to everyone for their assistance, Collin

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Okay, that nailed it down - the problem is that the number of open file descriptors is exceeding your system limit. I suspect the connection to the Mellanox drivers is solely due to them also having some descriptors open, and you are just close enough to the boundary that it causes you to hit it. See
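
For reference, a minimal sketch of how to inspect and, if needed, raise the per-process open-file limit on Linux (assuming a bash shell; the 65536 value below is only illustrative):

  ulimit -Sn           # show the current soft limit on open file descriptors
  ulimit -Hn           # show the hard limit
  ulimit -n 65536      # raise the soft limit for this shell, up to the hard limit
  # To make the change persistent, entries such as the following can be added to
  # /etc/security/limits.conf (values illustrative):
  #   *   soft   nofile   65536
  #   *   hard   nofile   65536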

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
I agree that it is odd that the issue does not appear until after the Mellanox drivers have been installed (and the configure flags set to use them). As requested, here are the results. Input: mpirun -np 128 --mca odls_base_verbose 10 --mca state_base_verbose 10 hostname Output: [Gen2Node3:54

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Okay, debug-daemons isn't going to help as we aren't launching any daemons. This is all one node. So try adding "--mca odls_base_verbose 10 --mca state_base_verbose 10" to the cmd line and let's see what is going on. I agree with Josh - neither mpirun nor hostname is invoking the Mellanox driv

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Same result. (It works through 102 processes but not beyond that.) Input: mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname Output: [Gen2Node3:54348] [[18008,0],0] orted_cmd: received add_local_procs [Gen2Node3:54348] [[18008,0],0] orted_cmd: received exit cmd [Gen2Node3:54348] [[18008,
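
For reference, ppr:64:socket asks for 64 processes per socket; a quick way to see where mpirun is actually placing and binding ranks (a sketch assuming a recent Open MPI 4.x mpirun, which accepts --report-bindings) is:

  mpirun -np 128 --map-by ppr:64:socket --report-bindings hostname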

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
OK. Please try: mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname Josh On Tue, Jan 28, 2020 at 12:49 PM Collin Strassburger < cstrassbur...@bihrle.com> wrote: > Input: mpirun -np 128 --debug-daemons -mca plm rsh hostname > > > > Output: > > [Gen2Node3:54039] [[16643,0],0] orted_c

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Input: mpirun -np 128 --debug-daemons -mca plm rsh hostname Output: [Gen2Node3:54039] [[16643,0],0] orted_cmd: received add_local_procs [Gen2Node3:54039] [[16643,0],0] orted_cmd: received exit cmd [Gen2Node3:54039] [[16643,0],0] orted_cmd: all routes and children gone - exiting --

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Sorry, typo, try: mpirun -np 128 --debug-daemons -mca plm rsh hostname Josh On Tue, Jan 28, 2020 at 12:45 PM Joshua Ladd wrote: > And if you try: > mpirun -np 128 --debug-daemons -plm rsh hostname > > Josh > > On Tue, Jan 28, 2020 at 12:34 PM Collin Strassburger < > cstrassbur...@bihrle.com> w

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
And if you try: mpirun -np 128 --debug-daemons -plm rsh hostname Josh On Tue, Jan 28, 2020 at 12:34 PM Collin Strassburger < cstrassbur...@bihrle.com> wrote: > Input: mpirun -np 128 --debug-daemons hostname > > > > Output: > > [Gen2Node3:54023] [[16659,0],0] orted_cmd: received add_local_procs

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Input: mpirun -np 128 --debug-daemons hostname Output: [Gen2Node3:54023] [[16659,0],0] orted_cmd: received add_local_procs [Gen2Node3:54023] [[16659,0],0] orted_cmd: received exit cmd [Gen2Node3:54023] [[16659,0],0] orted_cmd: all routes and children gone - exiting -

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Interesting. Can you try: mpirun -np 128 --debug-daemons hostname Josh On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger < cstrassbur...@bihrle.com> wrote: > In relation to the multi-node attempt, I haven’t set that up yet as > the per-node configuration doesn’t pass its tests (full node

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
In relation to the multi-node attempt, I haven’t set that up yet, as the per-node configuration doesn’t pass its tests (full node utilization, etc). Here are the results for the hostname test: Input: mpirun -np 128 hostname Output: -

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Josh - if you read through the thread, you will see that disabling Mellanox/IB drivers allows the program to run. It only fails when they are enabled. On Jan 28, 2020, at 8:49 AM, Joshua Ladd <jladd.m...@gmail.com> wrote: I don't see how this can be diagnosed as a "problem with the Mellano

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Also, can you try running: mpirun -np 128 hostname Josh On Tue, Jan 28, 2020 at 11:49 AM Joshua Ladd wrote: > I don't see how this can be diagnosed as a "problem with the Mellanox > Software". This is on a single node. What happens when you try to launch on > more than one node? > > Josh > > O

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
I don't see how this can be diagnosed as a "problem with the Mellanox Software". This is on a single node. What happens when you try to launch on more than one node? Josh On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger < cstrassbur...@bihrle.com> wrote: > Here’s the I/O for these high local

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Here’s the I/O for these high local core count runs. (“xhpcg” is the standard HPCG benchmark) Run command: mpirun -np 128 bin/xhpcg Output: -- mpirun was unable to start the specified application as it encountered an error:

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Can you send the output of a failed run, including your command line? Josh On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users < users@lists.open-mpi.org> wrote: > Okay, so this is a problem with the Mellanox software - copying Artem. > > On Jan 28, 2020, at 8:15 AM, Collin Strassburger > w

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Okay, so this is a problem with the Mellanox software - copying Artem. On Jan 28, 2020, at 8:15 AM, Collin Strassburger <cstrassbur...@bihrle.com> wrote: I just tried that and it does indeed work with pbs and without Mellanox (until a reboot makes it complain about Mellanox/IB related de

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
I just tried that and it does indeed work with pbs and without Mellanox (until a reboot makes it complain about Mellanox/IB related defaults as no drivers were installed, etc). After installing the Mellanox drivers, I used ./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx
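
For reference, one way to confirm which components the resulting build actually picked up (a sketch assuming the ompi_info binary from this installation is first in PATH) is to query ompi_info:

  ompi_info | grep -i ucx      # UCX components should appear if --with-ucx took effect
  ompi_info | grep ': tm'      # the tm (Torque/PBS) launch components should appear if --with-tm took effect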

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Does it work with pbs but not Mellanox? Just trying to isolate the problem. On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote: Hello, I have done some additional testing and I can say that it works correctly with gcc8 and no Mellanox or pbs in

Re: [OMPI users] OpenMPI 4.0.2 with PGI 19.10, will not build with hcoll

2020-01-28 Thread Ray Muno via users
I opened a case with pgroup support regarding this. We are also using Slurm along with HCOLL. -Ray Muno On 1/26/20 5:52 AM, Åke Sandgren via users wrote: Note that when built against SLURM it will pick up pthread from libslurm.la too. On 1/26/20 4:37 AM, Gilles Gouaillardet via users wrote:
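
Regarding the libslurm.la remark: a libtool .la file records its transitive link dependencies in its dependency_libs field, so that is one place to check what a SLURM-linked build will pull in (a sketch; the path below is hypothetical and depends on where SLURM is installed):

  grep dependency_libs /usr/lib64/libslurm.la   # hypothetical path; lists -lpthread etc. if recorded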

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Hello, I have done some additional testing and I can say that it works correctly with gcc8 and no Mellanox or pbs installed. I have done two runs with Mellanox and pbs installed. One run includes the actual run options I will be using while the other includes a truncated set which still co