Wonderful! I am happy to confirm that this resolves the issue!
Many thanks to everyone for their assistance,
Collin
Okay, that nailed it down - the problem is the number of open file descriptors
is exceeding your system limit. I suspect the connection to the Mellanox
drivers is solely due to it also having some descriptors open, and you are just
close enough to the boundary that it causes you to hit it.
See
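(For reference, a minimal sketch of how to check and raise the per-process open-file limit on a typical Linux node; the 4096/8192 values are only illustrative, not taken from this thread:)
ulimit -n        # current soft limit on open file descriptors for this shell
ulimit -Hn       # current hard limit
ulimit -n 4096   # raise the soft limit for this session, up to the hard limit
# To make it persistent, add nofile entries to /etc/security/limits.conf, e.g.
#   *   soft   nofile   4096
#   *   hard   nofile   8192
# and make sure mpirun is launched from a session that has picked the new limit up.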
I agree that it is odd that the issue does not appear until after the Mellanox
drivers have been installed (and the configure flags set to use them). As
requested, here are the results:
Input: mpirun -np 128 --mca odls_base_verbose 10 --mca state_base_verbose 10
hostname
Output:
[Gen2Node3:54
Okay, debug-daemons isn't going to help as we aren't launching any daemons.
This is all one node. So try adding "--mca odls_base_verbose 10 --mca
state_base_verbose 10" to the cmd line and let's see what is going on.
I agree with Josh - neither mpirun nor hostname is invoking the Mellanox drivers.
Same result. (It works through 102 but not with anything greater than that.)
Input: mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname
Output:
[Gen2Node3:54348] [[18008,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54348] [[18008,0],0] orted_cmd: received exit cmd
[Gen2Node3:54348] [[18008,
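(A rough back-of-the-envelope that fits this 102/103 boundary, assuming - and this is only an assumption, not something measured in this thread - that mpirun holds on the order of ten descriptors per local process for stdio pipes and control connections: 102 x ~10 is about 1020, just under the common default soft limit of 1024 open files, while anything higher pushes past it. That is consistent with the descriptor-limit diagnosis above.)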
OK. Please try:
mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname
Josh
On Tue, Jan 28, 2020 at 12:49 PM Collin Strassburger <cstrassbur...@bihrle.com> wrote:
Input: mpirun -np 128 --debug-daemons -mca plm rsh hostname
Output:
[Gen2Node3:54039] [[16643,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54039] [[16643,0],0] orted_cmd: received exit cmd
[Gen2Node3:54039] [[16643,0],0] orted_cmd: all routes and children gone -
exiting
--
Sorry, typo, try:
mpirun -np 128 --debug-daemons -mca plm rsh hostname
Josh
On Tue, Jan 28, 2020 at 12:45 PM Joshua Ladd wrote:
And if you try:
mpirun -np 128 --debug-daemons -plm rsh hostname
Josh
On Tue, Jan 28, 2020 at 12:34 PM Collin Strassburger <cstrassbur...@bihrle.com> wrote:
Input: mpirun -np 128 --debug-daemons hostname
Output:
[Gen2Node3:54023] [[16659,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54023] [[16659,0],0] orted_cmd: received exit cmd
[Gen2Node3:54023] [[16659,0],0] orted_cmd: all routes and children gone -
exiting
-
Interesting. Can you try:
mpirun -np 128 --debug-daemons hostname
Josh
On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger <cstrassbur...@bihrle.com> wrote:
In relation to the multi-node attempt, I haven’t set that up yet, as the per-node configuration doesn’t pass its tests (full node utilization, etc.).
Here are the results for the hostname test:
Input: mpirun -np 128 hostname
Output:
-
Josh - if you read thru the thread, you will see that disabling Mellanox/IB
drivers allows the program to run. It only fails when they are enabled.
Also, can you try running:
mpirun -np 128 hostname
Josh
On Tue, Jan 28, 2020 at 11:49 AM Joshua Ladd wrote:
I don't see how this can be diagnosed as a "problem with the Mellanox
Software". This is on a single node. What happens when you try to launch on
more than one node?
Josh
On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <cstrassbur...@bihrle.com> wrote:
Here’s the I/O for these high local core count runs. (“xhpcg” is the standard
hpcg benchmark)
Run command: mpirun -np 128 bin/xhpcg
Output:
--
mpirun was unable to start the specified application as it encountered an
error:
Can you send the output of a failed run, including your command line?
Josh
On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:
Okay, so this is a problem with the Mellanox software - copying Artem.
On Jan 28, 2020, at 8:15 AM, Collin Strassburger <cstrassbur...@bihrle.com> wrote:
I just tried that and it does indeed work with pbs and without Mellanox (until
a reboot makes it complain about Mellanox/IB related defaults as no drivers
were installed, etc).
After installing the Mellanox drivers, I used
./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx
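(As a quick sanity check after a build configured this way, ompi_info can show whether the ucx and tm components actually got compiled in; a sketch, assuming the freshly installed ompi_info is first in PATH:)
ompi_info | grep -i ucx       # expect pml/ucx (and usually osc/ucx) components to be listed
ompi_info | grep ' plm: tm'   # expect the tm (PBS/Torque) launcher component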
Does it work with pbs but not Mellanox? Just trying to isolate the problem.
On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:
Hello,
I have done some additional testing and I can say that it works correctly with gcc8 and no mellanox or pbs installed.
I opened a case with pgroup support regarding this.
We are also using Slurm along with HCOLL.
-Ray Muno
On 1/26/20 5:52 AM, Åke Sandgren via users wrote:
Note that when built against SLURM it will pick up pthread from
libslurm.la too.
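(One way to see where that dependency comes from, as a sketch - the .la path below is a guess and varies by distribution:
grep dependency_libs /usr/lib64/libslurm.la
The -lpthread entry in that dependency_libs line is what libtool propagates into anything linked against libslurm through the .la file.)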
Hello,
I have done some additional testing and I can say that it works correctly with
gcc8 and no mellanox or pbs installed.
I have done two runs with Mellanox and pbs installed. One run includes the
actual run options I will be using while the other includes a truncated set
which still co