Aha ... looking at "ibv_devinfo -v" got me my first concrete hint of what's
going on. On a node that's working fine (w2), under port 1 there is a line:
LinkLayer: InfiniBand
On a node that is having trouble (w3), that line is not present. The
question is why this inconsistency occurs.
I don't see ...
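For anyone hitting the same symptom, the comparison above can be scripted. A minimal sketch, assuming you capture "ibv_devinfo -v" output per node (the two sample snippets below are fabricated stand-ins for the good node w2 and the bad node w3; on real hosts you would pipe the actual command output into the same check):

```shell
# Fabricated stand-ins for one port section of "ibv_devinfo -v" output.
good="port: 1
        state: PORT_ACTIVE (4)
        link_layer: InfiniBand"
bad="port: 1
        state: PORT_ACTIVE (4)"

check_link_layer() {
    # $1: captured ibv_devinfo output for one port; reports whether
    # the link-layer line the thread is looking for is present.
    if printf '%s\n' "$1" | grep -qi 'link_layer'; then
        echo "link_layer present"
    else
        echo "link_layer missing"
    fi
}

check_link_layer "$good"   # prints: link_layer present
check_link_layer "$bad"    # prints: link_layer missing
```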
Could you please attach the output of "ibv_devinfo -v" and "ofed_info -s"?
Thx
On Sat, Jun 7, 2014 at 12:53 AM, Tim Miller wrote:
> [...]
Hi Josh,
I asked one of our more advanced users to add the "-mca btl_openib_if_include
mlx4_0:1" argument to his job script. Unfortunately, the same error
occurred as before.
We'll keep digging on our end; if you have any other suggestions, please
let us know.
Tim
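For reference, a full invocation with the interface restriction discussed here might look like the sketch below. The hostfile name, process count, and executable are illustrative placeholders, not from the thread; "mlx4_0:1" selects port 1 of the HCA named mlx4_0:

```shell
# Restrict the openib BTL to port 1 of the mlx4_0 HCA.
# "hosts.txt", "-np 32", and "./my_app" are illustrative placeholders.
mpirun -np 32 -hostfile hosts.txt \
    -mca btl openib,self,sm \
    -mca btl_openib_if_include mlx4_0:1 \
    ./my_app
```

The explicit "-mca btl openib,self,sm" is optional; it just makes clear that the openib BTL is the one in use.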
On Thu, Jun 5, 2014 at 7:32 PM ...
Hi Josh,
Thanks for attempting to sort this out. In answer to your questions:
1. Node allocation is done by TORQUE, however we don't use the TM API to
launch jobs (long story). Instead, we just pass a hostfile to mpirun, and
mpirun uses the ssh launcher to actually communicate and launch the
processes.
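Since the launch path came up: a minimal sketch of the hostfile-plus-ssh launch described above, assuming passwordless ssh between nodes (node names follow the w* convention from the thread; slot counts and the application name are illustrative):

```shell
# hosts.txt -- one line per node; "slots" caps processes per node.
cat > hosts.txt <<'EOF'
w2 slots=8
w3 slots=8
EOF

# Without the TM API, mpirun reads the hostfile and falls back to its
# ssh/rsh launcher to start the remote daemons on each node.
mpirun -np 16 -hostfile hosts.txt ./my_app
```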
Strange indeed. This info (remote adapter info) is passed around in the
modex and the struct is locally populated during add procs.
1. How do you launch jobs? Mpirun, srun, or something else?
2. How many active ports do you have on each HCA? Are they all configured
to use IB?
3. Do you explicitly ...
Hi,
I'd like to revive this thread, since I am still periodically getting
errors of this type. I have built 1.8.1 with --enable-debug and run with
-mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide any
additional information that I can find useful. I've gone ahead and attached ...
Just wondering if you've tried with the latest stable OMPI, 1.8.1? I'm
wondering if this is an issue with the OOB. If you have a debug build, you
can run with "-mca btl_openib_verbose 10".
Josh
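For completeness, the debug workflow being suggested would look roughly like this. The install prefix, job size, and log file name are illustrative assumptions:

```shell
# Build OMPI 1.8.1 with debugging support (prefix is illustrative).
./configure --prefix=$HOME/ompi-1.8.1-dbg --enable-debug
make -j8 && make install

# Rerun the failing job with openib verbosity turned up, capturing
# both stdout and stderr for later inspection.
mpirun -np 16 -hostfile hosts.txt \
    -mca btl_openib_verbose 10 ./my_app 2>&1 | tee openib.log
```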
On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd wrote:
> [...]
I've checked the links repeatedly with "ibstatus" and they look OK. Both
nodes show a link layer of "InfiniBand".
As I stated, everything works well with MVAPICH2, so I don't suspect a
physical or link layer problem (but I could always be wrong on that).
Tim
On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd wrote:
Hi, Tim
Run "ibstat" on each host:
1. Make sure the adapters are alive and active.
2. Look at the Link Layer settings for host w34. Does it match host w4's?
Josh
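The comparison asked for in step 2 can be automated. A small sketch, assuming you capture "ibstat" output per host (the sample text below is a fabricated stand-in for one port of real output; on real hosts you would run the command over ssh and compare the extracted values):

```shell
# Fabricated stand-in for one port section of "ibstat" output.
sample="Port 1:
        State: Active
        Physical state: LinkUp
        Link layer: InfiniBand"

link_layer() {
    # $1: captured ibstat output; prints the "Link layer" value,
    # which should match across hosts (e.g. w34 vs. w4).
    printf '%s\n' "$1" | sed -n 's/^[[:space:]]*Link layer:[[:space:]]*//p'
}

link_layer "$sample"   # prints: InfiniBand
```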
On Fri, May 9, 2014 at 1:18 PM, Tim Miller wrote:
> [...]
Hi All,
We're using OpenMPI 1.7.3 with Mellanox ConnectX InfiniBand adapters, and
periodically our jobs abort at start-up with the following error:
===
Open MPI detected two different OpenFabrics transport types in the same
Infiniband network.
Such mixed network transport configuration is not supported ...
===