Re: [OMPI users] OPENIB unknown transport errors

2014-06-12 Thread Tim Miller
Aha ... looking at "ibv_devinfo -v" got me my first concrete hint of what's going on. On a node that's working fine (w2), under port 1 there is a line: "LinkLayer: InfiniBand". On a node that is having trouble (w3), that line is not present. The question is why this inconsistency occurs. I don't se
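
The per-node comparison described above could be scripted roughly as follows. This is only a sketch, not something taken from the thread: it assumes you have already captured the text output of "ibv_devinfo -v" from each node, that ports appear as "port: N" lines, and that the link layer line is spelled either "LinkLayer:" (as quoted above) or "link_layer:". The function names are hypothetical, and an HCA with the same port number on several devices is not handled.

```python
import re

# Assumption: ports are introduced by a line such as "port: 1".
PORT_RE = re.compile(r"^\s*port:\s*(\d+)\s*$", re.IGNORECASE)
# Assumption: accepts "LinkLayer:" (as quoted in the message) or "link_layer:".
LINK_RE = re.compile(r"^\s*link_?layer:\s*(\S+)", re.IGNORECASE)


def link_layers(devinfo_text):
    """Map each port number to its reported link layer, or None if no such line."""
    layers = {}
    current_port = None
    for line in devinfo_text.splitlines():
        port_match = PORT_RE.match(line)
        if port_match:
            current_port = int(port_match.group(1))
            layers[current_port] = None
            continue
        link_match = LINK_RE.match(line)
        if link_match and current_port is not None:
            layers[current_port] = link_match.group(1)
    return layers


def ports_missing_link_layer(outputs_by_node):
    """Return {node: [ports]} for nodes where some port has no link layer line."""
    missing = {}
    for node, text in outputs_by_node.items():
        ports = [p for p, layer in sorted(link_layers(text).items()) if layer is None]
        if ports:
            missing[node] = ports
    return missing
```

Feeding it a dictionary of node name to captured output (for example, one entry each for w2 and w3) returns only the nodes whose ports lack the line, which makes the inconsistency easier to spot across a larger cluster.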

Re: [OMPI users] OPENIB unknown transport errors

2014-06-07 Thread Mike Dubman
could you please attach output of "ibv_devinfo -v" and "ofed_info -s" Thx On Sat, Jun 7, 2014 at 12:53 AM, Tim Miller wrote: > Hi Josh, > > I asked one of our more advanced users to add the "-mca btl_openib_if_include > mlx4_0:1" argument to his job script. Unfortunately, the same error > occur

Re: [OMPI users] OPENIB unknown transport errors

2014-06-06 Thread Tim Miller
Hi Josh, I asked one of our more advanced users to add the "-mca btl_openib_if_include mlx4_0:1" argument to his job script. Unfortunately, the same error occurred as before. We'll keep digging on our end; if you have any other suggestions, please let us know. Tim On Thu, Jun 5, 2014 at 7:32 P

Re: [OMPI users] OPENIB unknown transport errors

2014-06-05 Thread Tim Miller
Hi Josh, Thanks for attempting to sort this out. In answer to your questions: 1. Node allocation is done by TORQUE; however, we don't use the TM API to launch jobs (long story). Instead, we just pass a hostfile to mpirun, and mpirun uses the ssh launcher to actually communicate and launch the proc

Re: [OMPI users] OPENIB unknown transport errors

2014-06-05 Thread Joshua Ladd
Strange indeed. This info (remote adapter info) is passed around in the modex and the struct is locally populated during add procs. 1. How do you launch jobs? Mpirun, srun, or something else? 2. How many active ports do you have on each HCA? Are they all configured to use IB? 3. Do you explicitly

Re: [OMPI users] OPENIB unknown transport errors

2014-06-04 Thread Tim Miller
Hi, I'd like to revive this thread, since I am still periodically getting errors of this type. I have built 1.8.1 with --enable-debug and run with -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide any additional information that I can find useful. I've gone ahead and attached

Re: [OMPI users] OPENIB unknown transport errors

2014-05-09 Thread Joshua Ladd
Just wondering if you've tried with the latest stable OMPI, 1.8.1? I'm wondering if this is an issue with the OOB. If you have a debug build, you can run -mca btl_openib_verbose 10 Josh On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd wrote: > Hi, Tim > > Run "ibstat" on each host: > > 1. Make sure

Re: [OMPI users] OPENIB unknown transport errors

2014-05-09 Thread Tim Miller
I've checked the links repeatedly with "ibstatus" and they look OK. Both nodes show a link layer of "InfiniBand". As I stated, everything works well with MVAPICH2, so I don't suspect a physical or link layer problem (but I could always be wrong on that). Tim On Fri, May 9, 2014 at 6:26 PM, Josh

Re: [OMPI users] OPENIB unknown transport errors

2014-05-09 Thread Joshua Ladd
Hi, Tim Run "ibstat" on each host: 1. Make sure the adapters are alive and active. 2. Look at the Link Layer settings for host w34. Does it match host w4's? Josh On Fri, May 9, 2014 at 1:18 PM, Tim Miller wrote: > Hi All, > > We're using OpenMPI 1.7.3 with Mellanox ConnectX InfiniBand adap

[OMPI users] OPENIB unknown transport errors

2014-05-09 Thread Tim Miller
Hi All, We're using OpenMPI 1.7.3 with Mellanox ConnectX InfiniBand adapters, and periodically our jobs abort at start-up with the following error: === Open MPI detected two different OpenFabrics transport types in the same Infiniband network. Such mixed network trasport configuration is not supp