Thanks to Jeff, we now have a bug registered for the segv issue.

Now we are moving on to testing with Myrinet hardware onboard, but we
are running into some unexpected issues.

1. If we don't specify which BTL to use, my understanding is that Open
MPI should pick up the mx BTL when openib is not available. However,
that doesn't appear to be the case: it falls back to tcp, as the
following example shows (a verbosity run we plan to try is sketched
after the output).

[y...@n0026.hbar]$ mpirun -np 2 -H n0026.hbar,n0027.hbar ./osu_latency
CMA: no RDMA devices found
--------------------------------------------------------------------------
[[59254,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: n0026.hbar

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
CMA: no RDMA devices found
# OSU MPI Latency Test (Version 2.0)
# Size      Latency (us)
0           45.86
1           46.57
2           46.61
4           46.79
...
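
In case it helps with diagnosing this, the next thing we were going to
try is turning up the BTL selection verbosity to see why mx is being
excluded; a sketch of that run (the verbosity level of 100 is just an
arbitrary "chatty enough" value):

[y...@n0026.hbar]$ mpirun -np 2 -H n0026.hbar,n0027.hbar -mca btl_base_verbose 100 ./osu_latency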

2. If we explicitly specify the mx BTL, it emits an error, although the
run seems to complete fine after that (a run forcing the ob1 PML is
sketched after the output).

[y...@n0026.hbar]$ mpirun -np 2 -H n0026.hbar,n0027.hbar -mca btl mx,sm,self ./osu_latency
mx_finalize() called while some endpoints are still open.
[n0026.hbar:02045] Error in mx_finalize (error Busy)
mx_finalize() called while some endpoints are still open.
[n0027.hbar:02175] Error in mx_finalize (error Busy)
# OSU MPI Latency Test (Version 2.0)
# Size      Latency (us)
0           3.85
1           4.27
2           4.27
4           4.31
...
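
Following Jeff's earlier suggestion about the PML, we were also going
to check whether the mx_finalize noise goes away when the ob1 PML is
forced explicitly; this is only a guess on our part that the cm PML
side is the one leaving an MX endpoint open:

[y...@n0026.hbar]$ mpirun -np 2 -H n0026.hbar,n0027.hbar -mca pml ob1 -mca btl mx,sm,self ./osu_latency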

3. If we disable the openib BTL, the behavior falls back to that of #2
(an ompi_info check is sketched after the output).

[y...@n0026.hbar]$ mpirun -np 2 -H n0026.hbar,n0027.hbar -mca btl ^openib ./osu_latency
mx_finalize() called while some endpoints are still open.
[n0026.hbar:02051] Error in mx_finalize (error Busy)
mx_finalize() called while some endpoints are still open.
[n0027.hbar:02198] Error in mx_finalize (error Busy)
# OSU MPI Latency Test (Version 2.0)
# Size      Latency (us)
0           3.84
1           4.24
2           4.25
4           4.29
...
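
To double-check which MX components this build actually contains (the
BTL and possibly the MTL as well), something like the following should
list them; we haven't run it on these nodes yet:

[y...@n0026.hbar]$ ompi_info | grep mx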

Note that this is also different behavior from the old gm BTL: we have
always had both BTLs built in, and the proper one would be picked up
based on which device was actually onboard. Can anybody tell whether
this is also a bug, or whether there are parameters we can set so that
the selection happens automatically? Also, that "mx_finalize" error
message doesn't look right either.
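
In the meantime, if excluding openib on these nodes is the expected
workaround, we would presumably put it into the MCA parameter file
rather than on every command line; a sketch, assuming the default
per-user location:

# $HOME/.openmpi/mca-params.conf on the Myrinet nodes
btl = ^openib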

Thanks,

Yong

On Fri, Jun 15, 2012 at 6:41 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> On Jun 11, 2012, at 7:48 PM, Yong Qin wrote:
>
>> ah, I guess my original understanding of PML was wrong. Adding "-mca
>> pml ob1" does help to ease the problem.
>
> See the README for a little more discussion about this issue.  There can only 
> be 1 PML in use by a given MPI job -- using "--mca pml ob1" forces the use of 
> the "ob1" PML (i.e., the BTLs), as opposed to the "cm" MTL (i.e., the MTLs).
>
>> But the question still
>> remains. Why ompi decided to use the mx BTL in the first place, given
>> there's no physical device onboard at all? This behavior is completely
>> different than the original gm BTL.
>
> That's not what is actually happening.
>
> Open MPI was *built* with MX support, and it therefore assumes that you will
> likely want to use it.  So it *warns* you when there is no MX device 
> available.
>
> That being said, I have recently run into the issue you are seeing: if OMPI 
> 1.6 warns you that there is no high-speed device available (openib in my 
> case), it then segv's (which it obviously shouldn't -- it should warn and 
> then die gracefully).  I'll open a ticket on this behavior.  It's not a 
> common scenario, but we still shouldn't segv.
>
> My first guess is that this has something to do with the memory manager... 
> but that's a guess.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
