That's fine. But any pointers on where to start would be helpful and appreciated.

On 6/2/2015 10:15 AM, Jeff Squyres (jsquyres) wrote:
Steve --

I think that this falls directly in your purview, since you volunteered to 
maintain the openib BTL (this HCA ID thing is part of the openib BTL 
bootstrapping).  :-)


On Jun 2, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:


On Jun 2, 2015, at 7:10 AM, Steve Wise <sw...@opengridcomputing.com> wrote:

On 6/1/2015 9:51 PM, Ralph Castain wrote:
I’m wondering if it is also possible that the error message is simply printing 
that ID incorrectly. Looking at the code, it appears that we do perform the 
network byte translation correctly when we set up the data for transmission 
between the processes. However, I don’t see that translation being done before 
we print the error message. Hence, I think the error message is printing out 
the device ID incorrectly - and the problem truly is just that the queues are 
different.

Does the code convert the device ID/part number into HBO before looking it up 
in the .ini file?
All I could see was that it is converted to NBO for transmission, and to HBO at 
the remote end for use.  So both sides should have accurate IDs. I don’t know 
what happens beyond that, I’m afraid - this isn’t my particular code area.
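
For concreteness, here is a minimal sketch of the byte-order handling being described. The struct and helper names are hypothetical, not the actual openib BTL code; the point is just that a value packed with htonl() has to be unpacked with ntohl() before it is printed or compared, and an error path that prints the raw wire field on a little-endian host will show the byte-swapped number.

```c
/* Illustrative sketch only -- not the openib BTL implementation. */
#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>

/* Hypothetical wire format for the device identification. */
struct hca_id_msg {
    uint32_t vendor_id;      /* stored in network byte order */
    uint32_t vendor_part_id; /* stored in network byte order */
};

static void pack_hca_id(struct hca_id_msg *msg,
                        uint32_t vendor_id, uint32_t part_id)
{
    msg->vendor_id      = htonl(vendor_id);   /* HBO -> NBO for transmission */
    msg->vendor_part_id = htonl(part_id);
}

static void unpack_hca_id(const struct hca_id_msg *msg,
                          uint32_t *vendor_id, uint32_t *part_id)
{
    *vendor_id = ntohl(msg->vendor_id);       /* NBO -> HBO before use */
    *part_id   = ntohl(msg->vendor_part_id);
}

int main(void)
{
    struct hca_id_msg msg;
    uint32_t vid, pid;

    pack_hca_id(&msg, 0x1425, 0x5401);        /* Chelsio T5 values from the .ini */
    unpack_hca_id(&msg, &vid, &pid);
    printf("converted: vendor 0x%04x part 0x%04x\n",
           (unsigned)vid, (unsigned)pid);

    /* If an error message prints the raw wire field without ntohl(), a
     * little-endian host shows the byte-swapped value instead. */
    printf("raw field: part %u (0x%08x)\n",
           (unsigned)msg.vendor_part_id, (unsigned)msg.vendor_part_id);
    return 0;
}
```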

Assuming atlas3 is just displaying the vendor and part numbers w/o converting 
to HBO, they do look correct.  Part ID 21505 is 0x5401, and part ID 22282240 is 
0x5401 byte-swapped:

[root@atlas3 openmpi]# echo $((0x5401))
21505
[root@atlas3 openmpi]# echo $((0x01540000))
22282240
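
A quick way to double-check that the two numbers differ only by byte order (illustrative only; on the little-endian atlas3 node, ntohl() undoes the swap):

```c
/* Sketch: 22282240 == 0x01540000, i.e. 0x00005401 with its bytes reversed. */
#include <assert.h>
#include <stdio.h>
#include <arpa/inet.h>

int main(void)
{
    unsigned printed = 22282240;          /* value shown in the error message */
    assert(ntohl(printed) == 21505);      /* holds on a little-endian host */
    printf("0x%08x byte-swapped is 0x%08x\n",
           printed, (unsigned)ntohl(printed));
    return 0;
}
```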

Looking at the .ini on both nodes, however, I see valid and identical entries 
for device 0x1425/0x5401:

[root@ppc64-rhel71 openmpi]# grep -3 0x5401 *ini

[Chelsio T5]
vendor_id = 0x1425
vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64

[root@atlas3 openmpi]# grep -3 0x5401 *ini

[Chelsio T5]
vendor_id = 0x1425
vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64

So I still think that somehow one node is looking up the wrong entry in the 
.ini file.
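
To make that concrete, here is a minimal sketch (hypothetical data structures, not the actual openib BTL lookup code) of why a byte-swapped part ID would produce exactly this symptom: the lookup is keyed on the (vendor_id, part_id) pair, so 0x01540000 matches no [Chelsio T5] row and that node falls back to its default queue settings.

```c
/* Illustrative sketch of a vendor/part-ID keyed .ini lookup. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct ini_entry {
    uint32_t    vendor_id;
    uint32_t    part_id;
    const char *receive_queues;
};

/* One row per vendor_part_id listed in the [Chelsio T5] section (abridged). */
static const struct ini_entry table[] = {
    { 0x1425, 0x5400, "P,65536,64" },
    { 0x1425, 0x5401, "P,65536,64" },
    { 0x1425, 0x5402, "P,65536,64" },
    /* ... remaining T5 part IDs ... */
};

/* Returns the matching receive_queues spec, or NULL when no row matches
 * and the node would fall back to its built-in defaults. */
static const char *lookup_receive_queues(uint32_t vendor, uint32_t part)
{
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (table[i].vendor_id == vendor && table[i].part_id == part) {
            return table[i].receive_queues;
        }
    }
    return NULL;
}

int main(void)
{
    const char *good = lookup_receive_queues(0x1425, 0x5401);
    const char *bad  = lookup_receive_queues(0x1425, 0x01540000);

    printf("0x5401     -> %s\n", good ? good : "(no match)");
    printf("0x01540000 -> %s  <- falls back to defaults\n",
           bad ? bad : "(no match)");
    return 0;
}
```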

Also:  Attached are the ompi_info outputs and a diff of the two.

Steve.



On Jun 1, 2015, at 7:30 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

This is not a heterogeneous run-time issue -- it's the issue that Nathan cited: 
OMPI detected different receive queue setups on different machines.

As the error message states, the openib BTL simply cannot handle it when 
different MPI processes specify different receive queue specifications.
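
To illustrate the consistency requirement being described (hypothetical function, not the actual openib BTL check): during connection setup each process compares its own receive_queues specification with the one its peer advertised, and refuses the connection if they differ.

```c
/* Illustrative sketch of a peer receive_queues consistency check. */
#include <stdio.h>
#include <string.h>

static int check_receive_queues(const char *mine, const char *peers)
{
    if (strcmp(mine, peers) != 0) {
        fprintf(stderr,
                "receive_queues mismatch: local \"%s\" vs remote \"%s\"\n",
                mine, peers);
        return -1;   /* connection refused */
    }
    return 0;
}

int main(void)
{
    /* Both nodes read the same [Chelsio T5] entry: OK. */
    check_receive_queues("P,65536,64", "P,65536,64");

    /* One node falls back to a different spec: this is the failure mode. */
    check_receive_queues("P,65536,64", "P,128,256,192,128");
    return 0;
}
```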

You mentioned that the device ID is being incorrectly identified: is that 
OMPI's fault, or something wrong with the device itself?



On Jun 1, 2015, at 6:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:

On 6/1/2015 9:53 AM, Ralph Castain wrote:
Well, I checked and it looks to me like --hetero-apps is a stale option, in the 
master at least - I don't see where it gets used.

Looking at the code, I would suspect that something didn't get configured 
correctly - either the --enable-heterogeneous flag didn't get set on one side, 
or we incorrectly failed to identify the BE (big-endian) machine, or both. You 
might run ompi_info on the two sides and verify they both were built correctly.
We'll check ompi_info...

Thanks!

Steve.
--
Jeff Squyres
jsquy...@cisco.com

<atlas3_ompi_info.txt><diff.txt><ppc64_ompi_info.txt>

