On 6/2/2015 10:04 AM, Ralph Castain wrote:
On Jun 2, 2015, at 7:10 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
On 6/1/2015 9:51 PM, Ralph Castain wrote:
I’m wondering if it is also possible that the error message is
simply printing that ID incorrectly. Looking at the code, it appears
that we do perform the network byte translation correctly when we
set up the data for transmission between the processes. However, I
don’t see that translation being done before we print the error
message. Hence, I think the error message is printing out the device
ID incorrectly - and the problem truly is just that the queues are
different.
Does the code convert the device id/part number into HBO before
looking it up in the .ini file?
All I could see was that it is converted to NBO for transmission, and
to HBO at the remote end for use. So both sides should have accurate
IDs. I don’t know what happens beyond that, I’m afraid - this isn’t my
particular code area.
That makes 2 of us :)
Where is this code located in the ompi tree?
Are there any verbose parameters that will help show more detail on how
it is searching the .ini file?
Assuming atlas3 is just displaying the vendor and part numbers w/o
converting to HBO, they do look correct: part ID 21505 is 0x5401,
and part ID 22282240 is 0x5401 byte-swapped:
[root@atlas3 openmpi]# echo $((0x5401))
21505
[root@atlas3 openmpi]# echo $((0x01540000))
22282240
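To make the byte-order hypothesis concrete, here is a minimal standalone sketch (not the actual openib BTL code, and assuming atlas3 is a little-endian x86 node): a part ID left in network byte order and printed raw shows up as 22282240, exactly the number in the error message, while ntohl() restores 21505.

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

int main(void)
{
    uint32_t part_id = 0x5401;           /* Chelsio T5 part ID in host byte order */
    uint32_t wire    = htonl(part_id);   /* as it would travel between processes */

    printf("host order          : %u (0x%x)\n", part_id, part_id);         /* 21505 */
    printf("wire value, raw     : %u (0x%x)\n", wire, wire);                /* 22282240 on little-endian */
    printf("converted with ntohl: %u (0x%x)\n", ntohl(wire), ntohl(wire));  /* 21505 again */
    return 0;
}

On the big-endian ppc64 node htonl() is a no-op, so only the little-endian side would ever display the swapped value.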
Looking at the .ini on both nodes, however, I see valid and identical
entries for device 0x1425/0x5401:
[root@ppc64-rhel71 openmpi]# grep -3 0x5401 *ini
[Chelsio T5]
vendor_id = 0x1425
vendor_part_id =
0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64
[root@atlas3 openmpi]# grep -3 0x5401 *ini
[Chelsio T5]
vendor_id = 0x1425
vendor_part_id =
0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64
So I still think somehow the one node is looking up the wrong entry
in the .ini file.
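As a rough illustration of why that would produce the queue mismatch (the names below are hypothetical, not the actual mca_btl_openib ini-parsing code): a rank that looks up the correct host-byte-order part ID matches the [Chelsio T5] section and gets receive_queues = P,65536,64, while a byte-swapped ID misses the list entirely, so that rank would fall back to a default receive_queues value - exactly the disagreement the error message reports.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Part IDs from the [Chelsio T5] vendor_part_id line, in host byte order
 * (hypothetical sketch, not the real Open MPI .ini parser). */
static const uint32_t t5_part_ids[] = {
    0xb000, 0xb001, 0x5400, 0x5401, 0x5402, 0x5403, 0x5404, 0x5405,
    0x5406, 0x5407, 0x5408, 0x5409, 0x540a, 0x540b, 0x540c, 0x540d,
    0x540e, 0x540f, 0x5410, 0x5411, 0x5412, 0x5413,
};

static bool matches_t5_section(uint32_t vendor_id, uint32_t part_id)
{
    if (vendor_id != 0x1425) {   /* Chelsio vendor ID from the .ini */
        return false;
    }
    for (size_t i = 0; i < sizeof(t5_part_ids) / sizeof(t5_part_ids[0]); ++i) {
        if (t5_part_ids[i] == part_id) {
            return true;
        }
    }
    return false;
}

int main(void)
{
    /* Correct host-byte-order ID matches the section. */
    printf("0x5401     -> %s\n", matches_t5_section(0x1425, 0x5401) ? "match" : "no match");
    /* The byte-swapped ID (22282240) misses it, so that rank would get default queues. */
    printf("0x01540000 -> %s\n", matches_t5_section(0x1425, 0x01540000) ? "match" : "no match");
    return 0;
}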
Also: Attached are the ompi-info outputs and a diff of the two.
Steve.
On Jun 1, 2015, at 7:30 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
This is not a heterogeneous run-time issue -- it's the issue that
Nathan cited: that OMPI detected different receive queue setups on
different machines.
As the error message states, the openib BTL simply cannot handle it
when different MPI processes specify different receive queue
specifications.
You mentioned that the device ID is being incorrectly identified:
is that OMPI's fault, or something wrong with the device itself?
On Jun 1, 2015, at 6:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
On 6/1/2015 9:53 AM, Ralph Castain wrote:
Well, I checked and it looks to me like --hetero-apps is a stale
option, in the master at least - I don’t see where it gets used.
Looking at the code, I would suspect that something didn’t get
configured correctly - either the --enable-heterogeneous flag
didn’t get set on one side, or we incorrectly failed to identify
the BE machine, or both. You might run ompi_info on the two sides
and verify they both were built correctly.
We'll check ompi_info...
Thanks!
Steve.
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
<atlas3_ompi_info.txt> <diff.txt> <ppc64_ompi_info.txt>