On Jun 2, 2015, at 7:10 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
On 6/1/2015 9:51 PM, Ralph Castain wrote:
I’m wondering if it is also possible that the error message is simply printing
that ID incorrectly. Looking at the code, it appears that we do perform the
network byte translation correctly when we set up the data for transmission
between the processes. However, I don’t see that translation being done before
we print the error message. Hence, I think the error message is printing the
device ID incorrectly, and the problem truly is just that the queues are
different.
Does the code convert the device ID/part number into host byte order (HBO)
before looking it up in the .ini file?
Assuming atlas3 is just displaying the vendor and part numbers without
converting to HBO, they do look correct. Part ID 21505 is 0x5401, and part ID
22282240 is 0x5401 byte-swapped:
[root@atlas3 openmpi]# echo $((0x5401))
21505
[root@atlas3 openmpi]# echo $((0x01540000))
22282240
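To spell out the swap (a quick local check, not from the original session):
treating 0x00005401 as a 32-bit value and reversing its four bytes yields
exactly 0x01540000:

$ v=$((0x00005401))
$ printf '0x%08x\n' $(( ((v & 0xff) << 24) | ((v & 0xff00) << 8) | ((v >> 8) & 0xff00) | ((v >> 24) & 0xff) ))
0x01540000

That is the classic signature of a 32-bit value being read or printed in the
wrong byte order.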
Looking at the .ini on both nodes, however, I see valid and identical entries
for device 0x1425/0x5401:
[root@ppc64-rhel71 openmpi]# grep -3 0x5401 *ini
[Chelsio T5]
vendor_id = 0x1425
vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64
[root@atlas3 openmpi]# grep -3 0x5401 *ini
[Chelsio T5]
vendor_id = 0x1425
vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64
So I still think that one node is somehow looking up the wrong entry in the
.ini file.
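If one node really is matching the wrong entry, one way to test that theory (a
sketch, untested here) would be to bypass the .ini lookup and force the same
spec on both sides with the btl_openib_receive_queues MCA parameter:

$ mpirun --mca btl_openib_receive_queues P,65536,64 -np 2 -host ppc64-rhel71,atlas3 ./hello_mpi

(./hello_mpi is a placeholder for the actual test binary.) If the job then
runs, that points at the per-node .ini lookup rather than the transport itself.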
Also: attached are the ompi_info outputs and a diff of the two.
Steve.
On Jun 1, 2015, at 7:30 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
This is not a heterogeneous run-time issue -- it's the issue that Nathan cited:
that OMPI detected different receive queue setups on different machines.
As the error message states, the openib BTL simply cannot handle it when
different MPI processes specify different receive queue specifications.
You mentioned that the device ID is being incorrectly identified: is that
OMPI's fault, or something wrong with the device itself?
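One way to check (hypothetical commands, assuming libibverbs' ibv_devinfo is
installed on both nodes) is to ask the device directly, outside of OMPI:

$ ibv_devinfo | grep -E 'vendor_id|vendor_part_id'
	vendor_id:			0x1425
	vendor_part_id:			21505

If both nodes report the same values here, the device and driver agree, and any
byte swap would have to be happening inside OMPI's lookup/printing path.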
On Jun 1, 2015, at 6:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
On 6/1/2015 9:53 AM, Ralph Castain wrote:
Well, I checked, and it looks to me like --hetero-apps is a stale option, in
the master at least; I don’t see where it gets used.
Looking at the code, I would suspect that something didn’t get configured
correctly: either the --enable-heterogeneous flag didn’t get set on one side,
or we failed to identify the big-endian (BE) machine, or both. You might run
ompi_info on the two sides and verify that they were both built correctly.
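For instance (a hypothetical spot check), on each node:

$ ompi_info | grep -i hetero
  Heterogeneous support: yes

If one side reports yes and the other no, the two builds were configured
differently and a heterogeneous run cannot work.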
We'll check ompi_info...
Thanks!
Steve.
<atlas3_ompi_info.txt> <diff.txt> <ppc64_ompi_info.txt>