> On Jun 2, 2015, at 7:10 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>
> On 6/1/2015 9:51 PM, Ralph Castain wrote:
>> I’m wondering if it is also possible that the error message is simply
>> printing that ID incorrectly. Looking at the code, it appears that we do
>> perform the network byte translation correctly when we set up the data for
>> transmission between the processes. However, I don’t see that translation
>> being done before we print the error message. Hence, I think the error
>> message is printing out the device ID incorrectly - and the problem truly is
>> just that the queues are different.
>>
>
> Does the code convert the device id/part number into HBO before looking it up
> in the .ini file?
All I could see was that it is converted to NBO for transmission, and to HBO at the remote end for use. So both sides should have accurate IDs. I don’t know what happens beyond that, I’m afraid - this isn’t my particular code area.

>
> Assuming atlas3 is just displaying the vendor and part numbers w/o converting
> to HBO, they do look correct. part ID 21505 is 0x5401, and part ID 22282240
> is 0x5401 swapped:
>
> [root@atlas3 openmpi]# echo $((0x5401))
> 21505
> [root@atlas3 openmpi]# echo $((0x01540000))
> 22282240
>
> Looking at the .ini on both nodes however, I see valid and identical entries
> for device 0x1425/0x5401:
>
> [root@ppc64-rhel71 openmpi]# grep -3 0x5401 *ini
>
> [Chelsio T5]
> vendor_id = 0x1425
> vendor_part_id =
> 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
> use_eager_rdma = 1
> mtu = 2048
> receive_queues = P,65536,64
>
> [root@atlas3 openmpi]# grep -3 0x5401 *ini
>
> [Chelsio T5]
> vendor_id = 0x1425
> vendor_part_id =
> 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
> use_eager_rdma = 1
> mtu = 2048
> receive_queues = P,65536,64
>
> So I still think somehow the one node is looking up the wrong entry in the
> .ini file.
>
> Also: Attached are the ompi-info outputs and a diff of the two.
>
> Steve.
>
>
>
>>> On Jun 1, 2015, at 7:30 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
>>> wrote:
>>>
>>> This is not a heterogeneous run-time issue -- it's the issue that Nathan
>>> cited: that OMPI detected different receive queue setups on different
>>> machines.
>>>
>>> As the error message states, the openib BTL simply cannot handle when
>>> different MPI processes specify different receive queue specifications.
>>>
>>> You mentioned that the device ID is being incorrectly identified: is that
>>> OMPI's fault, or something wrong with the device itself?
>>>
>>>
>>>
>>>> On Jun 1, 2015, at 6:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>>>
>>>> On 6/1/2015 9:53 AM, Ralph Castain wrote:
>>>>> Well, I checked and it looks to me like --hetero-apps is a stale option in
>>>>> the master at least - I don’t see where it gets used.
>>>>>
>>>>> Looking at the code, I would suspect that something didn’t get configured
>>>>> correctly - either the --enable-heterogeneous flag didn’t get set on one
>>>>> side, or we incorrectly failed to identify the BE machine, or both. You
>>>>> might run ompi_info on the two sides and verify they both were built
>>>>> correctly
>>>> We'll check ompi_info...
>>>>
>>>> Thanks!
>>>>
>>>> Steve.
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/users/2015/06/27025.php
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>
>
> <atlas3_ompi_info.txt><diff.txt><ppc64_ompi_info.txt>
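
For anyone following the arithmetic in Steve's message, here is a small Python sketch (not Open MPI code) of the 32-bit byte swap that relates the two part IDs, and of why a value that was never converted back to host byte order would miss the [Chelsio T5] entry. The names swap32 and matches_ini and the shortened part-ID set are made up for illustration. Whether the openib BTL really performs its .ini lookup on an unconverted value is still an open question in this thread.

```python
import struct


def swap32(value: int) -> int:
    """Reverse the byte order of a 32-bit unsigned integer."""
    # Pack as big-endian, then reinterpret the same four bytes as little-endian.
    return struct.unpack("<I", struct.pack(">I", value))[0]


# Subset of the vendor_part_id list quoted for [Chelsio T5] (illustrative only).
CHELSIO_T5_PART_IDS = {0xB000, 0xB001, 0x5400, 0x5401, 0x5402}


def matches_ini(part_id: int) -> bool:
    """Hypothetical lookup: is part_id listed under [Chelsio T5]?"""
    return part_id in CHELSIO_T5_PART_IDS


reported = 22282240                    # one of the two part IDs in the error message
print(hex(reported))                   # 0x1540000
print(swap32(reported))                # 21505, i.e. 0x5401
print(matches_ini(reported))           # False: the swapped form is not in the list
print(matches_ini(swap32(reported)))   # True: the host-order form is
```

If the lookup really were done on the swapped value, the section match would fail and a different set of receive_queues could be picked up, which would be consistent with the mismatch in the error message.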