In the master, the code is in opal/mca/btl/openib/btl_openib_component.c. In the 1.8/1.10 series, the code is in the same file, but located under the ompi/mca/btl/openib directory.
> On Jun 2, 2015, at 8:14 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>
> On 6/2/2015 10:04 AM, Ralph Castain wrote:
>>
>>> On Jun 2, 2015, at 7:10 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>>
>>> On 6/1/2015 9:51 PM, Ralph Castain wrote:
>>>> I'm wondering if it is also possible that the error message is simply
>>>> printing that ID incorrectly. Looking at the code, it appears that we do
>>>> perform the network byte translation correctly when we set up the data
>>>> for transmission between the processes. However, I don't see that
>>>> translation being done before we print the error message. Hence, I think
>>>> the error message is printing the device ID incorrectly - and the
>>>> problem truly is just that the queues are different.
>>>>
>>>
>>> Does the code convert the device ID/part number to host byte order
>>> before looking it up in the .ini file?
>>
>> All I could see was that it is converted to network byte order (NBO) for
>> transmission, and to host byte order (HBO) at the remote end for use. So
>> both sides should have accurate IDs. I don't know what happens beyond
>> that, I'm afraid - this isn't my particular code area.
>>
> That makes two of us. :)
>
> Where is this code located in the ompi tree?
>
> Are there any verbose parameters that will help show more detail on how it
> is searching the .ini file?
>
>
>>>
>>> Assuming atlas3 is just displaying the vendor and part numbers without
>>> converting to HBO, they do look correct. Part ID 21505 is 0x5401, and
>>> part ID 22282240 is 0x5401 byte-swapped:
>>>
>>> [root@atlas3 openmpi]# echo $((0x5401))
>>> 21505
>>> [root@atlas3 openmpi]# echo $((0x01540000))
>>> 22282240
>>>
>>> Looking at the .ini on both nodes, however, I see valid and identical
>>> entries for device 0x1425/0x5401:
>>>
>>> [root@ppc64-rhel71 openmpi]# grep -3 0x5401 *ini
>>>
>>> [Chelsio T5]
>>> vendor_id = 0x1425
>>> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
>>> use_eager_rdma = 1
>>> mtu = 2048
>>> receive_queues = P,65536,64
>>>
>>> [root@atlas3 openmpi]# grep -3 0x5401 *ini
>>>
>>> [Chelsio T5]
>>> vendor_id = 0x1425
>>> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
>>> use_eager_rdma = 1
>>> mtu = 2048
>>> receive_queues = P,65536,64
>>>
>>> So I still think somehow the one node is looking up the wrong entry in
>>> the .ini file.
>>>
>>> Also: attached are the ompi_info outputs and a diff of the two.
>>>
>>> Steve.
>>>
>>>
>>>
>>>>> On Jun 1, 2015, at 7:30 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>
>>>>> This is not a heterogeneous run-time issue -- it's the issue that
>>>>> Nathan cited: that OMPI detected different receive queue setups on
>>>>> different machines.
>>>>>
>>>>> As the error message states, the openib BTL simply cannot handle it
>>>>> when different MPI processes specify different receive queue
>>>>> specifications.
>>>>>
>>>>> You mentioned that the device ID is being incorrectly identified: is
>>>>> that OMPI's fault, or something wrong with the device itself?
>>>>>
>>>>>> On Jun 1, 2015, at 6:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>>>>>
>>>>>> On 6/1/2015 9:53 AM, Ralph Castain wrote:
>>>>>>> Well, I checked, and it looks to me like --hetero-apps is a stale
>>>>>>> option in the master, at least - I don't see where it gets used.
>>>>>>>
>>>>>>> Looking at the code, I would suspect that something didn't get
>>>>>>> configured correctly - either the --enable-heterogeneous flag didn't
>>>>>>> get set on one side, or we incorrectly failed to identify the
>>>>>>> big-endian machine, or both. You might run ompi_info on the two
>>>>>>> sides and verify they were both built correctly.
>>>>>> We'll check ompi_info...
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Steve.
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> jsquy...@cisco.com
>>>
>>> <atlas3_ompi_info.txt><diff.txt><ppc64_ompi_info.txt>
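For reference, the byte-order effect suspected in the thread above is easy to check directly. The following is a minimal C sketch, not the actual OMPI code: it assumes the part ID is carried in a 32-bit field, and the bswap32 helper is hand-rolled for the example.

  #include <stdio.h>
  #include <stdint.h>

  /* Byte-swap a 32-bit value, as would happen if one side read a
   * part ID encoded in the other endianness without converting
   * byte order. */
  static uint32_t bswap32(uint32_t v)
  {
      return ((v & 0x000000ffu) << 24) |
             ((v & 0x0000ff00u) <<  8) |
             ((v & 0x00ff0000u) >>  8) |
             ((v & 0xff000000u) >> 24);
  }

  int main(void)
  {
      uint32_t part_id = 0x5401;        /* Chelsio T5 part ID from the .ini */
      uint32_t swapped = bswap32(part_id);

      /* Prints 21505 and 22282240 -- exactly the two values reported by
       * the two nodes, which is why a missed byte-order conversion is
       * the suspected culprit. */
      printf("host order: %u (0x%08x)\n", (unsigned)part_id, (unsigned)part_id);
      printf("swapped:    %u (0x%08x)\n", (unsigned)swapped, (unsigned)swapped);
      return 0;
  }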
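To see why a byte-swapped ID would yield a different receive_queues value, here is a simplified sketch of the vendor_part_id match. part_id_listed is a hypothetical stand-in for the real parsing in btl_openib_component.c, and the part-ID list is truncated for brevity.

  #include <stdio.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  /* Simplified stand-in for the .ini part-ID match: the device's part ID
   * must appear in the vendor_part_id list for the [Chelsio T5] section's
   * receive_queues value to be applied. */
  static int part_id_listed(const char *list, uint32_t part_id)
  {
      char *copy = strdup(list), *save = NULL;
      int found = 0;
      for (char *tok = strtok_r(copy, ",", &save); tok != NULL;
           tok = strtok_r(NULL, ",", &save)) {
          if (strtoul(tok, NULL, 0) == part_id) { found = 1; break; }
      }
      free(copy);
      return found;
  }

  int main(void)
  {
      const char *t5_parts = "0xb000,0xb001,0x5400,0x5401,0x5402,0x5403";
      /* 0x5401 matches the section; the byte-swapped 22282240
       * (0x01540000) does not. */
      printf("0x5401   listed: %d\n", part_id_listed(t5_parts, 0x5401));
      printf("22282240 listed: %d\n", part_id_listed(t5_parts, 22282240));
      return 0;
  }

If the lookup falls through for 22282240, that node would run with default receive queue settings instead of the Chelsio-specific P,65536,64 - which would produce exactly the per-node receive-queue mismatch the openib BTL refuses to run with.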