Steve -- I think that this falls directly in your purview since you volunteered to maintain the openib BTL (this HCA ID thing is part of the openib BTL bootstrapping). :-)
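
For anyone following along, the byte-swap relationship Steve points out in the quoted thread (21505 vs. 22282240) can be checked outside of OMPI. This is only a rough sketch in Python, not code from the openib BTL; `swap32` is a helper name made up for illustration, and it assumes the part ID is handled as a 32-bit unsigned value:

```python
import struct


def swap32(value: int) -> int:
    """Reverse the byte order of a 32-bit unsigned integer (hypothetical helper)."""
    # Pack as big-endian (network byte order), then reinterpret as little-endian.
    return struct.unpack("<I", struct.pack(">I", value))[0]


part_id = 0x5401              # part ID as listed in the .ini file
swapped = swap32(part_id)     # what the value looks like without the conversion

print(f"part id : {part_id} (0x{part_id:08x})")
print(f"swapped : {swapped} (0x{swapped:08x})")
```

If the second value printed matches the odd-looking ID in the error message, then both nodes are most likely seeing the same part ID, and only one conversion to host byte order (or just the print) is missing.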

> On Jun 2, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> On Jun 2, 2015, at 7:10 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>
>> On 6/1/2015 9:51 PM, Ralph Castain wrote:
>>> I’m wondering if it is also possible that the error message is simply
>>> printing that ID incorrectly. Looking at the code, it appears that we do
>>> perform the network byte translation correctly when we set up the data for
>>> transmission between the processes. However, I don’t see that translation
>>> being done before we print the error message. Hence, I think the error
>>> message is printing out the device ID incorrectly - and the problem truly
>>> is just that the queues are different.
>>>
>>
>> Does the code convert the device id/part number into HBO before looking it
>> up in the .ini file?
>
> All I could see was that it is converted to NBO for transmission, and to HBO
> at the remote end for use. So both sides should have accurate IDs. I don’t
> know what happens beyond that, I’m afraid - this isn’t my particular code
> area.
>
>>
>> Assuming atlas3 is just displaying the vendor and part numbers w/o
>> converting to HBO, they do look correct. Part ID 21505 is 0x5401, and part
>> ID 22282240 is 0x5401 swapped:
>>
>> [root@atlas3 openmpi]# echo $((0x5401))
>> 21505
>> [root@atlas3 openmpi]# echo $((0x01540000))
>> 22282240
>>
>> Looking at the .ini on both nodes, however, I see valid and identical entries
>> for device 0x1425/0x5401:
>>
>> [root@ppc64-rhel71 openmpi]# grep -3 0x5401 *ini
>>
>> [Chelsio T5]
>> vendor_id = 0x1425
>> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
>> use_eager_rdma = 1
>> mtu = 2048
>> receive_queues = P,65536,64
>>
>> [root@atlas3 openmpi]# grep -3 0x5401 *ini
>>
>> [Chelsio T5]
>> vendor_id = 0x1425
>> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
>> use_eager_rdma = 1
>> mtu = 2048
>> receive_queues = P,65536,64
>>
>> So I still think somehow the one node is looking up the wrong entry in the
>> .ini file.
>>
>> Also: Attached are the ompi_info outputs and a diff of the two.
>>
>> Steve.
>>
>>>> On Jun 1, 2015, at 7:30 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
>>>> wrote:
>>>>
>>>> This is not a heterogeneous run-time issue -- it's the issue that Nathan
>>>> cited: that OMPI detected different receive queue setups on different
>>>> machines.
>>>>
>>>> As the error message states, the openib BTL simply cannot handle when
>>>> different MPI processes specify different receive queue specifications.
>>>>
>>>> You mentioned that the device ID is being incorrectly identified: is that
>>>> OMPI's fault, or something wrong with the device itself?
>>>>
>>>>> On Jun 1, 2015, at 6:06 PM, Steve Wise <sw...@opengridcomputing.com>
>>>>> wrote:
>>>>>
>>>>> On 6/1/2015 9:53 AM, Ralph Castain wrote:
>>>>>> Well, I checked and it looks to me like --hetero-apps is a stale option
>>>>>> in the master at least - I don’t see where it gets used.
>>>>>>
>>>>>> Looking at the code, I would suspect that something didn’t get
>>>>>> configured correctly - either the --enable-heterogeneous flag didn’t get
>>>>>> set on one side, or we incorrectly failed to identify the BE machine, or
>>>>>> both. You might run ompi_info on the two sides and verify they both were
>>>>>> built correctly.
>>>>> We'll check ompi_info...
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Steve.
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/users/2015/06/27025.php
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/users/2015/06/27026.php
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/06/27027.php
>>
>> <atlas3_ompi_info.txt><diff.txt><ppc64_ompi_info.txt>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/06/27030.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/06/27031.php

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/