Steve,
MCA_BTL_OPENIB_MODEX_MSG_{HTON,NTOH} do not convert all the fields of the
mca_btl_openib_modex_message_t struct.

I would start here ...

Cheers,

Gilles

On Wednesday, June 3, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Steve --
>
> I think that this falls directly in your purview since you volunteered to
> maintain the openib BTL (this HCA ID thing is part of the openib BTL
> bootstrapping).  :-)
>
> > On Jun 2, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >
> >> On Jun 2, 2015, at 7:10 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
> >>
> >> On 6/1/2015 9:51 PM, Ralph Castain wrote:
> >>> I’m wondering if it is also possible that the error message is simply
> >>> printing that ID incorrectly. Looking at the code, it appears that we do
> >>> perform the network byte translation correctly when we set up the data for
> >>> transmission between the processes. However, I don’t see that translation
> >>> being done before we print the error message. Hence, I think the error
> >>> message is printing out the device ID incorrectly - and the problem truly
> >>> is just that the queues are different.
> >>>
> >>
> >> Does the code convert the device id/part number into HBO before looking
> >> it up in the .ini file?
> >
> > All I could see was that it is converted to NBO for transmission, and to
> > HBO at the remote end for use. So both sides should have accurate IDs. I
> > don’t know what happens beyond that, I’m afraid - this isn’t my particular
> > code area.
> >
> >>
> >> Assuming atlas3 is just displaying the vendor and part numbers w/o
> >> converting to HBO, they do look correct.
> >> Part ID 21505 is 0x5401, and part ID 22282240 is 0x5401 swapped:
> >>
> >> [root@atlas3 openmpi]# echo $((0x5401))
> >> 21505
> >> [root@atlas3 openmpi]# echo $((0x01540000))
> >> 22282240
> >>
> >> Looking at the .ini on both nodes, however, I see valid and identical
> >> entries for device 0x1425/0x5401:
> >>
> >> [root@ppc64-rhel71 openmpi]# grep -3 0x5401 *ini
> >>
> >> [Chelsio T5]
> >> vendor_id = 0x1425
> >> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
> >> use_eager_rdma = 1
> >> mtu = 2048
> >> receive_queues = P,65536,64
> >>
> >> [root@atlas3 openmpi]# grep -3 0x5401 *ini
> >>
> >> [Chelsio T5]
> >> vendor_id = 0x1425
> >> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
> >> use_eager_rdma = 1
> >> mtu = 2048
> >> receive_queues = P,65536,64
> >>
> >> So I still think somehow the one node is looking up the wrong entry in
> >> the .ini file.
> >>
> >> Also: Attached are the ompi_info outputs and a diff of the two.
> >>
> >> Steve.
> >>
> >>>> On Jun 1, 2015, at 7:30 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >>>>
> >>>> This is not a heterogeneous run-time issue -- it's the issue that
> >>>> Nathan cited: that OMPI detected different receive queue setups on
> >>>> different machines.
> >>>>
> >>>> As the error message states, the openib BTL simply cannot handle it when
> >>>> different MPI processes specify different receive queue specifications.
> >>>>
> >>>> You mentioned that the device ID is being incorrectly identified: is
> >>>> that OMPI's fault, or something wrong with the device itself?
> >>>>
> >>>>> On Jun 1, 2015, at 6:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
> >>>>>
> >>>>> On 6/1/2015 9:53 AM, Ralph Castain wrote:
> >>>>>> Well, I checked and it looks to me like --hetero-apps is a stale
> >>>>>> option in the master at least - I don’t see where it gets used.
> >>>>>>
> >>>>>> Looking at the code, I would suspect that something didn’t get
> >>>>>> configured correctly - either the --enable-heterogeneous flag didn’t get set
> >>>>>> on one side, or we incorrectly failed to identify the BE machine, or both.
> >>>>>> You might run ompi_info on the two sides and verify they both were built
> >>>>>> correctly.
> >>>>> We'll check ompi_info...
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> Steve.
> >>>>> _______________________________________________
> >>>>> users mailing list
> >>>>> us...@open-mpi.org
> >>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2015/06/27025.php
> >>>>
> >>>> --
> >>>> Jeff Squyres
> >>>> jsquy...@cisco.com
> >>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
> >>>>
> >>>> Link to this post: http://www.open-mpi.org/community/lists/users/2015/06/27026.php
> >>> Link to this post: http://www.open-mpi.org/community/lists/users/2015/06/27027.php
> >>
> >> <atlas3_ompi_info.txt><diff.txt><ppc64_ompi_info.txt>
> >> Link to this post: http://www.open-mpi.org/community/lists/users/2015/06/27030.php
> >
> > Link to this post: http://www.open-mpi.org/community/lists/users/2015/06/27031.php
>
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/06/27033.php
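
For anyone following the thread, here is a minimal Python sketch of the failure mode suggested at the top of this message: a field that is left out of the HTON/NTOH conversion arrives byte-swapped when a big-endian sender talks to a little-endian receiver. This is not the openib BTL code. The two-field layout, the 32-bit field width, and the function names send_from_big_endian_host and receive_on_little_endian_host are assumptions made up for illustration; only the values 0x1425, 0x5401 and 22282240 come from the thread.

```python
import struct


def send_from_big_endian_host(vendor_id: int, vendor_part_id: int) -> bytes:
    """Hypothetical sender on a big-endian host (e.g. ppc64).

    Native order equals network order there, so the bytes on the wire are
    the same whether or not an HTON step is applied to a given field.
    Both fields are assumed to be 32-bit unsigned integers.
    """
    return struct.pack(">II", vendor_id, vendor_part_id)


def receive_on_little_endian_host(wire: bytes, convert_part_id: bool):
    """Hypothetical receiver on a little-endian host (e.g. x86_64).

    vendor_id is always converted from network order (NTOH applied).
    vendor_part_id is converted only if convert_part_id is True; otherwise
    the raw bytes are read in the receiver's native little-endian order,
    which is what happens when a field is missing from the NTOH macro.
    """
    vendor_id = struct.unpack_from(">I", wire, 0)[0]
    part_fmt = ">I" if convert_part_id else "<I"
    vendor_part_id = struct.unpack_from(part_fmt, wire, 4)[0]
    return vendor_id, vendor_part_id


wire = send_from_big_endian_host(0x1425, 0x5401)

converted = receive_on_little_endian_host(wire, convert_part_id=True)
skipped = receive_on_little_endian_host(wire, convert_part_id=False)

print("all fields converted:  ", converted)  # (5157, 21505)
print("part id not converted: ", skipped)    # (5157, 22282240)
```

With the conversion skipped, the receiver sees 22282240, the same value as `echo $((0x01540000))` in the quoted message, which would be consistent with one node looking up the wrong entry in the .ini file.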