Steve,

MCA_BTL_OPENIB_MODEX_MSG_{HTON,NTOH} do not convert all the fields of the
mca_btl_openib_modex_message_t struct.

I would start here ...
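
To illustrate the kind of bug I mean, here is a minimal sketch. The struct, the field layout and every example_* name are hypothetical and are not the real mca_btl_openib_modex_message_t definition or macros. It only shows that a 32-bit field skipped by the conversion arrives byte-swapped on a peer of the opposite endianness, which would turn 0x5401 into 0x01540000 (22282240):

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>  /* htonl, ntohl */

/* Hypothetical stand-in for a modex message; NOT the real openib BTL layout. */
typedef struct {
    uint32_t vendor_id;
    uint32_t vendor_part_id;
} example_modex_msg_t;

/* Incomplete conversion: vendor_part_id is forgotten and stays in host order. */
static void example_hton_incomplete(example_modex_msg_t *m)
{
    m->vendor_id = htonl(m->vendor_id);
}

/* Complete conversion: every multi-byte field goes to network byte order. */
static void example_hton(example_modex_msg_t *m)
{
    m->vendor_id      = htonl(m->vendor_id);
    m->vendor_part_id = htonl(m->vendor_part_id);
}

/* Matching conversion back to host byte order on the receiving side. */
static void example_ntoh(example_modex_msg_t *m)
{
    m->vendor_id      = ntohl(m->vendor_id);
    m->vendor_part_id = ntohl(m->vendor_part_id);
}

/* Unconditional 32-bit byte swap: what an opposite-endian peer sees when a
 * field was sent without conversion. */
static uint32_t example_swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Sanity checks: a full round trip is lossless on any host, and swapping
 * 0x5401 yields the value reported in the error message. */
static void example_selfcheck(void)
{
    example_modex_msg_t m = { 0x1425u, 0x5401u };
    example_hton(&m);
    example_ntoh(&m);
    assert(m.vendor_id == 0x1425u && m.vendor_part_id == 0x5401u);
    assert(example_swap32(0x5401u) == 22282240u);
}
```

If the real macros skip a field the way example_hton_incomplete() does, the remote side would look up the swapped part ID in the .ini file and not find the Chelsio T5 entry. That is only a guess until someone checks the actual macros.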

Cheers,

Gilles

On Wednesday, June 3, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> Steve --
>
> I think that this falls directly in your purview since you volunteered to
> maintain the openib BTL (this HCA ID thing is part of the openib BTL
> bootstrapping).  :-)
>
>
> > On Jun 2, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >
> >
> >> On Jun 2, 2015, at 7:10 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
> >>
> >> On 6/1/2015 9:51 PM, Ralph Castain wrote:
> >>> I’m wondering if it is also possible that the error message is simply
> printing that ID incorrectly. Looking at the code, it appears that we do
> perform the network byte translation correctly when we set up the data for
> transmission between the processes. However, I don’t see that translation
> being done before we print the error message. Hence, I think the error
> message is printing out the device ID incorrectly - and the problem truly
> is just that the queues are different.
> >>>
> >>
> >> Does the code convert the device id/part number into HBO before looking
> it up in the .ini file?
> >
> > All I could see was that it is converted to NBO for transmission, and to
> HBO at the remote end for use.  So both sides should have accurate IDs. I
> don’t know what happens beyond that, I’m afraid - this isn’t my particular
> code area.
> >
> >>
> >> Assuming atlas3 is just displaying the vendor and part numbers w/o
> converting to HBO, they do look correct.  part ID 21505 is 0x5401, and part
> ID 22282240 is 0x5401 swapped:
> >>
> >> [root@atlas3 openmpi]# echo $((0x5401))
> >> 21505
> >> [root@atlas3 openmpi]# echo $((0x01540000))
> >> 22282240
> >>
> >> Looking at the .ini on both nodes however, I see valid and identical
> entries for device 0x1425/0x5401:
> >>
> >> [root@ppc64-rhel71 openmpi]# grep -3 0x5401 *ini
> >>
> >> [Chelsio T5]
> >> vendor_id = 0x1425
> >> vendor_part_id =
> 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
> >> use_eager_rdma = 1
> >> mtu = 2048
> >> receive_queues = P,65536,64
> >>
> >> [root@atlas3 openmpi]# grep -3 0x5401 *ini
> >>
> >> [Chelsio T5]
> >> vendor_id = 0x1425
> >> vendor_part_id =
> 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
> >> use_eager_rdma = 1
> >> mtu = 2048
> >> receive_queues = P,65536,64
> >>
> >> So I still think somehow the one node is looking up the wrong entry in
> the .ini file.
> >>
> >> Also:  Attached are the ompi_info outputs and a diff of the two.
> >>
> >> Steve.
> >>
> >>
> >>
> >>>> On Jun 1, 2015, at 7:30 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >>>>
> >>>> This is not a heterogeneous run-time issue -- it's the issue that
> Nathan cited: that OMPI detected different receive queue setups on
> different machines.
> >>>>
> >>>> As the error message states, the openib BTL simply cannot handle
> different MPI processes specifying different receive queue specifications.
> >>>>
> >>>> You mentioned that the device ID is being incorrectly identified: is
> that OMPI's fault, or something wrong with the device itself?
> >>>>
> >>>>
> >>>>
> >>>>> On Jun 1, 2015, at 6:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
> >>>>>
> >>>>> On 6/1/2015 9:53 AM, Ralph Castain wrote:
> >>>>>> Well, I checked and it looks to me like --hetero-apps is a stale
> option in the master at least - I don’t see where it gets used.
> >>>>>>
> >>>>>> Looking at the code, I would suspect that something didn’t get
> configured correctly - either the --enable-heterogeneous flag didn’t get set
> on one side, or we incorrectly failed to identify the BE machine, or both.
> You might run ompi_info on the two sides and verify they both were built
> correctly.
> >>>>> We'll check ompi_info...
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> Steve.
> >>>>> _______________________________________________
> >>>>> users mailing list
> >>>>> us...@open-mpi.org
> >>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/06/27025.php
> >>>>
> >>>> --
> >>>> Jeff Squyres
> >>>> jsquy...@cisco.com
> >>>> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>>>
> >>
> >>
> <atlas3_ompi_info.txt><diff.txt><ppc64_ompi_info.txt>
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
