On 6/2/2015 10:46 AM, Gilles Gouaillardet wrote:
Steve,


MCA_BTL_OPENIB_MODEX_MSG_{HTON,NTOH} do not convert all the fields of the mca_btl_openib_modex_message_t struct.

I would start here ...
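
To illustrate the failure mode, here is a minimal Python sketch (not Open MPI code). It assumes a little-endian sender and a big-endian receiver, and the three field names are hypothetical, since the real mca_btl_openib_modex_message_t layout is not shown in this thread. A field that the sender's conversion skips arrives byte-swapped, while the converted fields arrive intact:

```python
import struct

# Hypothetical field names; the real mca_btl_openib_modex_message_t layout
# is not shown in this thread.
FIELDS = ("mtu", "vendor_id", "vendor_part_id")


def send_from_little_endian(msg, converted):
    """Serialize msg as a little-endian host would.

    Fields named in `converted` are written in network (big-endian) order;
    the rest are left in the sender's native little-endian order, which
    mimics a conversion macro that skips them.
    """
    wire = b""
    for name in FIELDS:
        byte_order = ">" if name in converted else "<"
        wire += struct.pack(byte_order + "I", msg[name])
    return wire


def receive_on_big_endian(wire):
    """A big-endian host reads every 32-bit field in network order."""
    return dict(zip(FIELDS, struct.unpack(">III", wire)))


msg = {"mtu": 2048, "vendor_id": 0x1425, "vendor_part_id": 0x5401}

complete = receive_on_big_endian(send_from_little_endian(msg, set(FIELDS)))
partial = receive_on_big_endian(
    send_from_little_endian(msg, {"mtu", "vendor_id"}))

print("0x%08x" % complete["vendor_part_id"])  # 0x00005401
print("0x%08x" % partial["vendor_part_id"])   # 0x01540000
```

The second value is the 22282240 reported earlier in the thread, which is what a skipped field would look like under these assumptions.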


Thanks.

Cheers,

Gilles

On Wednesday, June 3, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

    Steve --

    I think that this falls directly in your purview since you
    volunteered to maintain the openib BTL (this HCA ID thing is part
    of the openib BTL bootstrapping).  :-)


    > On Jun 2, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
    >
    >
    >> On Jun 2, 2015, at 7:10 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
    >>
    >> On 6/1/2015 9:51 PM, Ralph Castain wrote:
    >>> I’m wondering if it is also possible that the error message is
    simply printing that ID incorrectly. Looking at the code, it
    appears that we do perform the network byte translation correctly
    when we set up the data for transmission between the processes.
    However, I don’t see that translation being done before we print
    the error message. Hence, I think the error message is printing
    out the device ID incorrectly - and the problem truly is just that
    the queues are different.
    >>>
    >>
    >> Does the code convert the device id/part number into HBO before
    looking it up in the .ini file?
    >
    > All I could see was that it is converted to NBO for
    transmission, and to HBO at the remote end for use.  So both sides
    should have accurate IDs. I don’t know what happens beyond that,
    I’m afraid - this isn’t my particular code area.
    >
    >>
    >> Assuming atlas3 is just displaying the vendor and part numbers
    w/o converting to HBO, they do look correct. part ID 21505 is
    0x5401, and part ID 22282240 is 0x5401 swapped:
    >>
    >> [root@atlas3 openmpi]# echo $((0x5401))
    >> 21505
    >> [root@atlas3 openmpi]# echo $((0x01540000))
    >> 22282240
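
The arithmetic above can be checked with a short Python sketch. It assumes the part ID is held in a 32-bit field, which the thread does not state outright; `bswap` is a helper name introduced here for illustration.

```python
def bswap(value, width):
    """Reverse the byte order of an unsigned integer stored in `width` bytes."""
    return int.from_bytes(value.to_bytes(width, "big"), "little")


print(bswap(0x5401, 4))  # 22282240, i.e. 0x01540000
print(bswap(0x5401, 2))  # 340, i.e. 0x0154
```

The reported value 22282240 matches a 4-byte reversal of 0x5401 rather than a 2-byte one, which is consistent with a 32-bit field being byte-swapped.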
    >>
    >> Looking at the .ini on both nodes however, I see valid and
    identical entries for device 0x1425/0x5401:
    >>
    >> [root@ppc64-rhel71 openmpi]# grep -3 0x5401 *ini
    >>
    >> [Chelsio T5]
    >> vendor_id = 0x1425
    >> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
    >> use_eager_rdma = 1
    >> mtu = 2048
    >> receive_queues = P,65536,64
    >>
    >> [root@atlas3 openmpi]# grep -3 0x5401 *ini
    >>
    >> [Chelsio T5]
    >> vendor_id = 0x1425
    >> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
    >> use_eager_rdma = 1
    >> mtu = 2048
    >> receive_queues = P,65536,64
    >>
    >> So I still think somehow the one node is looking up the wrong
    entry in the .ini file.
    >>
    >> Also:  Attached are the ompi_info outputs and a diff of the two.
    >>
    >> Steve.
    >>
    >>
    >>
    >>>> On Jun 1, 2015, at 7:30 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
    >>>>
    >>>> This is not a heterogeneous run-time issue -- it's the issue
    that Nathan cited: that OMPI detected different receive queue
    setups on different machines.
    >>>>
    >>>> As the error message states, the openib BTL simply cannot
    handle different MPI processes specifying different receive
    queue specifications.
    >>>>
    >>>> You mentioned that the device ID is being incorrectly
    identified: is that OMPI's fault, or something wrong with the
    device itself?
    >>>>
    >>>>
    >>>>
    >>>>> On Jun 1, 2015, at 6:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
    >>>>>
    >>>>> On 6/1/2015 9:53 AM, Ralph Castain wrote:
    >>>>>> Well, I checked and it looks to me like --hetero-apps is a
    stale option in the master at least - I don’t see where it gets used.
    >>>>>>
    >>>>>> Looking at the code, I would suspect that something didn’t
    get configured correctly - either the --enable-heterogeneous flag
    didn’t get set on one side, or we incorrectly failed to identify
    the BE machine, or both. You might run ompi_info on the two sides
    and verify they both were built correctly
    >>>>> We'll check ompi_info...
    >>>>>
    >>>>> Thanks!
    >>>>>
    >>>>> Steve.
    >>>>> _______________________________________________
    >>>>> users mailing list
    >>>>> us...@open-mpi.org
    >>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
    >>>>> Link to this post:
    http://www.open-mpi.org/community/lists/users/2015/06/27025.php
    >>>>
    >>>> --
    >>>> Jeff Squyres
    >>>> jsquy...@cisco.com
    >>>> For corporate legal information go to:
    http://www.cisco.com/web/about/doing_business/legal/cri/
    >>>>
    >>
    >>
    >> <atlas3_ompi_info.txt><diff.txt><ppc64_ompi_info.txt>
    >


    --
    Jeff Squyres
    jsquy...@cisco.com
    For corporate legal information go to:
    http://www.cisco.com/web/about/doing_business/legal/cri/




_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/06/27034.php
