On 3/6/19 11:29 AM, Amir Shehata wrote:
The reason the load is split across tcp and o2ib0 for the 2.12 client is that the Multi-Rail (MR) code sees both interfaces, realizes it can use both of them, and so it does. To disable this behavior you can disable discovery on the 2.12 client; that should get the client to use only the single interface it's told to.
thank you very much, this worked out well.
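For the archives, the steps involved can be sketched roughly as follows (a sketch only; the config file path and the persistence mechanism are assumptions and may differ per distribution and Lustre version):

```shell
# Disable dynamic peer discovery at runtime on the 2.12 client
lnetctl set discovery 0

# Confirm the global setting took effect (the discovery field should read 0)
lnetctl global show

# To make the setting survive a reboot, export the running configuration
# to the file the lnet service imports at startup (path is an assumption)
lnetctl export > /etc/lnet.conf
```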
We're currently working on a feature (UDSP) which will allow the specification of a "preferred" network. In your case you could set the o2ib network as preferred; it will always be used unless it becomes unavailable. You get two benefits this way: 1) your preference is adhered to; 2) reliability, since the tcp network will be used if the o2ib network becomes unavailable.
This feature (UDSP) would be really great.
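Once UDSP is available, the preference could presumably be expressed along these lines (a hypothetical sketch; the feature was not yet released at the time of this thread, so the final syntax may differ):

```shell
# Prefer the o2ib network as traffic source (lower value = higher priority)
lnetctl udsp add --src o2ib --priority 0

# Inspect the configured selection policies
lnetctl udsp show
```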

Let me know if disabling discovery on your 2.12 clients works.

Yes, after disabling discovery on the client side the situation is much better.


thank you very much



thanks
amir

On Tue, 5 Mar 2019 at 18:49, Riccardo Veraldi <[email protected] <mailto:[email protected]>> wrote:

    Hello Amir, I answer in-line.

    On 3/5/19 3:42 PM, Amir Shehata wrote:
    It looks like the ping is passing. Did you try it several times
    to make sure it always pings successfully?

    The way it works is that the MDS (2.12) discovers all the interfaces
    on the peer. There is a concept of the primary NID for the peer:
    that's the first interface configured on the peer. In your case
    it's the o2ib NID. So when you do lnetctl peer show you'll see
    primary nid: <nid>@o2ib.

        - primary nid: 172.21.52.88@o2ib
           Multi-Rail: True
           peer ni:
             - nid: 172.21.48.250@tcp
               state: NA
             - nid: 172.21.52.88@o2ib
               state: NA
             - nid: 172.21.48.250@tcp1
               state: NA
             - nid: 172.21.48.250@tcp2
               state: NA

    On the MDS it uses the primary_nid to identify the peer. So you
    can ping using the Primary NID. LNet will resolve the Primary NID
    to the tcp NID. As you can see in the logs, it never actually
    talks over o2ib. It ends up talking to the peer on its TCP NID,
    which is what you want to do.

    I think the problem you're seeing is caused by the combination of
    2.12 and 2.10.x.
    From what I understand your servers are 2.12 and your clients are
    2.10.x.
    My clients are 2.10.5, but this problem arises also with one
    2.12.0 client; in any case, the combination of 2.10.5 clients and
    2.12.0 is not working right.

    Can you try disabling dynamic discovery on your servers:
    lnetctl set discovery 0

    I did this on the MDS and OSS. I did not disable discovery on the
    client side.

    now on the MDS side lnetctl peer show looks right.

    Anyway, on the client side where I have both IB and tcp, if I
    write to the Lustre filesystem (OSS), what happens is that the
    write operation is split/load-balanced between IB and tcp
    (Ethernet), and I do not want this. I would like only IB to be
    used when the client writes data to the OSS. But both peer NIs
    (o2ib, tcp) are seen from the 2.12.0 client and traffic goes to
    both of them, thus reducing performance because IB is not fully
    used. This does not happen with a 2.10.5 client writing to the
    same 2.12.0 OSS.
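One way to confirm where traffic is actually going is to compare per-interface counters before and after a test write (a sketch; exact field names and verbosity flags vary slightly across lnetctl versions):

```shell
# Per-NI statistics (send/receive counters); run before and after a
# test write and compare the deltas for the tcp NI vs the o2ib NI
lnetctl net show -v

# Aggregate LNet counters, as a sanity check
lnetctl stats show
```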


_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
