On Nov 9, 2017, at 6:51 PM, Forai,Petar <petar.fo...@imp.ac.at> wrote:
> 
> We’re observing output such as the following when running non-trivial MPI 
> software through SLURM’s srun
> 
> [cn-11:52778] unrecognized payload type 255
> [cn-11:52778] base = 0x9ce2c0, proto = 0x9ce2c0, hdr = 0x9ce300
> [cn-11:52778]    0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [cn-11:52778]   10: 00 00 00 00 00 00 06 02 ff 0c 1f c2 06 02 ff 0c
> [cn-11:52778]   20: b9 8f 08 00 45 00 00 3c 00 00 40 00 08 11 5d 5d
> [cn-11:52778]   30: 0a 95 00 16 0a 95 00 15 e5 05 e8 d9 00 28 7c 8c
> [cn-11:52778]   40: 01 00 00 00 00 00 31 b6 00 00 8f e3 00 00 00 00
> [cn-11:52778]   50: 00 00 00 00 00 00 06 02 ff 0c d3 25 06 02 ff 0c
> [cn-11:52778]   60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [cn-11:52778]   70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> It is independent of the software BUT is NOT observable when running with 
> mpiexec/mpirun.

That is extremely odd.  I cannot think of how the choice of launcher would 
affect the usNIC BTL.

>  When switching to the TCP or vader BTL we have clean output and the message 
> is not observed. It is output by different ranks on various nodes, so not 
> reproducibly the same nodes.
> 
> The location of the message seems to be from here[1] 

Let me take a step back and explain the usNIC BTL: it uses OS-bypass UDP for 
communication.  This means that it is connectionless, and will accept datagrams 
from anywhere.  When the usNIC BTL receives a message, it does a few things to 
verify that it is both an Open MPI frame and from a peer that it recognizes.  
If the message fails any of the verifications, the usNIC BTL simply drops it.
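
As a rough illustration of that accept-or-drop logic, here is a minimal Python sketch.  It is not the usNIC BTL code or its wire format: MAGIC, HEADER, KNOWN_TYPES, and accept_datagram are hypothetical names invented for this example, and the 5-byte header layout is an assumption.

```python
import struct

# Everything below is hypothetical and for illustration only; the real
# usNIC BTL header layout and checks differ.
MAGIC = 0x4F4D5049               # made-up marker identifying "our" frames
HEADER = struct.Struct("!IB")    # made-up header: 32-bit magic, 8-bit payload type
KNOWN_TYPES = {1, 2, 3}          # made-up set of valid payload types


def accept_datagram(data, sender, known_peers):
    """Return True if the datagram should be processed, False if dropped."""
    # A connectionless socket receives from anyone, so check the sender first.
    if sender not in known_peers:
        return False
    # Too short to even contain the header: treat as corrupted or foreign.
    if len(data) < HEADER.size:
        return False
    magic, payload_type = HEADER.unpack_from(data)
    # Wrong marker or unknown payload type (e.g. 255): drop silently.
    return magic == MAGIC and payload_type in KNOWN_TYPES


if __name__ == "__main__":
    peers = {("10.149.0.22", 58629)}
    frame = HEADER.pack(MAGIC, 255) + b"payload"
    print(accept_datagram(frame, ("10.149.0.22", 58629), peers))
```

In this sketch a frame from a known peer with payload type 255 is rejected, which is the kind of situation in which a receiver would log "unrecognized payload type" before dropping the frame.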

There are two usual reasons that the usNIC BTL ends up dropping a message:

1. It was a valid message from a peer, but it got corrupted in transit.

PSA: corrupted packets happen.  Usually the network layer filters them out and 
user-level processes don't see them -- but occasionally one can squeak through 
and still be received in userspace.

If a valid message gets dropped, it will simply be re-transmitted by the sender 
a short time later.

2. It was a message from something else (i.e., a non-Open MPI sender).

In my internal Cisco testing, for example, I periodically get frames from Cisco 
IT malware scanners (i.e., they find my open usNIC UDP ports and try to send 
traffic to them).  In these cases, the usNIC BTL dropping the frame is the 
Right Thing To Do.
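
One way to get a hint about which case applies is to decode the headers visible in the hex dump from the report.  The Python sketch below assumes that an Ethernet header starts at offset 0x16 of the dump, followed by an IPv4 header and a UDP header; that offset is a guess based on the 08 00 45 00 byte pattern, not something taken from the usNIC BTL source.  DUMP, ETH_OFFSET, and decode_headers are names made up for this example.

```python
import ipaddress
import struct

# Bytes copied from offsets 0x00-0x4f of the hex dump in the report.
DUMP = bytes.fromhex(
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 06 02 ff 0c 1f c2 06 02 ff 0c "
    "b9 8f 08 00 45 00 00 3c 00 00 40 00 08 11 5d 5d "
    "0a 95 00 16 0a 95 00 15 e5 05 e8 d9 00 28 7c 8c "
    "01 00 00 00 00 00 31 b6 00 00 8f e3 00 00 00 00"
)

ETH_OFFSET = 0x16  # assumption: where the Ethernet header appears to start


def mac_str(raw):
    return ":".join(f"{b:02x}" for b in raw)


def decode_headers(buf, eth_offset=ETH_OFFSET):
    """Decode Ethernet, IPv4, and UDP headers found at eth_offset."""
    dst_mac, src_mac, ethertype = struct.unpack_from("!6s6sH", buf, eth_offset)
    ip_off = eth_offset + 14
    (ver_ihl, _tos, total_len, _ident, _frag, ttl, proto, _csum,
     src, dst) = struct.unpack_from("!BBHHHBBH4s4s", buf, ip_off)
    ihl = (ver_ihl & 0x0F) * 4
    # Fold the one's-complement sum of the header; 0xffff means the
    # IPv4 header checksum is consistent.
    total = sum(struct.unpack_from(f"!{ihl // 2}H", buf, ip_off))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    sport, dport, udp_len, _udp_csum = struct.unpack_from("!HHHH", buf, ip_off + ihl)
    return {
        "src_mac": mac_str(src_mac), "dst_mac": mac_str(dst_mac),
        "ethertype": ethertype, "ttl": ttl, "proto": proto,
        "ip_total_len": total_len, "ip_header_ok": total == 0xFFFF,
        "src_ip": str(ipaddress.IPv4Address(src)),
        "dst_ip": str(ipaddress.IPv4Address(dst)),
        "src_port": sport, "dst_port": dport, "udp_len": udp_len,
    }


if __name__ == "__main__":
    for key, value in decode_headers(DUMP).items():
        print(f"{key}: {value}")
```

Under that assumption, the decoded source and destination addresses show which host sent the frame, which can then be compared against the nodes in the job.  Note that the header checksum covers only the IPv4 header, so it says nothing about whether the payload was corrupted.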

> Any idea how to get rid of this or what might be the root cause? Hints what 
> to check for would be greatly appreciated! 

The messages are actually harmless -- they're just the usNIC BTL indicating 
that it is dropping a message.

But I can see how that would be annoying -- I'll change the default so that 
they are turned off in future versions (and only shown if the user 
specifically requests them).

For an immediate fix, you can basically #if 0 out the block in btl_usnic_recv.c 
that prints out those messages.  The attached patch does that and is against 
v2.0.2, but note that we literally just released v2.0.4 today (just additional 
bug fixes against the v2.0.x series).  Finally, the latest released version of 
Open MPI is v3.0.0, if you feel like upgrading.

-- 
Jeff Squyres
jsquy...@cisco.com

Attachment: ompi-v2.0.2-usnic-disable-unknown-post-messages.diff

