On Nov 9, 2017, at 6:51 PM, Forai,Petar <petar.fo...@imp.ac.at> wrote:
>
> We’re observing output such as the following when running non-trivial MPI
> software through SLURM’s srun
>
> [cn-11:52778] unrecognized payload type 255
> [cn-11:52778] base = 0x9ce2c0, proto = 0x9ce2c0, hdr = 0x9ce300
> [cn-11:52778]  0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [cn-11:52778] 10: 00 00 00 00 00 00 06 02 ff 0c 1f c2 06 02 ff 0c
> [cn-11:52778] 20: b9 8f 08 00 45 00 00 3c 00 00 40 00 08 11 5d 5d
> [cn-11:52778] 30: 0a 95 00 16 0a 95 00 15 e5 05 e8 d9 00 28 7c 8c
> [cn-11:52778] 40: 01 00 00 00 00 00 31 b6 00 00 8f e3 00 00 00 00
> [cn-11:52778] 50: 00 00 00 00 00 00 06 02 ff 0c d3 25 06 02 ff 0c
> [cn-11:52778] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [cn-11:52778] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> It is independent of the software BUT is NOT observable when running with
> mpiexec/mpirun.

That is extremely odd. I cannot think of how the choice of launcher would affect the usNIC BTL.

> When switching to the TCP or vader BTL we have clean output and the message
> is not observed. It is output by different ranks on various nodes, so not
> reproducibly the same nodes.
>
> The location of the message seems to be from here[1]

Let me take a step back and explain the usNIC BTL: it uses OS-bypass UDP for communication. This means that it is connectionless, and will accept datagrams from anywhere.

When the usNIC BTL receives a message, it does a few things to verify that it is both an Open MPI frame and from a peer that it recognizes. If the message fails any of the verifications, the usNIC BTL simply drops it.

There are two usual reasons that the usNIC BTL ends up dropping a message:

1. It was a valid message from a peer, but it got corrupted in transit. PSA: corrupted packets happen. Usually the network layer filters them out and user-level processes don't see them, but on rare occasions one can leak through and still be received in userspace. If a valid message gets dropped, the sender will simply re-transmit it a short time later.

2. It was a message from something else (i.e., a non-Open MPI sender). In my internal Cisco testing, for example, I periodically get frames from Cisco IT malware scanners (i.e., they find my open usNIC UDP ports and try to send traffic to them).

In these cases, the usNIC BTL dropping the frame is the Right Thing To Do.

> Any idea how to get rid of this or what might be the root cause? Hints what
> to check for would be greatly appreciated!

The messages are actually harmless -- they're just the usNIC BTL indicating that it is dropping a message. But I can see how that would be annoying. I'll change the default so that they are off in future versions (and only printed if the user specifically requests them).
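
To make the "verify, then drop" behavior above concrete, here is a minimal sketch in C. It is not the actual usNIC BTL code: the 5-byte header layout, the payload type values, and the names `check_frame` and `is_known_peer` are all made up for illustration. It only shows the general idea that a connectionless receiver checks each datagram and silently discards anything with an unrecognized payload type or from a sender it does not know.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical payload types; the real usNIC BTL defines its own values. */
enum { PAYLOAD_FRAG = 1, PAYLOAD_CHUNK = 2, PAYLOAD_ACK = 3 };

/* Hypothetical header: 1 byte payload type + 4 byte sender id (big-endian).
 * This is NOT the real usNIC wire format. */
#define HDR_LEN 5

typedef enum {
    FRAME_ACCEPT,
    FRAME_DROP_SHORT,        /* too short to contain a header */
    FRAME_DROP_BAD_TYPE,     /* e.g. "unrecognized payload type 255" */
    FRAME_DROP_UNKNOWN_PEER  /* sender is not one of our MPI peers */
} frame_verdict_t;

/* Linear search is fine for a sketch; a real receiver would use a hash. */
static bool is_known_peer(const uint32_t *peers, size_t npeers, uint32_t id)
{
    for (size_t i = 0; i < npeers; ++i) {
        if (peers[i] == id) {
            return true;
        }
    }
    return false;
}

/* Decide whether a received datagram should be processed or dropped. */
static frame_verdict_t check_frame(const uint8_t *buf, size_t len,
                                   const uint32_t *peers, size_t npeers)
{
    if (len < HDR_LEN) {
        return FRAME_DROP_SHORT;
    }

    uint8_t type = buf[0];
    if (type != PAYLOAD_FRAG && type != PAYLOAD_CHUNK && type != PAYLOAD_ACK) {
        return FRAME_DROP_BAD_TYPE;
    }

    /* Assemble the sender id byte by byte to avoid unaligned loads. */
    uint32_t sender = ((uint32_t)buf[1] << 24) | ((uint32_t)buf[2] << 16) |
                      ((uint32_t)buf[3] << 8)  |  (uint32_t)buf[4];
    if (!is_known_peer(peers, npeers, sender)) {
        return FRAME_DROP_UNKNOWN_PEER;
    }

    return FRAME_ACCEPT;
}
```

In this sketch a corrupted frame (case 1) and a frame from a scanner (case 2) both end up in one of the drop branches. For case 1 the receiver does nothing further, because the sender re-transmits on its own.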

For an immediate fix, you can basically #if 0 out the block in btl_usnic_recv.c that prints out those messages. The attached patch does that and is against v2.0.2, but note that we literally just released v2.0.4 today (just additional bug fixes against the v2.0.x series).

Finally, the latest released version of Open MPI is v3.0.0, if you feel like upgrading.

-- 
Jeff Squyres
jsquy...@cisco.com

ompi-v2.0.2-usnic-disable-unknown-post-messages.diff
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users