Emmanuel --

Looks like the right people missed this when you posted; sorry about that!

We're tracking it now: https://github.com/open-mpi/ompi/issues/6976



On Sep 13, 2019, at 3:04 AM, Emmanuel Thomé via users
<users@lists.open-mpi.org> wrote:

Hi,

Thanks Jeff for your reply, and sorry for this late follow-up...

On Sun, Aug 11, 2019 at 02:27:53PM -0700, Jeff Hammond wrote:
openmpi-4.0.1 gives essentially the same results (similar files
attached), but I have a few doubts as to whether I've run this check
correctly:
   - whether or not I should have a UCX build for an Omni-Path cluster
     (IIUC https://github.com/openucx/ucx/issues/750 is now fixed?),


UCX is not optimized for Omni Path.  Don't use it.

good.

Does that mean that the information conveyed by this message is
incomplete? It's easy to misconstrue it as an invitation to enable UCX.

   --------------------------------------------------------------------------
   By default, for Open MPI 4.0 and later, infiniband ports on a device
   are not used by default.  The intent is to use UCX for these devices.
   You can override this policy by setting the btl_openib_allow_ib MCA parameter
   to true.

     Local host:              node0
     Local adapter:           hfi1_0
     Local port:              1

   --------------------------------------------------------------------------
   --------------------------------------------------------------------------
   WARNING: There was an error initializing an OpenFabrics device.

     Local host:   node0
     Local device: hfi1_0
   --------------------------------------------------------------------------

   - which btl I should use (I understand that openib is headed for
     deprecation and it complains unless I do --mca btl openib --mca
     btl_openib_allow_ib true; fine. But then, which non-openib, non-tcp
     btl should I use instead?)


OFI->PS2 and PSM2 are the right conduits for Omni Path.

I assume you meant ofi->psm2 and psm2. I understand that --mca mtl ofi
should be right in that case, and that --mca mtl psm2 should be as well,
which unfortunately doesn't tell me much about pml and btl selection, if
those happen to matter (pml certainly does, based on my initial report).

It sounds like Open-MPI doesn't properly support the maximum transfer size
of PSM2.  One way to work around this is to wrap your MPI collective calls
and do <4G chunking yourself.

I'm afraid that's not a very satisfactory answer. Sure, once I've spent
the time diagnosing the issue, I could resort to that sort of kludge.
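
For what it's worth, a minimal sketch of that chunking kludge could look
like the code below. The wrapper name, the 1 GiB cap and the test size
are mine, purely for illustration; this is not from the attached files
and not anything Open MPI provides.

    /* chunked_bcast.c -- illustrative sketch only.
     * Broadcast an arbitrarily large buffer in slices capped at
     * CHUNK_BYTES each, so that no single MPI_Bcast call gets anywhere
     * near the 4 GiB mark discussed above.
     * Run e.g. with the mtl selection discussed in this thread:
     *   mpirun --mca mtl psm2 -n 2 ./chunked_bcast
     */
    #include <mpi.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define CHUNK_BYTES ((int64_t)1 << 30)   /* 1 GiB per call */

    static int bcast_chunked(void *buf, int64_t count, MPI_Datatype type,
                             int root, MPI_Comm comm)
    {
        MPI_Aint lb, extent;
        MPI_Type_get_extent(type, &lb, &extent);

        int64_t max_elems = CHUNK_BYTES / extent;
        if (max_elems < 1)
            max_elems = 1;

        char *p = buf;
        while (count > 0) {
            int n = (int)(count > max_elems ? max_elems : count);
            int rc = MPI_Bcast(p, n, type, root, comm);
            if (rc != MPI_SUCCESS)
                return rc;
            p += (int64_t)n * extent;
            count -= n;
        }
        return MPI_SUCCESS;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* ~5 GiB of doubles: above the 4 GiB threshold where the
         * truncation shows up, yet with a count that still fits in an
         * int, so only the total byte size is "large". */
        int64_t n = ((int64_t)5 << 30) / (int64_t)sizeof(double);
        double *buf = malloc((size_t)n * sizeof(double));
        if (rank == 0)
            for (int64_t i = 0; i < n; i++)
                buf[i] = (double)i;

        bcast_chunked(buf, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* a non-root rank could now check e.g. buf[n-1] == (double)(n-1) */
        free(buf);
        MPI_Finalize();
        return 0;
    }

Every collective that may carry a large buffer would need a similar
wrapper, which is exactly the kind of bookkeeping I'd rather the library
did for me.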

But the path to discovering the issue is long-winded. I'd have been
*MUCH* better off if openmpi had spat a big loud error message at me
(like it does for psm2). The fact that the ofi mtl silently drops part
of my data is extremely annoying.

Best,

E.


--
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
