Can you provide a reproducer for the hang? What kernel version are you using?
Is xpmem installed?
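If you're not sure, something like this should show whether the xpmem kernel module is loaded (assuming a module-based install):

lsmod | grep xpmem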
-Nathan
On Jun 05, 2017, at 10:53 AM, Matt Thompson <fort...@gmail.com> wrote:
OMPI Users,
I was wondering if there is a recommended way to "tune" vader to get around an
intermittent hang in MPI_Wait.
I ask because I recently found that with Open MPI 2.1.x, on both my desktop and
the supercomputer I have access to, the model seems to "deadlock" at an MPI_Wait
call when vader is enabled. If I run as:
mpirun --mca btl self,sm,tcp
on my desktop it works. When I moved to my cluster, I tried the more generic:
mpirun --mca btl ^vader
since the cluster uses openib, and with that things work. (I hope that's how one
turns off vader in MCA speak.) Note: the deadlock is a bit sporadic, but I now
have a case that seems to trigger it reproducibly.
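For what it's worth, I've been setting this on the mpirun command line, but I believe (assuming the usual MCA mechanisms apply here) the same thing can be done with an environment variable or in the MCA parameter file:

export OMPI_MCA_btl="^vader"

or, in $HOME/.openmpi/mca-params.conf:

btl = ^vader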
Now, I know vader is supposed to be the "better" sm communication tech, so I'd
rather use it and thought maybe I could twiddle some tuning knobs. So I looked at:
https://www.open-mpi.org/faq/?category=sm
and there I saw question 6 "How do I know what MCA parameters are available for
tuning MPI performance?". But when I try the commands listed (minus the HTML/CSS
tags):
(1081) $ ompi_info --param btl sm
MCA btl: sm (MCA v2.1.0, API v3.0.0, Component v2.1.0)
(1082) $ ompi_info --param mpool sm
(1083) $
Huh. I expected more, but searching around the Open MPI FAQs made me think I
should use:
ompi_info --param btl sm --level 9
which does spit out a lot, though the equivalent for mpool sm does not.
Any ideas on which of the many knobs is best to turn? Something that, by default,
is perhaps set one way for sm but differently for vader? I also tried "ompi_info
--param btl vader --level 9", but it doesn't print anything.
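The only other thing I could think to try, assuming ompi_info's --all option behaves the way I think it does, was to grep the full parameter dump for vader:

ompi_info --all --level 9 | grep btl_vader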
I will note that this code runs just fine with Open MPI 2.0.2 as well as with Intel MPI
and SGI MPT, so I'm thinking the code itself is okay and that something changed between
Open MPI 2.0.x and 2.1.x. I see two entries in the Open MPI 2.1.0 announcement that
mention vader, but nothing specific about how to "revert" the behavior if they are even
causing the problem (one guess of mine follows the quoted entries):
- Fix regression that lowered the memory maximum message bandwidth for
large messages on some BTL network transports, such as openib, sm,
and vader.
- The vader BTL is now more efficient in terms of memory usage when
using XPMEM.
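If the XPMEM change is what's biting me, my guess (and it is only a guess on my part, not something I found documented) is that there is a knob to turn off vader's single-copy mechanism, something like:

mpirun --mca btl_vader_single_copy_mechanism none ...

but I'd rather hear from someone who knows which knob, if any, is the right one.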
Thanks for any help,
Matt
--
Matt Thompson
Man Among Men
Fulcrum of History
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users