Can you provide a reproducer for the hang? What kernel version are you using? 
Is xpmem installed?
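If you can't share a small test case right away, even the output of the following would help (assuming xpmem, if present on your systems, is loaded as a kernel module):

  uname -r              # kernel version
  lsmod | grep xpmem    # assumes xpmem, if installed, shows up as a kernel module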

-Nathan

On Jun 05, 2017, at 10:53 AM, Matt Thompson <fort...@gmail.com> wrote:

OMPI Users,

I was wondering whether there is a recommended way to "tune" vader to get around an 
intermittent hang in MPI_Wait.

I ask because I recently found that with Open MPI 2.1.x, on both my desktop and the 
supercomputer I have access to, the model seems to "deadlock" at an MPI_Wait call 
when vader is enabled. If I run as:

  mpirun --mca btl self,sm,tcp 

on my desktop it works. When I moved to my cluster, I tried the more generic:

  mpirun --mca btl ^vader

since that cluster uses openib, and with that things work. At least, I hope that's how one 
turns off vader in MCA speak. (Note: the deadlock is somewhat sporadic, 
but I now have a case that triggers it reproducibly.)
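
(If I understand the MCA docs correctly, setting it through the environment should be 
equivalent; "./a.out" below is just a stand-in for our executable:)

  export OMPI_MCA_btl=^vader   # same effect as --mca btl ^vader on the mpirun line
  mpirun -np 8 ./a.out         # ./a.out is a placeholder for the real executable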

Now, I know vader is supposed to be the "better" shared-memory transport, so I'd 
rather use it, and I thought maybe I could twiddle some tuning knobs. So I looked at:

  https://www.open-mpi.org/faq/?category=sm

and there I saw question 6, "How do I know what MCA parameters are available for 
tuning MPI performance?". But when I try the commands listed there (minus the stray 
HTML/CSS tags on that page):

(1081) $ ompi_info --param btl sm
                 MCA btl: sm (MCA v2.1.0, API v3.0.0, Component v2.1.0)
(1082) $ ompi_info --param mpool sm
(1083) $

Huh. I expected more, but searching around the Open MPI FAQs made me think I 
should use:

  ompi_info --param btl sm --level 9

which does spit out a lot, though the equivalent for mpool sm does not. 

Any ideas on which of the many knobs is worth turning first? Perhaps something whose 
default differs between sm and vader? I also tried "ompi_info --param btl vader 
--level 9" to see if it did anything, but it produces no output.
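
One thing I haven't tried yet is dumping everything and grepping, on the theory that the 
per-component query is simply not matching; if vader parameters are registered at all, 
I'd expect them to show up as btl_vader_*:

  ompi_info --all | grep btl_vader   # dump every parameter, filter for vader's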

I will note that this code runs just fine with Open MPI 2.0.2, as well as with Intel MPI 
and SGI MPT, so I'm thinking the code itself is okay and that something changed between 
Open MPI 2.0.x and 2.1.x. I see two entries in the Open MPI 2.1.0 announcement about 
vader, but nothing specific about how to "revert" them if they are even the cause 
(my one guess at a relevant knob is sketched after the quoted entries):

- Fix regression that lowered the maximum message bandwidth for
 large messages on some BTL network transports, such as openib, sm,
 and vader.

- The vader BTL is now more efficient in terms of memory usage when
 using XPMEM.
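
If the XPMEM-related change is the culprit, my best guess (and it is only a guess) at a 
knob to flip is vader's single-copy mechanism, something like:

  mpirun --mca btl_vader_single_copy_mechanism none -np 8 ./a.out   # ./a.out again a placeholder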

Thanks for any help,
Matt


-- 
Matt Thompson
Man Among Men
Fulcrum of History
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users