Prentice Bisbal wrote:
Eugene Loh wrote:
  
Prentice Bisbal wrote:
    
Is there a limit on how many MPI processes can run on a single host?
      
Depending on which OMPI release you're using, I think you need something like 4*np up to 7*np (plus a few) descriptors.  So, with np = 256, you need roughly 1024 to 1792 descriptors.  You're quite possibly up against your limit, though I don't know for sure that that's the problem here.

You say you're running 1.2.8.  That's "a while ago", so would you consider updating as a first step?  Among other things, newer OMPIs will generate a much clearer error message if the descriptor limit is the problem.
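
As a rough sanity check on the estimate above, the arithmetic can be sketched in bash. The 4x and 7x multipliers are taken from the message above and are only approximate (they vary by OMPI release); the variable names are illustrative.

```shell
#!/bin/bash
# Rough estimate only: the 4x..7x multipliers are the approximate
# per-process descriptor counts quoted above, not exact figures.
np=256
low=$((4 * np))    # optimistic estimate
high=$((7 * np))   # pessimistic estimate
soft=$(ulimit -n)  # current soft limit on open file descriptors

echo "np=$np needs roughly $low to $high descriptors (plus a few)"
echo "current soft limit on open files: $soft"
```

If the reported soft limit is well below the low estimate, the descriptor limit is a plausible suspect, though this does not prove it is the cause.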
I have a user trying to test his code on the command-line on a single
host before running it on our cluster like so:

mpirun -np X foo

When he tries to run it with a large number of processes (X = 256, 512), the
program fails, and I can reproduce this with a simple "Hello, World"
program:

$ mpirun -np 256 mpihello
mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
exited on signal 15 (Terminated).
252 additional processes aborted (not shown)

I've done some testing and found that X must be less than 155 for this
program to work. Is this a bug, part of the standard, or a
design/implementation decision?


      
One possible issue is the limit on the number of descriptors.  The error
message should be pretty helpful and descriptive, but perhaps you're
using an older version of OMPI.  If this is your problem, one workaround
is something like this:

unlimit descriptors
mpirun -np 256 mpihello
    
Looks like I'm not allowed to set that as a regular user:

$ ulimit -n 2048
-bash: ulimit: open files: cannot modify limit: Operation not permitted

Since I am the admin, I could change that elsewhere, but I'd rather not
do that system-wide unless absolutely necessary.
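
One thing that may be worth checking before changing anything system-wide: bash distinguishes a soft limit from a hard limit, and a regular user can raise the soft limit only as far as the hard limit. The sketch below assumes bash; it prints both values and raises the soft limit to the hard limit for the current shell only. Whether the hard limit on this host is high enough for 256 processes is not known from this thread.

```shell
#!/bin/bash
# Inspect the soft and hard limits on open file descriptors.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "soft limit: $soft, hard limit: $hard"

# A regular user may raise the soft limit up to (not beyond) the hard limit.
# This affects only the current shell and processes started from it.
ulimit -Sn "$hard"
echo "soft limit is now: $(ulimit -Sn)"
```

If the hard limit itself is too low, raising it requires root, which matches the "Operation not permitted" error above.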
  
though I guess the syntax depends on what shell you're running.  Another
is to set the MCA parameter opal_set_max_sys_limits to 1.
    
That didn't work either:

$ mpirun -mca opal_set_max_sys_limits 1 -np 256 mpihello
mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
exited on signal 15 (Terminated).
252 additional processes aborted (not shown)

  
