Eugene Loh wrote:
> Prentice Bisbal wrote:
>> Eugene Loh wrote:
>>
>>> Prentice Bisbal wrote:
>>>
>>>> Is there a limit on how many MPI processes can run on a single host?
>>>>
> Depending on which OMPI release you're using, I think you need something
> like 4*np up to 7*np (plus a few) descriptors. So, with 256, you need
> 1000+ descriptors. You're quite possibly up against your limit, though
> I don't know for sure that that's the problem here.
>
> You say you're running 1.2.8. That's "a while ago", so would you
> consider updating as a first step? Among other things, newer OMPIs will
> generate a much clearer error message if the descriptor limit is the
> problem.
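For reference, the arithmetic above can be checked against the current shell's limits with a few lines of script. This is only a rough sketch: the 4*np and 7*np multipliers are the estimates quoted above (the "plus a few" is ignored), and estimate_fds is a made-up helper name, not anything Open MPI provides.

```sh
#!/bin/sh
# Rough estimate of the descriptors needed to run np MPI processes on one
# host, using the 4*np .. 7*np range quoted above. The multipliers are
# estimates from this thread, not exact figures.
estimate_fds() {
    # $1 = number of processes, $2 = assumed descriptors per process
    echo $(( $1 * $2 ))
}

np=${1:-256}                      # default to the failing case, np = 256
low=$(estimate_fds "$np" 4)
high=$(estimate_fds "$np" 7)
soft=$(ulimit -S -n)              # current soft limit on open files
hard=$(ulimit -H -n)              # ceiling a regular user can raise it to

echo "np=$np: estimated descriptors needed: $low to $high"
echo "open-files limit: soft=$soft hard=$hard"

if [ "$soft" != "unlimited" ] && [ "$soft" -lt "$low" ]; then
    echo "soft limit is below even the low estimate"
fi
```

For np = 256 the estimate comes out to 1024 to 1792 descriptors. A regular user can raise the soft limit only as far as the hard limit; anything beyond that needs the administrator to raise the hard limit.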

While 1.2.8 might be "a while ago", being "old" is not by itself a valid
reason to upgrade software. I can install the latest version of OpenMPI,
but it will take a little while.

>>>> I have a user trying to test his code on the command-line on a single
>>>> host before running it on our cluster like so:
>>>>
>>>> mpirun -np X foo
>>>>
>>>> When he tries to run it on large number of process (X = 256, 512), the
>>>> program fails, and I can reproduce this with a simple "Hello, World"
>>>> program:
>>>>
>>>> $ mpirun -np 256 mpihello
>>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>>> exited on signal 15 (Terminated).
>>>> 252 additional processes aborted (not shown)
>>>>
>>>> I've done some testing and found that X <155 for this program to work.
>>>> Is this a bug, part of the standard, or design/implementation decision?
>>>>
>>> One possible issue is the limit on the number of descriptors. The error
>>> message should be pretty helpful and descriptive, but perhaps you're
>>> using an older version of OMPI. If this is your problem, one workaround
>>> is something like this:
>>>
>>> unlimit descriptors
>>> mpirun -np 256 mpihello
>>>
>>
>> Looks like I'm not allowed to set that as a regular user:
>>
>> $ ulimit -n 2048
>> -bash: ulimit: open files: cannot modify limit: Operation not permitted
>>
>> Since I am the admin, I could change that elsewhere, but I'd rather not
>> do that system-wide unless absolutely necessary.
>>
>>> though I guess the syntax depends on what shell you're running. Another
>>> is to set the MCA parameter opal_set_max_sys_limits to 1.
>>>
>> That didn't work either:
>>
>> $ mpirun -mca opal_set_max_sys_limits 1 -np 256 mpihello
>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>> exited on signal 15 (Terminated).
>> 252 additional processes aborted (not shown)
>>

-- 
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ