On Mar 3, 2010, at 12:16 PM, Prentice Bisbal wrote:

> Eugene Loh wrote:
>> Prentice Bisbal wrote:
>>> Eugene Loh wrote:
>>>
>>>> Prentice Bisbal wrote:
>>>>
>>>>> Is there a limit on how many MPI processes can run on a single host?
>>>>>
>> Depending on which OMPI release you're using, I think you need something
>> like 4*np up to 7*np (plus a few) descriptors. So, with 256, you need
>> 1000+ descriptors. You're quite possibly up against your limit, though
>> I don't know for sure that that's the problem here.
>>
>> You say you're running 1.2.8. That's "a while ago", so would you
>> consider updating as a first step? Among other things, newer OMPIs will
>> generate a much clearer error message if the descriptor limit is the
>> problem.
>
> While 1.2.8 might be "a while ago", upgrading software just because it's
> "old" is not a valid argument.
>
> I can install the latest version of OpenMPI, but it will take a little
> while.

Maybe not because it is "old", but Eugene is correct. The old versions of
OMPI required more file descriptors than the newer versions.

That said, you'll still need a minimum of 4x the number of procs on the
node even with the latest release. I suggest talking to your sys admin
about getting the limit increased. It sounds like it has been set
unrealistically low.

>
>
>>>>> I have a user trying to test his code on the command-line on a single
>>>>> host before running it on our cluster like so:
>>>>>
>>>>> mpirun -np X foo
>>>>>
>>>>> When he tries to run it on a large number of processes (X = 256, 512),
>>>>> the program fails, and I can reproduce this with a simple "Hello, World"
>>>>> program:
>>>>>
>>>>> $ mpirun -np 256 mpihello
>>>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>>>> exited on signal 15 (Terminated).
>>>>> 252 additional processes aborted (not shown)
>>>>>
>>>>> I've done some testing and found that X < 155 for this program to work.
>>>>> Is this a bug, part of the standard, or design/implementation decision?
>>>>>
>>>>>
>>>>>
>>>> One possible issue is the limit on the number of descriptors. The error
>>>> message should be pretty helpful and descriptive, but perhaps you're
>>>> using an older version of OMPI. If this is your problem, one workaround
>>>> is something like this:
>>>>
>>>> unlimit descriptors
>>>> mpirun -np 256 mpihello
>>>>
>>>
>>> Looks like I'm not allowed to set that as a regular user:
>>>
>>> $ ulimit -n 2048
>>> -bash: ulimit: open files: cannot modify limit: Operation not permitted
>>>
>>> Since I am the admin, I could change that elsewhere, but I'd rather not
>>> do that system-wide unless absolutely necessary.
>>>
>>>> though I guess the syntax depends on what shell you're running. Another
>>>> is to set the MCA parameter opal_set_max_sys_limits to 1.
>>>>
>>> That didn't work either:
>>>
>>> $ mpirun -mca opal_set_max_sys_limits 1 -np 256 mpihello
>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>> exited on signal 15 (Terminated).
>>> 252 additional processes aborted (not shown)
>>>
>>>
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Prentice Bisbal
> Linux Software Support Specialist/System Administrator
> School of Natural Sciences
> Institute for Advanced Study
> Princeton, NJ
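
For anyone wanting to check the arithmetic in this thread before asking for a higher limit, here is a minimal sketch. It multiplies the process count by the 4x to 7x range quoted above and compares the result with the shell's current soft and hard open-file limits. The helper names estimate_fds and fits_limit are made up for illustration and are not part of Open MPI. The estimate also ignores the "plus a few" extra descriptors, so treat the numbers as a lower bound.

```shell
#!/bin/sh
# Sketch only: estimate_fds and fits_limit are illustrative helpers,
# not Open MPI tools. The 4x-7x multipliers are the range quoted in the thread.

# Print the estimated descriptor count for $1 processes at $2 descriptors each.
estimate_fds() {
    echo $(( $1 * $2 ))
}

# Succeed if requirement $1 fits within limit $2 (a number or "unlimited").
fits_limit() {
    [ "$2" = "unlimited" ] || [ "$1" -le "$2" ]
}

np=256                 # process count used in the thread
soft=$(ulimit -Sn)     # current soft limit on open files
hard=$(ulimit -Hn)     # ceiling up to which a regular user may raise the soft limit

for factor in 4 7; do
    need=$(estimate_fds "$np" "$factor")
    if fits_limit "$need" "$soft"; then
        echo "${factor}x: about $need descriptors, within soft limit $soft"
    elif fits_limit "$need" "$hard"; then
        echo "${factor}x: about $need descriptors, above soft limit $soft but within hard limit $hard"
    else
        echo "${factor}x: about $need descriptors, above hard limit $hard (needs an administrator)"
    fi
done
```

If the script reports that the requirement is above the hard limit, that would be consistent with the "Operation not permitted" response to "ulimit -n 2048" shown above, since a regular user can only raise the soft limit as far as the hard limit.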