Eugene Loh wrote:
> Prentice Bisbal wrote:
>> Eugene Loh wrote:
>>   
>>> Prentice Bisbal wrote:
>>>     
>>>> Is there a limit on how many MPI processes can run on a single host?
>>>>       
> Depending on which OMPI release you're using, I think you need something
> like 4*np up to 7*np (plus a few) descriptors.  So, with 256, you need
> 1000+ descriptors.  You're quite possibly up against your limit, though
> I don't know for sure that that's the problem here.
> 
> You say you're running 1.2.8.  That's "a while ago", so would you
> consider updating as a first step?  Among other things, newer OMPIs will
> generate a much clearer error message if the descriptor limit is the
> problem.

While 1.2.8 might be "a while ago", upgrading software just because it's
"old" is not a valid argument.

I can install the latest version of OpenMPI, but it will take a little
while.
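
In the meantime, the 4*np to 7*np estimate quoted above can at least be
compared against the descriptor limits on the host. Below is a minimal
Python sketch, assuming a Linux/Unix host with Python available. The
function names (descriptor_estimate, fits, report) are made up for
illustration, and the multipliers are only the rule of thumb from the
quoted message, not exact figures for any particular OMPI release.

```python
import resource


def descriptor_estimate(np_procs, extra=0):
    """Return the (low, high) descriptor estimate for np_procs local ranks.

    Uses the 4*np to 7*np rule of thumb from the thread; `extra` is a
    hypothetical stand-in for the "plus a few" part of that estimate.
    """
    return 4 * np_procs + extra, 7 * np_procs + extra


def fits(limit, needed):
    """True if a descriptor limit is unlimited or at least `needed`."""
    return limit == resource.RLIM_INFINITY or limit >= needed


def report(np_procs):
    """Compare the estimate with this process's open-file limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    low, high = descriptor_estimate(np_procs)
    return (
        f"np={np_procs}: estimated need {low} to {high} descriptors; "
        f"soft limit {soft} (covers high estimate: {fits(soft, high)}), "
        f"hard limit {hard} (covers high estimate: {fits(hard, high)})"
    )


if __name__ == "__main__":
    for n in (155, 256, 512):
        print(report(n))
```

If the hard limit itself is below the high estimate, an unprivileged
"ulimit -n" cannot raise the soft limit far enough, which would be
consistent with the "Operation not permitted" error quoted below.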


>>>> I have a user trying to test his code on the command-line on a single
>>>> host before running it on our cluster like so:
>>>>
>>>> mpirun -np X foo
>>>>
>>>> When he tries to run it with a large number of processes (X = 256, 512), the
>>>> program fails, and I can reproduce this with a simple "Hello, World"
>>>> program:
>>>>
>>>> $ mpirun -np 256 mpihello
>>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>>> exited on signal 15 (Terminated).
>>>> 252 additional processes aborted (not shown)
>>>>
>>>> I've done some testing and found that X must be < 155 for this program to work.
>>>> Is this a bug, part of the standard, or a design/implementation decision?
>>>>  
>>>>
>>>>       
>>> One possible issue is the limit on the number of descriptors.  The error
>>> message should be pretty helpful and descriptive, but perhaps you're
>>> using an older version of OMPI.  If this is your problem, one workaround
>>> is something like this:
>>>
>>> unlimit descriptors
>>> mpirun -np 256 mpihello
>>>     
>>
>> Looks like I'm not allowed to set that as a regular user:
>>
>> $ ulimit -n 2048
>> -bash: ulimit: open files: cannot modify limit: Operation not permitted
>>
>> Since I am the admin, I could change that elsewhere, but I'd rather not
>> do that system-wide unless absolutely necessary.
>>   
>>> though I guess the syntax depends on what shell you're running.  Another
>>> is to set the MCA parameter opal_set_max_sys_limits to 1.
>>>     
>> That didn't work either:
>>
>> $ mpirun -mca opal_set_max_sys_limits 1 -np 256 mpihello
>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>> exited on signal 15 (Terminated).
>> 252 additional processes aborted (not shown)
>>
>>   
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ
