Ralph Castain wrote:
> On Mar 4, 2010, at 7:51 AM, Prentice Bisbal wrote:
> 
>>
>> Ralph Castain wrote:
>>> On Mar 4, 2010, at 7:27 AM, Prentice Bisbal wrote:
>>>
>>>> Ralph Castain wrote:
>>>>> On Mar 3, 2010, at 12:16 PM, Prentice Bisbal wrote:
>>>>>
>>>>>> Eugene Loh wrote:
>>>>>>> Prentice Bisbal wrote:
>>>>>>>> Eugene Loh wrote:
>>>>>>>>
>>>>>>>>> Prentice Bisbal wrote:
>>>>>>>>>
>>>>>>>>>> Is there a limit on how many MPI processes can run on a single host?
>>>>>>>>>>
>>>>>>> Depending on which OMPI release you're using, I think you need something
>>>>>>> like 4*np up to 7*np (plus a few) descriptors.  So, with 256, you need
>>>>>>> 1000+ descriptors.  You're quite possibly up against your limit, though
>>>>>>> I don't know for sure that that's the problem here.
>>>>>>>
>>>>>>> You say you're running 1.2.8.  That's "a while ago", so would you
>>>>>>> consider updating as a first step?  Among other things, newer OMPIs will
>>>>>>> generate a much clearer error message if the descriptor limit is the
>>>>>>> problem.
>>>>>> While 1.2.8 might be "a while ago", upgrading software just because it's
>>>>>> "old" is not a valid argument.
>>>>>>
>>>>>> I can install the latest version of OpenMPI, but it will take a little
>>>>>> while.
>>>>> Maybe not because it is "old", but Eugene is correct. The old versions of 
>>>>> OMPI required more file descriptors than the newer versions.
>>>>>
>>>>> That said, you'll still need a minimum of 4x the number of procs on the 
>>>>> node even with the latest release. I suggest talking to your sys admin 
>>>>> about getting the limit increased. It sounds like it has been set 
>>>>> unrealistically low.
>>>>>
>>>>>
>>>> I *am* the system admin! ;)
>>>>
>>>> The file descriptor limit is the RHEL default of 1024, so I would not
>>>> characterize it as "unrealistically low".  I assume someone with much
>>>> more knowledge of OS design and administration than me came up with this
>>>> default, so I'm hesitant to change it without good reason. If there was
>>>> good reason, I'd have no problem changing it. I have read that setting
>>>> it to more than 8192 can lead to system instability.
>>> Never heard that, and most HPC systems have it set a great deal higher 
>>> without trouble.
>> I just read that the other day. Not sure where, though. Probably a forum
>> posting somewhere. I'll take your word for it that it's safe to increase
>> if necessary.
>>> However, the choice is yours. If you have a large SMP system, you'll 
>>> eventually be forced to change it or severely limit its usefulness for MPI. 
>>> RHEL sets it that low arbitrarily as a way of saving memory by keeping the 
>>> fd table small, not because the OS can't handle it.
>>>
>>> Anyway, that is the problem. Nothing we (or any MPI) can do about it as the 
>>> fd's are required for socket-based communications and to forward I/O.
>> Thanks, Ralph, that's exactly the answer I was looking for - where this
>> limit was coming from.
>>
>> I can see how on a large SMP system the fd limit would have to be
>> increased. In normal circumstances, my cluster nodes should never have
>> more than 8 MPI processes running at once (per node), so I shouldn't be
>> hitting that limit on my cluster.
> 
> Ah, okay! That helps a great deal in figuring out what to advise you. In your 
> earlier note, it sounded like you were running all 512 procs on one node, so 
> I assumed you had a large single-node SMP.
> 
> In this case, though, the problem is solely that you are using the 1.2 
> series. In that series, mpirun and each process opened many more sockets to 
> all processes in the job. That's why you are overrunning your limit.
> 
> Starting with 1.3, the number of sockets being opened on each node is only 3 times 
> the number of procs on the node, plus a couple for the daemon. If you are 
> using TCP for MPI communications, then each MPI connection will open another 
> socket as these messages are direct and not routed.
> 
> Upgrading to the 1.4 series should resolve the problem you saw.

After upgrading to 1.4.1, I can start up to 253 processes on one host:

mpirun -np 253 mpihello

This is an increase of ~100 over 1.2.8. When it does fail, it gives a more
useful error message:

$ mpirun -np 254 mpihello
[juno.sns.ias.edu:22862] [[6399,0],0] ORTE_ERROR_LOG: The system limit
on number of network connections a process can open was reached in file
../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
--------------------------------------------------------------------------
Error: system limit exceeded on number of network connections that can
be open

This can be resolved by setting the mca parameter
opal_set_max_sys_limits to 1,
increasing your limit descriptor setting (using limit or ulimit commands),
or asking the system administrator to increase the system limit.
--------------------------------------------------------------------------
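
That 253/254 boundary lines up reasonably well with the "minimum of 4x"
figure Ralph mentioned, at least as a rough back-of-the-envelope check:
about 4 descriptors per local process (4 x 253 = 1012) plus a handful more
for the daemon and stdin/stdout/stderr just fits under the RHEL default of
1024 open files, while 4 x 254 plus the same overhead tips over it. I
haven't counted the descriptors exactly, so treat that breakdown as an
estimate rather than anything authoritative.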


Case closed, court adjourned. Thanks for all the help and explanations.
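
For the archives, in case anyone else trips over this: if we ever do need
more than ~253 local processes on a node, the fix on our RHEL systems would
be to raise the open-files limit ("nofile") rather than anything on the
Open MPI side. A minimal sketch (the values below are only illustrative,
not a tuned recommendation):

  # System-wide, in /etc/security/limits.conf on the node (as root);
  # takes effect at the next login via pam_limits:
  *    soft    nofile    4096
  *    hard    nofile    8192

  # Or just for the launching shell, if the hard limit already permits it:
  $ ulimit -n 4096
  $ mpirun -np 512 mpihello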

Prentice


> 
> HTH
> Ralph
> 
>>>
>>>> This is admittedly an unusual situation - in normal use, no one would ever
>>>> want to run that many processes on a single system - so I don't see any
>>>> justification for modifying that setting.
>>>>
>>>> Yesterday I spoke to the researcher who originally asked me about this limit -
>>>> he just wanted to know what the limit was, and doesn't actually plan to
>>>> do any "real" work with that many processes on a single node, rendering
>>>> this whole discussion academic.
>>>>
>>>> I did install OpenMPI 1.4.1 yesterday, but I haven't had a chance to
>>>> test it yet. I'll post the results of testing here.
>>>>
>>>>>>>>>> I have a user trying to test his code on the command line on a single
>>>>>>>>>> host before running it on our cluster like so:
>>>>>>>>>>
>>>>>>>>>> mpirun -np X foo
>>>>>>>>>>
>>>>>>>>>> When he tries to run it on a large number of processes (X = 256, 512), 
>>>>>>>>>> the
>>>>>>>>>> program fails, and I can reproduce this with a simple "Hello, World"
>>>>>>>>>> program:
>>>>>>>>>>
>>>>>>>>>> $ mpirun -np 256 mpihello
>>>>>>>>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>>>>>>>>> exited on signal 15 (Terminated).
>>>>>>>>>> 252 additional processes aborted (not shown)
>>>>>>>>>>
>>>>>>>>>> I've done some testing and found that X must be less than 155 for this
>>>>>>>>>> program to work.
>>>>>>>>>> Is this a bug, part of the standard, or a design/implementation
>>>>>>>>>> decision?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> One possible issue is the limit on the number of descriptors.  The 
>>>>>>>>> error
>>>>>>>>> message should be pretty helpful and descriptive, but perhaps you're
>>>>>>>>> using an older version of OMPI.  If this is your problem, one 
>>>>>>>>> workaround
>>>>>>>>> is something like this:
>>>>>>>>>
>>>>>>>>> unlimit descriptors
>>>>>>>>> mpirun -np 256 mpihello
>>>>>>>>>
>>>>>>>> Looks like I'm not allowed to set that as a regular user:
>>>>>>>>
>>>>>>>> $ ulimit -n 2048
>>>>>>>> -bash: ulimit: open files: cannot modify limit: Operation not permitted
>>>>>>>>
>>>>>>>> Since I am the admin, I could change that elsewhere, but I'd rather not
>>>>>>>> do that system-wide unless absolutely necessary.
>>>>>>>>
>>>>>>>>> though I guess the syntax depends on what shell you're running.  
>>>>>>>>> Another
>>>>>>>>> is to set the MCA parameter opal_set_max_sys_limits to 1.
>>>>>>>>>
>>>>>>>> That didn't work either:
>>>>>>>>
>>>>>>>> $ mpirun -mca opal_set_max_sys_limits 1 -np 256 mpihello
>>>>>>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>>>>>>> exited on signal 15 (Terminated).
>>>>>>>> 252 additional processes aborted (not shown)

-- 
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ
