Okay, let's try spreading them out more, just to avoid putting more procs on a 
node than you actually need. Add -bynode to your command line; this will spread 
the procs across all the nodes.
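
For example, something like this (taking the mpirun line from your output; the 
path and options are just illustrative, adjust them to your install):

$ /opt/mpi/openmpi/1.3.3/intel/bin/mpirun -bynode -np 100 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s cattle-bubbles.fa -o cattle-1.fa s_1_1_sequence.txt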

Our default mode is "byslot", which means we fill each node before adding procs 
to the next one. "bynode" puts one proc on each node, wrapping around until all 
procs have been assigned. You may lose a little performance as shared memory 
can't be used as much, but at least it has a better chance of running.
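
Just to illustrate with a made-up example: with 3 nodes of 8 slots each and 12 
procs, byslot would put ranks 0-7 on node 1 and ranks 8-11 on node 2, while 
bynode would put ranks 0,3,6,9 on node 1, ranks 1,4,7,10 on node 2, and ranks 
2,5,8,11 on node 3.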


On Oct 14, 2011, at 1:29 PM, Ashwani Kumar Mishra wrote:

> Hi Ralph,
> No idea how many file descriptors this program consumes :(
> 
> Best Regards,
> Ashwani
> 
> On Sat, Oct 15, 2011 at 12:08 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Should be plenty for us - does your program consume a lot?
> 
> 
> On Oct 14, 2011, at 12:25 PM, Ashwani Kumar Mishra wrote:
> 
>> Hi Ralph,
>> fs.file-max = 100000
>> is this OK, or is it too low?
>> 
>> Best Regards,
>> Ashwani
>> 
>> 
>> On Fri, Oct 14, 2011 at 11:45 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> Can't offer much about the qsub job. On the first one, what is your limit on 
>> the number of file descriptors? Could be your sys admin has it too low.
>> 
>> 
>> On Oct 14, 2011, at 12:07 PM, Ashwani Kumar Mishra wrote:
>> 
>>> Hello,
>>> When I try to submit this job on a cluster of 40 nodes, each with 8 
>>> processors & 8 GB RAM, I receive the following errors:
>>> 
>>> Both commands work well as long as I use up to 88 processors in the 
>>> cluster, but the moment I allocate more than 88 processors I get the 
>>> two errors below:
>>> 
>>> I tried setting the ulimit to unlimited & setting the MCA parameter 
>>> opal_set_max_sys_limits to 1, but the problem still won't go away.
>>> 
>>> 
>>> $ mpirun=/opt/psc/ompi/bin/mpirun abyss-pe np=100 name=cattle k=50 n=10  
>>> in=s_1_1_sequence.txt
>>> 
>>> /opt/mpi/openmpi/1.3.3/intel/bin/mpirun -np 100 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s cattle-bubbles.fa -o cattle-1.fa s_1_1_sequence.txt
>>> [coe:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of 
>>> pipes a process can open was reached in file base/iof_base_setup.c at line 
>>> 107
>>> [coe.:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of 
>>> pipes a process can open was reached in file odls_default_module.c at line 
>>> 203
>>> [coe.:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of 
>>> network connections a process can open was reached in file oob_tcp.c at 
>>> line 447
>>> --------------------------------------------------------------------------
>>> Error: system limit exceeded on number of network connections that can be 
>>> open
>>> 
>>> This can be resolved by setting the mca parameter opal_set_max_sys_limits 
>>> to 1,
>>> increasing your limit descriptor setting (using limit or ulimit commands),
>>> or asking the system administrator to increase the system limit.
>>> --------------------------------------------------------------------------
>>> make: *** [cattle-1.fa] Error 1
>>> 
>>> 
>>> 
>>> 
>>> When I submit the same job through qsub, I receive the following error:
>>> $ qsub  -cwd -pe  orte 100 -o qsub.out -e qsub.err -b y -N  abyss `which 
>>> mpirun` /home/genome/abyss/bin/ABYSS-P -k 50 s_1_1_sequence.txt -o av
>>> 
>>> 
>>> [compute-0-19.local][[28273,1],125][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)
>>> [compute-0-19.local][[28273,1],127][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)
>>> [compute-0-23.local][[28273,1],135][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.228 failed: Connection refused (111)
>>> [compute-0-23.local][[28273,1],133][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.228 failed: Connection refused (111)
>>> [compute-0-4.local][[28273,1],113][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)
>>> 
>>> 
>>> 
>>> Best Regards,
>>> Ashwani
>>> 
>>> 
>>> 