Re: [OMPI users] scaling issue beyond 1024 processes

2011-08-11 Thread CB
The job was dispatched by the SGE scheduler, but the MPI hello binary never gets executed on the compute nodes. It appears that the OpenMPI orted is waiting for something, as shown below:

Master node:
 4465 ?  Sl  0:05 /usr/local/sge/latest/bin/lx26-amd64/sge_execd
 4677 ?  S   0:00  \_

Re: [OMPI users] scaling issue beyond 1024 processes

2011-08-10 Thread Ralph Castain
When you say "stuck", what actually happens?

On Aug 10, 2011, at 2:09 PM, CB wrote:

> Now I was able to run MPI hello world example up to 3096 processes across 129
> nodes (24 cores per node).
> However, it seems to get stuck with 3097 processes.
>
> Any suggestions for troubleshooting?
>
> Th

Re: [OMPI users] scaling issue beyond 1024 processes

2011-08-10 Thread CB
Now I was able to run MPI hello world example up to 3096 processes across 129 nodes (24 cores per node). However, it seems to get stuck with 3097 processes.

Any suggestions for troubleshooting?

Thanks,
- Chansup

On Tue, Aug 9, 2011 at 2:02 PM, CB wrote:

> Hi Ralph,
>
> Yes, you are right. Tho
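A quick arithmetic check on the numbers above, offered only as a sketch: 129 nodes with 24 cores each give exactly 3096 slots, so 3097 is the first process count that exceeds one process per core. Whether that is what causes the hang is not established in this thread. The helper names below are hypothetical and not part of Open MPI or SGE.

```python
# Hypothetical helpers for illustration only; not an Open MPI or SGE API.

def slot_capacity(nodes: int, cores_per_node: int) -> int:
    """Total slots, assuming one MPI process per core."""
    return nodes * cores_per_node


def oversubscription(requested: int, nodes: int, cores_per_node: int) -> int:
    """Number of requested processes beyond the available slots (0 if none)."""
    return max(0, requested - slot_capacity(nodes, cores_per_node))


if __name__ == "__main__":
    # Figures taken from the message: 129 nodes, 24 cores per node.
    print(slot_capacity(129, 24))           # 3096
    print(oversubscription(3096, 129, 24))  # 0
    print(oversubscription(3097, 129, 24))  # 1
```

If the job really is limited to one process per core, the scheduler's slot allocation would be the first thing to compare against the requested process count.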

Re: [OMPI users] scaling issue beyond 1024 processes

2011-08-09 Thread CB
Hi Ralph,

Yes, you are right. Those nodes were still pointing to an old version. I'll check the installation on all nodes and try to run it again.

Thanks,
- Chansup

On Tue, Aug 9, 2011 at 1:48 PM, Ralph Castain wrote:

> That error makes no sense - line 335 is just a variable declaration. Sure

Re: [OMPI users] scaling issue beyond 1024 processes

2011-08-09 Thread Ralph Castain
That error makes no sense - line 335 is just a variable declaration. Sure you are not picking up a different version on that node?

On Aug 9, 2011, at 11:37 AM, CB wrote:

> Hi,
>
> Currently I'm having trouble scaling an MPI job beyond a certain limit.
> So I'm running an MPI hello example to

[OMPI users] scaling issue beyond 1024 processes

2011-08-09 Thread CB
Hi,

Currently I'm having trouble scaling an MPI job beyond a certain limit. So I'm running an MPI hello example to test beyond 1024 processes, but it failed with the following error at 2048 processes. It worked fine with 1024 processes.

I have a sufficient file descriptor limit (65536) defined for each process
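The limit a process actually inherits can differ from the configured value, for example when the job is started by a batch scheduler daemon. As a minimal sketch (Unix only, using Python's standard `resource` module), the effective limit can be read from inside a process on the node in question. The function names and the threshold passed in are illustrative assumptions.

```python
import resource


def nofile_limits() -> tuple[int, int]:
    """Return the (soft, hard) open-file limits seen by this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_enough_descriptors(needed: int) -> bool:
    """True if the soft limit is unlimited or at least `needed`."""
    soft, _hard = nofile_limits()
    return soft == resource.RLIM_INFINITY or soft >= needed


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"soft={soft} hard={hard}")
    # 65536 is the per-process limit quoted in the message.
    print("meets 65536:", has_enough_descriptors(65536))
```

Running this under the same scheduler and on the same nodes as the MPI job shows what the launched processes would inherit, which may not match an interactive login shell.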