This was addressed to the Open MPI list. On the SGE list you suggested changing the PE allocation rule from $fill_up to $pe_slots; the PE is now:

[flengyel@nept OPENMPI]$ qconf -sp ompi
pe_name           ompi
slots             999
user_lists        Research
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $pe_slots
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

but the result is the same:

[flengyel@nept OPENMPI]$ tail -f sum.e23310
Starting server daemon at host "m18.gc.cuny.edu"
Server daemon successfully started with task id "1.m18"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m18.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... 129
[m18.gc.cuny.edu:26399] ERROR: A daemon on node m18.gc.cuny.edu failed to start as expected.
[m18.gc.cuny.edu:26399] ERROR: There may be more information available from
[m18.gc.cuny.edu:26399] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m18.gc.cuny.edu:26399] ERROR: If the problem persists, please restart the
[m18.gc.cuny.edu:26399] ERROR: Grid Engine PE job
[m18.gc.cuny.edu:26399] ERROR: The daemon exited unexpectedly with status 129.

On Tue, Jul 7, 2009 at 5:05 PM, Reuti <re...@staff.uni-marburg.de> wrote:

> Hi,
>
> On 07.07.2009 at 22:12, Lengyel, Florian wrote:
>
> Hi,
>> I may have overlooked something in the archives (not to mention
>> Googling)--if so I apologize, however
>> I have been unable to find info on this particular problem.
>>
>> OpenMPI+SGE tight integration works on E6600 core duo systems but not on
>> Q9550 quads.
>> Could use some troubleshooting assistance. Thanks.
>>
>> Is this what you found or your question?
>
> I'm not aware of this. What should be the cause of it?!? Do you have a link
> - was it on the SGE list?
>
> -- Reuti
>
>
>> I'm running SGE 6.0u10 on a linux cluster running OpenSuse 11.
>>
>> OpenMPI was compiled with SGE, and the required components are present:
>>
>> [flengyel@nept OPENMPI]$ ompi_info | grep gridengine
>>                  MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
>>                  MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
>>
>>
>> The parallel execution environment for OpenMPI is as follows:
>>
>> [flengyel@nept OPENMPI]$ qconf -sp ompi
>> pe_name           ompi
>> slots             999
>> user_lists        Research
>> xuser_lists       NONE
>> start_proc_args   /bin/true
>> stop_proc_args    /bin/true
>> allocation_rule   $fill_up
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min
>>
>> A trivial OpenMPI job using this pe will run on a queue for Intel E6600
>> core duo machines:
>>
>> [flengyel@nept OPENMPI]$ cat sum2.sh
>>
>> #!/bin/bash
>> #$ -S /bin/bash
>> #$ -q x86_64.q
>> #$ -N sum
>> #$ -pe ompi 4
>>
>> #$ -cwd
>>
>> export PATH=/home/nept/apps64/openmpi/bin:$PATH
>> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nept/apps64/openmpi/lib
>> . /usr/local/sge/default/common/settings.sh
>> mpirun --mca pls_gridengine_verbose 2 --prefix /home/nept/apps64/openmpi -v ./sum
>>
>> Here are the results:
>>
>> [flengyel@nept OPENMPI]$ qsub sum2.sh
>> Your job 23194 ("sum") has been submitted
>>
>> [flengyel@nept OPENMPI]$ qstat -r -u flengyel
>>
>> job-ID  prior   name  user      state  submit/start at      queue                     slots  ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>>   23194 0.25007 sum   flengyel  r      07/07/2009 14:14:40  x86_6...@m49.gc.cuny.edu  4
>>        Full jobname:     sum
>>        Master queue:     x86_6...@m49.gc.cuny.edu
>>        Requested PE:     ompi 4
>>        Granted PE:       ompi 4
>>        Hard Resources:
>>        Soft Resources:
>>        Hard requested queues: x86_64.q
>>
>>
>> [flengyel@nept OPENMPI]$ more sum.o23194
>>
>> The sum from 1 to 1000 is: 500500
>> [flengyel@nept OPENMPI]$ more sum.e23194
>> Starting server daemon at host "m49.gc.cuny.edu"
>> Starting server daemon at host "m33.gc.cuny.edu"
>> Server daemon successfully started with task id "1.m49"
>> Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m49.gc.cuny.edu ...
>> Server daemon successfully started with task id "1.m33"
>> Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m33.gc.cuny.edu ...
>> /usr/local/sge/utilbin/lx24-amd64/rsh exited with exit code 0
>> reading exit code from shepherd ...
>>
>> But the same job with the queue set to quad.q for the Q9550 quad core
>> machines has daemon trouble:
>>
>>
>> [flengyel@nept OPENMPI]$ !qstat
>> qstat -r -u flengyel
>> job-ID  prior   name  user      state  submit/start at      queue                   slots  ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>>   23196 0.25000 sum   flengyel  r      07/07/2009 14:26:21  qua...@m09.gc.cuny.edu  2
>>        Full jobname:     sum
>>        Master queue:     qua...@m09.gc.cuny.edu
>>        Requested PE:     ompi 2
>>        Granted PE:       ompi 2
>>        Hard Resources:
>>        Soft Resources:
>>        Hard requested queues: quad.q
>> [flengyel@nept OPENMPI]$ more sum.e23196
>> Starting server daemon at host "m15.gc.cuny.edu"
>> Starting server daemon at host "m09.gc.cuny.edu"
>> Server daemon successfully started with task id "1.m15"
>> Server daemon successfully started with task id "1.m09"
>> Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m15.gc.cuny.edu ...
>> /usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
>> reading exit code from shepherd ... Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m09.gc.cuny.edu ...
>> /usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
>> reading exit code from shepherd ... 129
>> [m09.gc.cuny.edu:11413] ERROR: A daemon on node m15.gc.cuny.edu failed to start as expected.
>> [m09.gc.cuny.edu:11413] ERROR: There may be more information available from
>> [m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>> [m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
>> [m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
>> [m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
>> 129
>> [m09.gc.cuny.edu:11413] ERROR: A daemon on node m09.gc.cuny.edu failed to start as expected.
>> [m09.gc.cuny.edu:11413] ERROR: There may be more information available from
>> [m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>> [m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
>> [m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
>> [m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
>> [flengyel@nept OPENMPI]$
>>
>>
>> -FL
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206057
>>
>> To unsubscribe from this discussion, e-mail:
>> [users-unsubscr...@gridengine.sunsource.net].
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
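
In case it helps with comparing runs across the two queues, here is a rough sketch of two small shell helpers for reading the job error files. Both function names are made up for this post and are not part of SGE or Open MPI. The first only assumes the "A daemon on node ... failed to" message format seen in the sum.e* files above. The second assumes the reported status follows the usual shell convention of 128 plus a signal number; whether the status that the SGE shepherd reports really follows that convention is an assumption, not something confirmed here.

```shell
#!/bin/bash

# Hypothetical helper (not part of SGE or Open MPI): print the unique hosts
# that mpirun reported as failing to start a daemon, based on the message
# format seen in the sum.e* files quoted above.
failed_daemon_hosts() {
  sed -n 's/.*ERROR: A daemon on node \([^ ]*\) failed to.*/\1/p' "$1" | sort -u
}

# Hypothetical helper: name the signal behind a status above 128, ASSUMING
# the usual shell convention (status = 128 + signal number). Whether the
# status printed by the shepherd follows that convention is an assumption.
status_to_signal() {
  if [ "$1" -gt 128 ]; then
    kill -l "$1"
  else
    echo "not a signal status"
  fi
}
```

After sourcing the file, `failed_daemon_hosts sum.e23196` should list the hosts named in the failure messages of that file, and `status_to_signal 129` prints the signal name that status 129 would map to under the 128 + N convention.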