Hi,

I may have overlooked something in the archives (or on Google). If so, I apologize, but I have been unable to find information on this particular problem.

OpenMPI+SGE tight integration works on E6600 core duo systems but not on Q9550 quads. I could use some troubleshooting assistance. Thanks.

I'm running SGE 6.0u10 on a Linux cluster running openSUSE 11. OpenMPI was compiled with SGE support, and the required components are present:

[flengyel@nept OPENMPI]$ ompi_info | grep gridengine
MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)

The parallel execution environment for OpenMPI is as follows:

[flengyel@nept OPENMPI]$ qconf -sp ompi
pe_name           ompi
slots             999
user_lists        Research
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

A trivial OpenMPI job using this PE will run on a queue for Intel E6600 core duo machines:

[flengyel@nept OPENMPI]$ cat sum2.sh
#!/bin/bash
#$ -S /bin/bash
#$ -q x86_64.q
#$ -N sum
#$ -pe ompi 4
#$ -cwd
export PATH=/home/nept/apps64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nept/apps64/openmpi/lib
. /usr/local/sge/default/common/settings.sh
mpirun --mca pls_gridengine_verbose 2 --prefix /home/nept/apps64/openmpi -v ./sum

Here are the results:

[flengyel@nept OPENMPI]$ qsub sum2.sh
Your job 23194 ("sum") has been submitted

[flengyel@nept OPENMPI]$ qstat -r -u flengyel
job-ID  prior   name  user      state  submit/start at      queue                     slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
23194   0.25007 sum   flengyel  r      07/07/2009 14:14:40  x86_6...@m49.gc.cuny.edu  4
        Full jobname:          sum
        Master queue:          x86_6...@m49.gc.cuny.edu
        Requested PE:          ompi 4
        Granted PE:            ompi 4
        Hard Resources:
        Soft Resources:
        Hard requested queues: x86_64.q

[flengyel@nept OPENMPI]$ more sum.o23194
The sum from 1 to 1000 is: 500500

[flengyel@nept OPENMPI]$ more sum.e23194
Starting server daemon at host "m49.gc.cuny.edu"
Starting server daemon at host "m33.gc.cuny.edu"
Server daemon successfully started with task id "1.m49"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m49.gc.cuny.edu ...
Server daemon successfully started with task id "1.m33"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m33.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ...

But the same job with the queue set to quad.q for the Q9550 quad-core machines has daemon trouble:

[flengyel@nept OPENMPI]$ !qstat
qstat -r -u flengyel
job-ID  prior   name  user      state  submit/start at      queue                   slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
23196   0.25000 sum   flengyel  r      07/07/2009 14:26:21  qua...@m09.gc.cuny.edu  2
        Full jobname:          sum
        Master queue:          qua...@m09.gc.cuny.edu
        Requested PE:          ompi 2
        Granted PE:            ompi 2
        Hard Resources:
        Soft Resources:
        Hard requested queues: quad.q

[flengyel@nept OPENMPI]$ more sum.e23196
Starting server daemon at host "m15.gc.cuny.edu"
Starting server daemon at host "m09.gc.cuny.edu"
Server daemon successfully started with task id "1.m15"
Server daemon successfully started with task id "1.m09"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m15.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ...
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m09.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... 129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m15.gc.cuny.edu failed to start as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more information available from
[m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
[m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
[m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m09.gc.cuny.edu failed to start as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more information available from
[m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
[m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
[m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
[flengyel@nept OPENMPI]$

-FL
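
P.S. A note on reading the status 129 in the failing run: under the usual shell convention, an exit status above 128 means the process was terminated by signal (status - 128), which would make 129 signal 1 (SIGHUP on Linux). Whether the status reported by the daemon launcher follows that convention is an assumption here, not something the logs confirm. The helper below (decode_status is a hypothetical name, not part of SGE or OpenMPI) is a minimal sketch of that decoding:

```shell
#!/bin/sh
# decode_status: hypothetical helper, not part of SGE or OpenMPI.
# Assumes the common shell convention that an exit status above 128
# means "terminated by signal (status - 128)".
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        signum=$((status - 128))
        # kill -l <number> prints the name of that signal
        echo "status $status: signal $signum ($(kill -l "$signum"))"
    else
        echo "status $status: normal exit"
    fi
}

decode_status 129   # the daemon status reported on quad.q
decode_status 0     # the rsh exit code reported on x86_64.q
```

If the convention holds, the daemons on the quad nodes were hung up on rather than exiting by themselves, which would be consistent with the rsh sessions dying on SIGPIPE.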