Hi,
On 07.07.2009 at 22:12, Lengyel, Florian wrote:
Hi,
I may have overlooked something in the archives (not to mention
Googling); if so, I apologize, but I have been unable to find
information on this particular problem.
OpenMPI+SGE tight integration works on E6600 core duo systems but
not on Q9550 quads.
Could use some troubleshooting assistance. Thanks.
Is this what you found, or is it your question?
I'm not aware of this. What would be the cause of it? Do you have
a link? Was it on the SGE list?
-- Reuti
I'm running SGE 6.0u10 on a Linux cluster running openSUSE 11.
OpenMPI was compiled with SGE support, and the required components are
present:
[flengyel@nept OPENMPI]$ ompi_info | grep gridengine
MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
The parallel execution environment for OpenMPI is as follows:
[flengyel@nept OPENMPI]$ qconf -sp ompi
pe_name ompi
slots 999
user_lists Research
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
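As a quick sanity check on a PE definition like the one above, here is a minimal sketch that reads `qconf -sp` output on stdin and reports whether control_slaves is TRUE and job_is_first_task is FALSE, the values commonly recommended for OpenMPI tight integration. The function name check_pe is hypothetical and not part of SGE or OpenMPI; it only inspects the two fields and says nothing about whether the daemons will actually start.

```shell
# check_pe: hypothetical helper, not an SGE or OpenMPI command.
# Reads `qconf -sp <pe>` output on stdin and checks two fields.
# Assumption: control_slaves TRUE and job_is_first_task FALSE are the
# values wanted for tight integration, as in the PE shown above.
check_pe() {
    awk '
        $1 == "control_slaves"    { cs = $2 }
        $1 == "job_is_first_task" { jf = $2 }
        END {
            if (cs == "TRUE" && jf == "FALSE") { print "ok"; exit 0 }
            print "check control_slaves=" cs " job_is_first_task=" jf
            exit 1
        }
    '
}
```

It could be run as `qconf -sp ompi | check_pe`. The PE above passes this check, so the failure on quad.q presumably lies elsewhere.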
A trivial OpenMPI job using this PE runs on a queue for the Intel
E6600 core duo machines:
[flengyel@nept OPENMPI]$ cat sum2.sh
#!/bin/bash
#$ -S /bin/bash
#$ -q x86_64.q
#$ -N sum
#$ -pe ompi 4
#$ -cwd
export PATH=/home/nept/apps64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nept/apps64/openmpi/lib
. /usr/local/sge/default/common/settings.sh
mpirun --mca pls_gridengine_verbose 2 --prefix /home/nept/apps64/openmpi -v ./sum
Here are the results:
[flengyel@nept OPENMPI]$ qsub sum2.sh
Your job 23194 ("sum") has been submitted
[flengyel@nept OPENMPI]$ qstat -r -u flengyel
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
23194 0.25007 sum flengyel r 07/07/2009 14:14:40 x86_6...@m49.gc.cuny.edu 4
Full jobname: sum
Master queue: x86_6...@m49.gc.cuny.edu
Requested PE: ompi 4
Granted PE: ompi 4
Hard Resources:
Soft Resources:
Hard requested queues: x86_64.q
[flengyel@nept OPENMPI]$ more sum.o23194
The sum from 1 to 1000 is: 500500
[flengyel@nept OPENMPI]$ more sum.e23194
Starting server daemon at host "m49.gc.cuny.edu"
Starting server daemon at host "m33.gc.cuny.edu"
Server daemon successfully started with task id "1.m49"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m49.gc.cuny.edu ...
Server daemon successfully started with task id "1.m33"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m33.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ...
But the same job, with the queue set to quad.q for the Q9550 quad-core
machines, has daemon trouble:
[flengyel@nept OPENMPI]$ !qstat
qstat -r -u flengyel
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
23196 0.25000 sum flengyel r 07/07/2009 14:26:21 qua...@m09.gc.cuny.edu 2
Full jobname: sum
Master queue: qua...@m09.gc.cuny.edu
Requested PE: ompi 2
Granted PE: ompi 2
Hard Resources:
Soft Resources:
Hard requested queues: quad.q
[flengyel@nept OPENMPI]$ more sum.e23196
Starting server daemon at host "m15.gc.cuny.edu"
Starting server daemon at host "m09.gc.cuny.edu"
Server daemon successfully started with task id "1.m15"
Server daemon successfully started with task id "1.m09"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m15.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m09.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... 129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m15.gc.cuny.edu failed to start as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more information available from
[m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
[m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
[m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m09.gc.cuny.edu failed to start as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more information available from
[m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
[m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
[m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
[flengyel@nept OPENMPI]$
-FL
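
For reading the numbers in the log above: by the usual shell convention, an exit status above 128 means the process was terminated by signal (status - 128). The sketch below applies that convention. The function name describe_status is hypothetical. Whether the status 129 reported for the daemon really follows this convention is an assumption, not something the log confirms; the log itself only states that rsh exited on signal 13 (PIPE).

```shell
# describe_status: hypothetical helper that applies the common shell
# convention "status > 128 means killed by signal (status - 128)".
# Assumption: the reported status follows that convention.
describe_status() {
    code=$1
    if [ "$code" -gt 128 ]; then
        signo=$((code - 128))
        # `kill -l N` prints the name of signal number N (without "SIG")
        echo "signal $signo ($(kill -l "$signo"))"
    else
        echo "exit code $code"
    fi
}

describe_status 129   # on Linux: signal 1 (HUP)
```

Under that reading, 129 would correspond to a hangup (HUP), which would be consistent with the rsh session to the daemon being torn down, but that is only a hint for where to look.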
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206057