This was addressed to the Open MPI list; on the SGE list you suggested
changing the PE allocation rule from $fill_up to $pe_slots. The PE is
now:

[flengyel@nept OPENMPI]$ qconf -sp ompi
pe_name           ompi
slots             999
user_lists        Research
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $pe_slots
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
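
For context: $fill_up lets SGE spread a job's slots across several
hosts, while $pe_slots confines all of them to one host. The change
amounts to an editor round trip with qconf; a minimal sketch (the
comment lines are mine):

[flengyel@nept OPENMPI]$ qconf -mp ompi
# In the editor that opens, change
#   allocation_rule   $fill_up
# to
#   allocation_rule   $pe_slots

Note that even with $pe_slots, tight integration still launches the
daemon through the SGE rsh wrapper on the (single) host, so a broken
rshd path on the node fails either way.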

but the result is the same:

[flengyel@nept OPENMPI]$ tail -f sum.e23310
Starting server daemon at host "m18.gc.cuny.edu"
Server daemon successfully started with task id "1.m18"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host
m18.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... 129
[m18.gc.cuny.edu:26399] ERROR: A daemon on node m18.gc.cuny.edu failed to start as expected.
[m18.gc.cuny.edu:26399] ERROR: There may be more information available from
[m18.gc.cuny.edu:26399] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m18.gc.cuny.edu:26399] ERROR: If the problem persists, please restart the
[m18.gc.cuny.edu:26399] ERROR: Grid Engine PE job
[m18.gc.cuny.edu:26399] ERROR: The daemon exited unexpectedly with status 129.
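
Signal 13 (SIGPIPE) means the rsh client lost its pipe to the rshd
that the SGE shepherd spawned for the session, so the SGE rsh
transport is failing before the Open MPI daemon ever starts. One way
to exercise that launch path without mpirun is a plain PE job that
calls qrsh -inherit itself, the same mechanism the gridengine pls
module uses. A minimal sketch, untested here (script and job name are
mine):

#!/bin/bash
#$ -S /bin/bash
#$ -q quad.q
#$ -N rshtest
#$ -pe ompi 2
#$ -cwd
. /usr/local/sge/default/common/settings.sh
# $PE_HOSTFILE has one "host slots queue ..." line per granted host;
# open a tight-integration rsh session to each of them in turn.
for host in $(cut -d' ' -f1 "$PE_HOSTFILE" | sort -u); do
    echo "qrsh -inherit to $host:"
    qrsh -inherit "$host" hostname || echo "qrsh -inherit to $host failed"
done

If this also dies with SIGPIPE, comparing qconf -sconf output (the
rsh_command and rsh_daemon entries) between a working E6600 node and
a failing Q9550 node should narrow things down.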



On Tue, Jul 7, 2009 at 5:05 PM, Reuti <re...@staff.uni-marburg.de> wrote:

> Hi,
>
> On 07.07.2009, at 22:12, Lengyel, Florian wrote:
>
>> Hi,
>> I may have overlooked something in the archives (not to mention
>> Googling); if so, I apologize. However, I have been unable to find
>> info on this particular problem.
>>
>> OpenMPI+SGE tight integration works on E6600 core duo systems but not on
>> Q9550 quads.
>> Could use some troubleshooting assistance. Thanks.
>>
> Is this what you found, or your question?
>
> I'm not aware of this. What could be the cause of it? Do you have a
> link - was it on the SGE list?
>
> -- Reuti
>
>
>> I'm running SGE 6.0u10 on a Linux cluster running openSUSE 11.
>>
>> OpenMPI was compiled with SGE support, and the required components are
>> present:
>>
>> [flengyel@nept OPENMPI]$ ompi_info | grep gridengine
>>                 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
>>                 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
>>
>>
>> The parallel execution environment for OpenMPI is as follows:
>>
>> [flengyel@nept OPENMPI]$ qconf -sp ompi
>> pe_name           ompi
>> slots             999
>> user_lists        Research
>> xuser_lists       NONE
>> start_proc_args   /bin/true
>> stop_proc_args    /bin/true
>> allocation_rule   $fill_up
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min
>>
>> A trivial OpenMPI job using this pe will run on a queue for Intel E6600
>> core duo machines:
>>
>> [flengyel@nept OPENMPI]$ cat sum2.sh
>>
>> #!/bin/bash
>> #$ -S /bin/bash
>> #$ -q x86_64.q
>> #$ -N sum
>> #$ -pe ompi 4
>>
>> #$ -cwd
>>
>> export PATH=/home/nept/apps64/openmpi/bin:$PATH
>> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nept/apps64/openmpi/lib
>> . /usr/local/sge/default/common/settings.sh
>> mpirun --mca pls_gridengine_verbose 2  --prefix /home/nept/apps64/openmpi
>> -v  ./sum
>>
>> Here are the results:
>>
>> [flengyel@nept OPENMPI]$ qsub sum2.sh
>> Your job 23194 ("sum") has been submitted
>>
>> [flengyel@nept OPENMPI]$ qstat -r -u flengyel
>>
>> job-ID  prior   name       user         state submit/start at     queue                      slots ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>>  23194 0.25007 sum        flengyel     r     07/07/2009 14:14:40 x86_64.q@m49.gc.cuny.edu       4
>>       Full jobname:     sum
>>       Master queue:     x86_64.q@m49.gc.cuny.edu
>>       Requested PE:     ompi 4
>>       Granted PE:       ompi 4
>>       Hard Resources:
>>       Soft Resources:
>>       Hard requested queues: x86_64.q
>>
>>
>> [flengyel@nept OPENMPI]$ more sum.o23194
>>
>> The sum from 1 to 1000 is: 500500
>> [flengyel@nept OPENMPI]$ more sum.e23194
>> Starting server daemon at host "m49.gc.cuny.edu"
>> Starting server daemon at host "m33.gc.cuny.edu"
>> Server daemon successfully started with task id "1.m49"
>> Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host
>> m49.gc.cuny.edu ...
>> Server daemon successfully started with task id "1.m33"
>> Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host
>> m33.gc.cuny.edu ...
>> /usr/local/sge/utilbin/lx24-amd64/rsh exited with exit code 0
>> reading exit code from shepherd ...
>>
>> But the same job with the queue set to quad.q for the Q9550 quad core
>> machines
>> has daemon trouble:
>>
>>
>> [flengyel@nept OPENMPI]$ !qstat
>> qstat -r -u flengyel
>> job-ID  prior   name       user         state submit/start at     queue                      slots ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>>  23196 0.25000 sum        flengyel     r     07/07/2009 14:26:21 quad.q@m09.gc.cuny.edu         2
>>       Full jobname:     sum
>>       Master queue:     quad.q@m09.gc.cuny.edu
>>       Requested PE:     ompi 2
>>       Granted PE:       ompi 2
>>       Hard Resources:
>>       Soft Resources:
>>       Hard requested queues: quad.q
>> [flengyel@nept OPENMPI]$ more sum.e23196
>> Starting server daemon at host "m15.gc.cuny.edu"
>> Starting server daemon at host "m09.gc.cuny.edu"
>> Server daemon successfully started with task id "1.m15"
>> Server daemon successfully started with task id "1.m09"
>> Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host
>> m15.gc.cuny.edu ...
>> /usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
>> reading exit code from shepherd ... Establishing
>> /usr/local/sge/utilbin/lx24-amd64/rsh session to host m09.gc.cuny.edu ...
>> /usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
>> reading exit code from shepherd ... 129
>> [m09.gc.cuny.edu:11413] ERROR: A daemon on node m15.gc.cuny.edu failed to start as expected.
>> [m09.gc.cuny.edu:11413] ERROR: There may be more information available from
>> [m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>> [m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
>> [m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
>> [m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
>> 129
>> [m09.gc.cuny.edu:11413] ERROR: A daemon on node m09.gc.cuny.edu failed to start as expected.
>> [m09.gc.cuny.edu:11413] ERROR: There may be more information available from
>> [m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>> [m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
>> [m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
>> [m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
>> [flengyel@nept OPENMPI]$
>>
>>
>> -FL
>>
>
