On Sep 14, 2011, at 12:29 AM, Ralph Castain wrote:

> 
> On Sep 13, 2011, at 4:25 PM, Reuti wrote:
> 
>> On Sep 13, 2011, at 11:54 PM, Blosch, Edwin L wrote:
>> 
>>> The version of Open MPI I am running was built without any SGE-related 
>>> options on the configure command line, and it was built on a system that 
>>> did not have SGE, so I would presume support is absent.
>> 
>> Whether SGE is installed on the build machine is not relevant. In contrast 
>> to Torque (and, I think, also SLURM), nothing is compiled into Open MPI that 
>> needs a library from the designated queuing system to support it. In the 
>> case of SGE, it just checks for the existence of some environment variables 
>> and calls `qrsh -inherit ...`. Further startup is handled by SGE via the 
>> defined qrsh_daemon/qrsh_command.
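>> 
>> (If I remember correctly, these correspond to the rsh_command/rsh_daemon 
>> entries in the cluster configuration shown by `qconf -sconf`, which qrsh 
>> uses under the hood.)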
>> 
>> So, to check it you can issue:
>> 
>> ompi_info | grep grid
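>> 
>> If the gridengine component was built in, you should see something like 
>> the following (just an example from a 1.4.x build; the exact version 
>> string will differ):
>> 
>>   MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.3)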
> 
> Just an FYI: that could still yield no output without meaning that qrsh won't 
> be used by the launcher. The rsh launcher has the qrsh command embedded within 
> it, so it won't show up in ompi_info.
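> 
> You can see which agent the rsh launcher will use with e.g.:
> 
>   ompi_info --param plm rsh | grep agent
> 
> (plm_rsh_agent is the same parameter that is overridden on the mpirun 
> command line further below).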

Got it - thx. - Reuti


>> Any output?
>> 
>> 
>>> My hope is that OpenMPI will not attempt to use SGE in any way. But perhaps 
>>> it is trying to. 
>>> 
>>> Yes, I did supply a machinefile on my own.  It is formed on the fly within 
>>> the submitted script by parsing the PE_HOSTFILE, and I leave the
>> 
>> Parsing the PE_HOSTFILE and preparing it in a format suitable for the actual 
>> parallel library is usually done in start_proc_args, so that it happens once 
>> for all users and applications using this parallel library. With a tight 
>> integration it can be set to NONE though.
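>> 
>> As an illustration (a minimal sketch only; the machinefile name under 
>> $TMPDIR is just an assumption), such a start_proc_args script could turn 
>> the PE_HOSTFILE, whose lines read "host slots queue processor_range", 
>> into a plain machinefile:
>> 
>>   #!/bin/sh
>>   # write one "host slots=N" line per PE_HOSTFILE entry
>>   awk '{ print $1, "slots=" $2 }' "$PE_HOSTFILE" > "$TMPDIR/machines"
>> 
>> The job script can then pass $TMPDIR/machines to mpirun's --machinefile.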
>> 
>> 
>>> resulting file lying around, and the result appears to be correct, i.e. it 
>>> includes those nodes (and only those nodes) allocated to the job.
>> 
>> Well, even without compiling --with-sge you could achieve a so-called tight 
>> integration, and then confuse the startup. What does your PE look like? 
>> Depending on whether Open MPI will start a task on the master node of the 
>> job by a local `qrsh -inherit ...`, job_is_first_task needs to be set to 
>> FALSE (this allows one `qrsh -inherit ...` call to be made to the local 
>> node). But if all is fine, the job script is already the first task and 
>> TRUE should work.
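>> 
>> For reference, a tightly integrated PE for Open MPI typically looks like 
>> the following (a sketch; the PE name "orte" and the slot count are just 
>> placeholders):
>> 
>>   $ qconf -sp orte
>>   pe_name            orte
>>   slots              999
>>   user_lists         NONE
>>   xuser_lists        NONE
>>   start_proc_args    /bin/true
>>   stop_proc_args     /bin/true
>>   allocation_rule    $fill_up
>>   control_slaves     TRUE
>>   job_is_first_task  TRUE
>>   urgency_slots      min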
>> 
>> -- Reuti
>> 
>> 
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>>> Behalf Of Reuti
>>> Sent: Tuesday, September 13, 2011 4:27 PM
>>> To: Open MPI Users
>>> Subject: EXTERNAL: Re: [OMPI users] Problem running under SGE
>>> 
>>> On Sep 13, 2011, at 11:18 PM, Blosch, Edwin L wrote:
>>> 
>>>> I'm able to run this command below from an interactive shell window:
>>>> 
>>>> <path>/bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent 
>>>> /usr/bin/rsh -x MPI_ENVIRONMENT=1 ./test_setup
>>>> 
>>>> but it does not work if I put it into a shell script and 'qsub' that 
>>>> script to SGE.  I get the message shown at the bottom of this post. 
>>>> 
>>>> I've tried everything I can think of.  I would welcome any hints on how to 
>>>> proceed. 
>>>> 
>>>> For what it's worth, this Open MPI is 1.4.3 and I built it on another 
>>>> system.  I am setting and exporting OPAL_PREFIX, and as I said, all works 
>>>> fine interactively, just not in batch.  It was built with --disable-shared 
>>>> and I don't see any shared libs under openmpi/lib, and I've done 'ldd' 
>>>> from within the script, on both the application executable and on the 
>>>> orterun command; no unresolved shared libraries.  So I don't think the 
>>>> error message hinting at LD_LIBRARY_PATH issues is pointing me in the 
>>>> right direction.
>>>> 
>>>> Thanks for any guidance,
>>>> 
>>>> Ed
>>>> 
>>> 
>>> Oh, I missed this:
>>> 
>>> 
>>>> error: executing task of job 139362 failed: execution daemon on host 
>>>> "f8312" didn't accept task
>>> 
>>> Did you supply a machinefile on your own? In a proper SGE integration the 
>>> job runs inside a parallel environment. Did you define and request one? 
>>> The error looks like the job was started in a PE but tried to access a 
>>> node that was not granted to the actual job.
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid 2818) died unexpectedly with status 1 while attempting
>>>> to launch so we are aborting.
>>>> 
>>>> There may be more information reported by the environment (see above).
>>>> 
>>>> This may be because the daemon was unable to find all the needed shared
>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>> location of the shared libraries on the remote nodes and this will
>>>> automatically be forwarded to the remote nodes.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> mpirun: clean termination accomplished