The OpenMPI build I am running was configured without any SGE-related options, 
and it was built on a system that did not have SGE installed, so I presume 
SGE support is absent.

My hope is that OpenMPI will not attempt to use SGE in any way. But perhaps it 
is trying to. 

Yes, I did supply a machinefile on my own.  The submitted script builds it on 
the fly by parsing the PE_HOSTFILE, and I leave the resulting file in place 
afterwards.  Its contents appear to be correct, i.e. it includes those nodes 
(and only those nodes) allocated to the job.
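
For reference, the conversion is roughly along these lines. This is a minimal 
sketch, not the exact script: the helper name make_machinefile is made up for 
illustration, it assumes each PE_HOSTFILE line starts with the host name 
followed by the slot count, and it writes Open MPI's "host slots=N" hostfile 
syntax.

```shell
#!/bin/sh
# Sketch: build an Open MPI machinefile from SGE's $PE_HOSTFILE.
# Assumes each PE_HOSTFILE line looks like:
#   <host> <slots> <queue> <processor range>
# make_machinefile is a hypothetical helper name, not part of SGE or Open MPI.
make_machinefile() {
    pe_hostfile=$1
    out=$2
    # Field 1 is the host name, field 2 the number of slots granted on it.
    awk 'NF >= 2 { print $1 " slots=" $2 }' "$pe_hostfile" > "$out"
}

# Inside a job script SGE sets PE_HOSTFILE; do nothing when it is unset.
if [ -n "${PE_HOSTFILE:-}" ]; then
    make_machinefile "$PE_HOSTFILE" mpihosts.dat
fi
```

The output name mpihosts.dat matches the --machinefile argument in the mpirun 
command quoted below.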



-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Reuti
Sent: Tuesday, September 13, 2011 4:27 PM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] Problem running under SGE

Am 13.09.2011 um 23:18 schrieb Blosch, Edwin L:

> I'm able to run this command below from an interactive shell window:
>  
> <path>/bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent 
> /usr/bin/rsh -x MPI_ENVIRONMENT=1 ./test_setup
>  
> but it does not work if I put it into a shell script and 'qsub' that script 
> to SGE.  I get the message shown at the bottom of this post. 
>  
> I've tried everything I can think of.  I would welcome any hints on how to 
> proceed. 
>  
> For what it's worth, this OpenMPI is 1.4.3 and I built it on another system.  
> I am setting and exporting OPAL_PREFIX and as I said, all works fine 
> interactively just not in batch.  It was built with -disable-shared and I 
> don't see any shared libs under openmpi/lib, and I've done 'ldd' from within 
> the script, on both the application executable and on the orterun command; no 
> unresolved shared libraries.  So I don't think the error message hinting at 
> LD_LIBRARY_PATH issues is pointing me in the right direction.
>  
> Thanks for any guidance,
>  
> Ed
>  

Oh, I missed this:


> error: executing task of job 139362 failed: execution daemon on host "f8312" 
> didn't accept task

Did you supply a machinefile on your own? In a proper SGE integration, the job 
runs in a parallel environment (PE). Did you define and request one? The error 
looks like the job was started in a PE but tried to access a node not granted 
to the actual job.

-- Reuti
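
To check from inside the job script whether the machinefile names a node that 
SGE did not grant to the job, the two files can be compared. A minimal sketch, 
assuming the first whitespace-separated field of each line in both files is 
the host name; check_hosts is a hypothetical helper name, and names must match 
exactly, so a short name compared against a fully qualified one would be 
reported as not granted.

```shell
#!/bin/sh
# Sketch: list hosts that appear in the machinefile but not in $PE_HOSTFILE.
# check_hosts is a hypothetical helper name, not part of SGE or Open MPI.
# Prints each offending host and returns non-zero if any is found.
check_hosts() {
    machinefile=$1
    pe_hostfile=$2
    # While reading the first file (the PE hostfile), record granted hosts;
    # while reading the second (the machinefile), report hosts not recorded.
    awk 'FILENAME == ARGV[1] { granted[$1] = 1; next }
         NF > 0 && !($1 in granted) { print $1; bad = 1 }
         END { exit bad + 0 }' "$pe_hostfile" "$machinefile"
}

# Example use in a job script (mpihosts.dat is the machinefile from this thread):
#   check_hosts mpihosts.dat "$PE_HOSTFILE" || echo "machinefile has ungranted hosts"
```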


> --------------------------------------------------------------------------
> A daemon (pid 2818) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>  
> There may be more information reported by the environment (see above).
>  
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>  
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

