We don't budget computer hours, so I don't think we would use accounting, 
although I'm not sure I know what this capability is all about. I also don't 
care about launch speed; a few minutes means nothing when the job will take 
days to run. Finally, I have a highly portable strategy of wrapping the mpirun 
command in a shell script that figures out how many processes are allocated 
to the job and explicitly tells Open MPI how many hosts to use and which ones. 
I can adapt that script in very minor ways to support any job-queueing system, 
past, present, or future, and my invocation of the mpirun command remains the 
same and should always work.
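
Roughly, the wrapper looks something like this (a simplified, illustrative 
sketch, not the actual script; the machinefile name and the exact mpirun 
flags are placeholders):

#!/bin/sh
# Each queueing system publishes the granted hosts in its own file:
# SGE writes $PE_HOSTFILE, PBS/Torque writes $PBS_NODEFILE, and so on.
if [ -n "$PE_HOSTFILE" ]; then
    # SGE format: "hostname slots queue processor-range" per line;
    # expand it to one line per granted slot.
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$PE_HOSTFILE" > mpihosts.dat
elif [ -n "$PBS_NODEFILE" ]; then
    # PBS already lists one line per granted slot.
    cp "$PBS_NODEFILE" mpihosts.dat
fi

NPROCS=$(wc -l < mpihosts.dat)

# Tell Open MPI explicitly which hosts to use and how many processes to start.
exec <path>/bin/mpirun --machinefile mpihosts.dat -np "$NPROCS" \
     -x MPI_ENVIRONMENT=1 "$@"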

For these reasons I have preferred the rsh/ssh launcher, the less intelligent 
the better. I'm sure there are benefits to tight integration; as you said, 
it can keep users from accidentally or intentionally using nodes outside 
their allocation. It's just not an issue for us.
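
For reference, the kind of launch line that keeps the plain rsh/ssh launcher 
even when SGE is present would look roughly like this (assembled from the 
flags mentioned elsewhere in this thread; the value of 1 for the boolean 
parameter is my assumption):

<path>/bin/mpirun --mca plm_rsh_disable_qrsh 1 \
    -mca plm_rsh_agent /usr/bin/rsh \
    --machinefile mpihosts.dat -np 16 -x MPI_ENVIRONMENT=1 ./test_setup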

I will check the FAQ to see if I can learn more about the benefits of tight 
integration with a job-queueing system.


Thank you again for the help.


-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Reuti
Sent: Tuesday, September 13, 2011 5:36 PM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: Problem running under SGE

On 14.09.2011 at 00:25, Blosch, Edwin L wrote:

> Your comment guided me in the right direction, Reuti. And overlapped with 
> your guidance, Ralph.
> 
> It works: if I add this flag then it runs
> --mca plm_rsh_disable_qrsh
> 
> Thank you both for the explanations.  
> 
> I had built OpenMPI on another system which, as I said, did not have SGE, 
> and thus I did not give --without-sge (nor did I give --with-sge). In the 
> future, when building 1.4.3, I will just add --without-sge and presumably I 
> won't run into the qrsh issue.

Do I understand this correctly: you don't want a tight integration with 
correct accounting, but prefer to launch the slave tasks by rsh/ssh on your 
own? This can lead to oversubscribed machines if some users' scripts don't 
honor the machinefile correctly.

A tight integration (with ssh/rsh disabled inside the cluster) is the setup I 
usually prefer.

-- Reuti


> Thanks again
> 
> 
> 
> 
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Reuti
> Sent: Tuesday, September 13, 2011 4:27 PM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] Problem running under SGE
> 
> On 13.09.2011 at 23:18, Blosch, Edwin L wrote:
> 
>> I'm able to run this command below from an interactive shell window:
>> 
>> <path>/bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent 
>> /usr/bin/rsh -x MPI_ENVIRONMENT=1 ./test_setup
>> 
>> but it does not work if I put it into a shell script and 'qsub' that script 
>> to SGE.  I get the message shown at the bottom of this post. 
>> 
>> I've tried everything I can think of.  I would welcome any hints on how to 
>> proceed. 
>> 
>> For what it's worth, this OpenMPI is 1.4.3 and I built it on another system. 
>> I am setting and exporting OPAL_PREFIX, and as I said, all works fine 
>> interactively, just not in batch. It was built with --disable-shared and I 
>> don't see any shared libs under openmpi/lib, and I've run 'ldd' from within 
>> the script on both the application executable and the orterun command; 
>> no unresolved shared libraries. So I don't think the error message hinting 
>> at LD_LIBRARY_PATH issues is pointing me in the right direction.
>> 
>> Thanks for any guidance,
>> 
>> Ed
>> 
> 
> Oh, I missed this:
> 
> 
>> error: executing task of job 139362 failed: execution daemon on host "f8312" 
>> didn't accept task
> 
> Did you supply a machinefile on your own? In a proper SGE integration the 
> job runs in a parallel environment. Did you define and request one? The 
> error looks like the job was started in a PE but tried to access a node 
> that was not granted to the actual job.
> 
> -- Reuti
> 
> 
>> --------------------------------------------------------------------------
>> A daemon (pid 2818) died unexpectedly with status 1 while attempting
>> to launch so we are aborting.
>> 
>> There may be more information reported by the environment (see above).
>> 
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>> 
> 
> 


