On 14.09.2011 at 00:25, Blosch, Edwin L wrote:

> Your comment guided me in the right direction, Reuti. And overlapped with
> your guidance, Ralph.
>
> It works: if I add this flag then it runs
> --mca plm_rsh_disable_qrsh
>
> Thank you both for the explanations.
>
> I had built OpenMPI on another system, as I said, it did not have SGE and
> thus I did not give --without-sge (nor did I give --with-sge). In the future
> for building 1.4.3 I will just add --without-sge and presumably I won't run
> into the qrsh issue.

Do I understand correctly that you don't want a tight integration with correct accounting, but prefer to start the slave tasks by rsh/ssh on your own? This can lead to oversubscribed machines if some users' scripts do not honor the machinefile correctly. A tight integration (with ssh/rsh disabled inside the cluster) is the setup I usually prefer.

-- Reuti

> Thanks again
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Reuti
> Sent: Tuesday, September 13, 2011 4:27 PM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] Problem running under SGE
>
> On 13.09.2011 at 23:18, Blosch, Edwin L wrote:
>
>> I'm able to run this command below from an interactive shell window:
>>
>> <path>/bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent
>> /usr/bin/rsh -x MPI_ENVIRONMENT=1 ./test_setup
>>
>> but it does not work if I put it into a shell script and 'qsub' that script
>> to SGE. I get the message shown at the bottom of this post.
>>
>> I've tried everything I can think of. I would welcome any hints on how to
>> proceed.
>>
>> For what it's worth, this OpenMPI is 1.4.3 and I built it on another system.
>> I am setting and exporting OPAL_PREFIX and as I said, all works fine
>> interactively just not in batch. It was built with -disable-shared and I
>> don't see any shared libs under openmpi/lib, and I've done 'ldd' from within
>> the script, on both the application executable and on the orterun command;
>> no unresolved shared libraries. So I don't think the error message hinting
>> at LD_LIBRARY_PATH issues is pointing me in the right direction.
>>
>> Thanks for any guidance,
>>
>> Ed
>>
>
> Oh, I missed this:
>
>> error: executing task of job 139362 failed: execution daemon on host "f8312"
>> didn't accept task
>
> did you supply a machinefile on your own? In a proper SGE integration it's
> running in a parallel environment. You defined and requested one?
> The error looks like it was started in a PE, but tried to access a node not
> granted for the actual job
>
> -- Reuti
>
>> --------------------------------------------------------------------------
>> A daemon (pid 2818) died unexpectedly with status 1 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
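
As a rough illustration of the diagnosis quoted above (a job started in a PE that tries to reach a node not granted to it), one could compare a hand-written machinefile against the host list SGE provides in $PE_HOSTFILE before calling mpirun. This is only a sketch under assumptions: the function name ungranted_hosts is made up for this example, the first field of every line in both files is taken to be the host name, and both files are assumed to spell host names the same way.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI or SGE): print every host that
# appears in a hand-written machinefile but not in the SGE-granted host list.
# Assumes the first whitespace-separated field of each line is the host name
# and that both files spell host names the same way (short vs. fully qualified).
ungranted_hosts() {
    machinefile=$1
    pe_hostfile=$2
    awk -v granted_file="$pe_hostfile" '
        BEGIN {
            # Remember every host SGE granted to this job.
            while ((getline line < granted_file) > 0) {
                split(line, field, " ")
                if (field[1] != "") granted[field[1]] = 1
            }
            close(granted_file)
        }
        # Skip blank lines and comments in the machinefile.
        NF == 0 || $1 ~ /^#/ { next }
        # Report hosts that SGE did not grant.
        !($1 in granted) { print $1 }
    ' "$machinefile" | sort -u
}

# Example use inside a job script submitted with a parallel environment:
#   bad=$(ungranted_hosts mpihosts.dat "$PE_HOSTFILE")
#   [ -z "$bad" ] || { echo "not granted by SGE: $bad" >&2; exit 1; }
```

With a tight integration this check should not be needed, since an Open MPI built with SGE support can take the granted hosts from the parallel environment itself and no --machinefile has to be given.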