On 14.09.2011 at 00:29, Ralph Castain wrote:

> On Sep 13, 2011, at 4:25 PM, Reuti wrote:
>
>> On 13.09.2011 at 23:54, Blosch, Edwin L wrote:
>>
>>> This version of Open MPI I am running was built without any guidance
>>> regarding SGE in the configure command, but it was built on a system that
>>> did not have SGE, so I presume support is absent.
>>
>> Whether SGE is installed on the build machine is not relevant. In contrast
>> to Torque (and, I think, also SLURM), nothing is compiled into Open MPI
>> that needs a library from the designated queuing system to support it. In
>> the case of SGE, it just checks for the existence of some environment
>> variables and calls `qrsh -inherit ...`. Further startup is handled by SGE
>> via the defined qrsh_daemon/qrsh_command.
>>
>> So, to check it you can issue:
>>
>> ompi_info | grep grid
>
> Just an FYI: that could still yield no output and not mean that qrsh won't
> be used by the launcher. The rsh launcher has the qrsh command embedded
> within it, so it won't show up in ompi_info.
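The check Reuti and Ralph discuss can be run from inside a job script; this is a hedged sketch, and the environment-variable list is an assumption based on what SGE typically exports, not something stated in the thread.

```shell
# Check whether a gridengine component shows up in this Open MPI build.
# As Ralph notes, an empty result does NOT prove qrsh won't be used: the
# rsh launcher itself can fall back to `qrsh -inherit`.
ompi_info | grep -i grid

# Inside an SGE job, these are the kinds of variables the launcher inspects
# to decide it is running under SGE:
env | grep -E '^(SGE_ROOT|JOB_ID|PE_HOSTFILE|NSLOTS)='
```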
Got it - thx. - Reuti

>> Any output?
>>
>>> My hope is that Open MPI will not attempt to use SGE in any way. But
>>> perhaps it is trying to.
>>>
>>> Yes, I did supply a machinefile on my own. It is formed on the fly within
>>> the submitted script by parsing the PE_HOSTFILE, and I leave the
>>
>> Parsing the PE_HOSTFILE and preparing it in a format suitable for the
>> actual parallel library is usually done in start_proc_args, so it happens
>> once for all users and applications using this parallel library. With a
>> tight integration it can be set to NONE, though.
>>
>>> resulting file lying around, and the result appears to be correct, i.e.
>>> it includes those nodes (and only those nodes) allocated to the job.
>>
>> Well, even without compiling --with-sge you can achieve a so-called tight
>> integration, and that can confuse the startup. What does your PE look
>> like? Depending on whether Open MPI starts a task on the master node of
>> the job via a local `qrsh -inherit ...`, job_is_first_task needs to be set
>> to FALSE (this allows one `qrsh -inherit ...` call to be made locally).
>> But if all is fine, the job script is already the first task and TRUE
>> should work.
>>
>> -- Reuti
>>
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>> Behalf Of Reuti
>>> Sent: Tuesday, September 13, 2011 4:27 PM
>>> To: Open MPI Users
>>> Subject: EXTERNAL: Re: [OMPI users] Problem running under SGE
>>>
>>> On 13.09.2011 at 23:18, Blosch, Edwin L wrote:
>>>
>>>> I'm able to run this command below from an interactive shell window:
>>>>
>>>> <path>/bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent
>>>> /usr/bin/rsh -x MPI_ENVIRONMENT=1 ./test_setup
>>>>
>>>> but it does not work if I put it into a shell script and 'qsub' that
>>>> script to SGE. I get the message shown at the bottom of this post.
>>>>
>>>> I've tried everything I can think of. I would welcome any hints on how
>>>> to proceed.
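The PE_HOSTFILE reformatting described above is typically a one-liner; this is a sketch, not the poster's actual script, and it assumes the classic four-column SGE hostfile layout (host, slot count, queue@host, processor range).

```shell
# SGE writes one line per granted host to $PE_HOSTFILE, classically:
#   hostname  slots  queue@hostname  processor-range
# Convert that into an Open MPI machinefile of the form "host slots=N".
awk '{ print $1 " slots=" $2 }' "$PE_HOSTFILE" > mpihosts.dat
```

Note that with a tight integration this conversion is often unnecessary, since mpirun can take the allocation from SGE directly; the PE settings Reuti asks about (start_proc_args, job_is_first_task) can be inspected with `qconf -sp <pe_name>`.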
>>>>
>>>> For what it's worth, this Open MPI is 1.4.3 and I built it on another
>>>> system. I am setting and exporting OPAL_PREFIX and, as I said, all works
>>>> fine interactively, just not in batch. It was built with
>>>> --disable-shared and I don't see any shared libs under openmpi/lib, and
>>>> I've done 'ldd' from within the script on both the application
>>>> executable and the orterun command; no unresolved shared libraries. So I
>>>> don't think the error message hinting at LD_LIBRARY_PATH issues is
>>>> pointing me in the right direction.
>>>>
>>>> Thanks for any guidance,
>>>>
>>>> Ed
>>>
>>> Oh, I missed this:
>>>
>>>> error: executing task of job 139362 failed: execution daemon on host
>>>> "f8312" didn't accept task
>>>
>>> Did you supply a machinefile on your own? In a proper SGE integration
>>> it's running in a parallel environment. Did you define and request one?
>>> The error looks like the job was started in a PE, but tried to access a
>>> node not granted to the actual job.
>>>
>>> -- Reuti
>>>
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid 2818) died unexpectedly with status 1 while attempting
>>>> to launch so we are aborting.
>>>>
>>>> There may be more information reported by the environment (see above).
>>>>
>>>> This may be because the daemon was unable to find all the needed shared
>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>> location of the shared libraries on the remote nodes and this will
>>>> automatically be forwarded to the remote nodes.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
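Reuti's diagnosis points at the submission side: under SGE the job should request a parallel environment rather than hand mpirun an rsh agent and a self-built machinefile. A minimal submit script might look like the sketch below; the PE name "mpi" is an assumption for illustration and must match a PE actually configured on the cluster, and `<path>` stands for the poster's unspecified install prefix.

```shell
#!/bin/sh
# Hypothetical SGE submit script for the mpirun command from this thread.
#$ -pe mpi 16
#$ -cwd

# With a tight integration, mpirun learns the granted nodes from SGE's
# $PE_HOSTFILE itself and starts remote daemons via `qrsh -inherit`,
# so no --machinefile or plm_rsh_agent override is needed.
export OPAL_PREFIX=<path>
<path>/bin/mpirun -np $NSLOTS -x MPI_ENVIRONMENT=1 ./test_setup
```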
>>>> --------------------------------------------------------------------------
>>>> mpirun: clean termination accomplished
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users