On Thu, 15 Mar 2012 at 12:44am, Reuti wrote
Which version of SGE are you using? The traditional rsh startup was
replaced by the builtin startup some time ago (although it should still
work).
We're currently running the rather ancient 6.1u4 (due to the "If it ain't
broke..." philosophy). The hardware for our new queue master recently
arrived and I'll soon be upgrading to the most recent Open Grid Scheduler
release. Are you saying that the upgrade with the new builtin startup
method should avoid this problem?
Maybe this already shows the problem: there are two `qrsh -inherit` calls,
as Open MPI thinks these are different machines (I ran with only one slot
on each host, hence didn't see it at first, but can reproduce it now). But
for SGE both may end up in the same queue, overwriting the openmpi-session
in $TMPDIR.
Although it's running: do you get all the output? If I request 4 slots and
get one from each queue on both machines, mpihello outputs only 3 lines:
the "Hello World from Node 3" line is always missing.
I do seem to get all the output -- there are indeed 64 Hello World lines.
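For anyone who wants to check this without counting lines by eye, here is a
minimal sketch in Python. It assumes the job output uses the "Hello World
from Node N" format quoted above and that ranks are numbered from 0; the
function name missing_ranks is made up for illustration and is not part of
SGE or Open MPI.

```python
import re

def missing_ranks(output, expected):
    """Return the ranks in 0..expected-1 that never printed a hello line.

    Assumes each rank prints 'Hello World from Node <rank>' and that
    ranks are numbered from 0 (an assumption, not confirmed in the thread).
    """
    seen = {int(m.group(1))
            for m in re.finditer(r"Hello World from Node (\d+)", output)}
    return [rank for rank in range(expected) if rank not in seen]
```

Feeding it the captured job output together with the number of requested
slots gives the list of ranks whose line never showed up; an empty list
means every rank reported in.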
Thanks again for all the help on this. This is one of the most productive
exchanges I've had on a mailing list in far too long.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF