On Wed, 14 Mar 2012 at 5:50pm, Ralph Castain wrote:
On Mar 14, 2012, at 5:44 PM, Reuti wrote:
(I was just typing when Ralph's message came in: I can confirm this.
Avoiding it would mean that Open MPI has to collect all lines from the
hostfile that refer to the same machine. SGE creates one entry for each
queue/host pair in the machine file.)
Hmmm…I can take a look at the allocator module and see why we aren't
doing it. Would the host names be the same for the two queues?
I can't speak authoritatively like Reuti can, but here's what a hostfile
looks like on my cluster (note that all our name resolution is done via
/etc/hosts -- there's no DNS involved):
iq103 8 lab.q@iq103 <NULL>
iq103 1 test.q@iq103 <NULL>
iq104 8 lab.q@iq104 <NULL>
iq104 1 test.q@iq104 <NULL>
opt221 2 lab.q@opt221 <NULL>
opt221 1 test.q@opt221 <NULL>
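
As a rough illustration of the merging Reuti describes, the Python
sketch below sums the slot counts of all hostfile lines that share a
host name, using the entries above as input. It is not Open MPI's
allocator code; the function name merge_pe_hostfile is hypothetical,
and the sketch assumes the first column is the host name and the second
is the slot count.

```python
from collections import OrderedDict


def merge_pe_hostfile(lines):
    """Sum slot counts per host, keeping hosts in first-seen order.

    Hypothetical helper: assumes each line looks like
    "<host> <slots> <queue@host> <processor range>".
    """
    slots = OrderedDict()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, count = fields[0], int(fields[1])
        slots[host] = slots.get(host, 0) + count
    return list(slots.items())


sample = """\
iq103 8 lab.q@iq103 <NULL>
iq103 1 test.q@iq103 <NULL>
iq104 8 lab.q@iq104 <NULL>
iq104 1 test.q@iq104 <NULL>
opt221 2 lab.q@opt221 <NULL>
opt221 1 test.q@opt221 <NULL>
"""

merged = merge_pe_hostfile(sample.splitlines())
for host, count in merged:
    print(host, count)
```

With the hostfile above, this gives 9 slots each for iq103 and iq104 and
3 slots for opt221, i.e. one entry per machine rather than one per
queue/host pair.
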
@Ralph: it could work if SGE had a facility to request the desired
queue in `qrsh -inherit ...`, because then the $TMPDIR would be unique
for each orted again (assuming it's using different ports for each).
Gotcha! I suspect getting the allocator to handle this cleanly is the
better solution, though.
If I can help (testing patches, e.g.), let me know.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF