I would try attaching to the processes to see where things are
getting stuck.
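For example, something along these lines (assuming gdb is available on
the nodes; the executable name and PID below are just placeholders):

    # find the PIDs of the MPI processes on the node
    ps -ef | grep my_mpi_program

    # attach to the one that appears stuck and look at its backtrace
    gdb -p 12345
    (gdb) thread apply all bt

The backtrace usually shows which MPI call each process is blocked in.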
On Oct 31, 2007, at 5:51 AM, Murat Knecht wrote:
Jeff Squyres wrote:
On Oct 31, 2007, at 1:18 AM, Murat Knecht wrote:
Yes I am (master and child 1 are running on the same machine). But
knowing about the oversubscription issue, I am using
mpi_yield_when_idle, which should fix precisely this problem, right?
It won't *fix* the problem -- you're still oversubscribing the
nodes, so things will run slowly. But it should help, in that the
processes will yield regularly.
Yes. By "fix" I meant solving the blocking problem by letting other
processes get some CPU time.
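(For reference, I understand the parameter can be set either on the
mpirun command line or via the environment; the hostfile and
executable names below are just placeholders:

    mpirun --mca mpi_yield_when_idle 1 --hostfile myhosts -np 2 ./master

    # or, equivalently:
    export OMPI_MCA_mpi_yield_when_idle=1
)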
What version of OMPI are you using?
I am using 1.2.4.
I did give both machines multiple slots, so Open MPI "knows" that
more oversubscription may arise.
I'm not sure what you mean by this -- you should not "lie" to OMPI
and tell it that it has more slots than it physically does. But
keep in mind that, as I described in my first mail, OMPI does not
currently re-compute the number of processes on a host as you
spawn (which can lead to the oversubscription problem). If you're
explicitly setting yield_when_idle, that *may* help, but we may or
may not be explicitly propagating that value to spawned
processes... I'll have to check.
In the hostfile I specified, for each host, the number of physically
available cores. Together with the "yield" setting, I hoped the
oversubscription would be recognised even if the "oversubscribing"
processes are started dynamically.
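For example, a hostfile along these lines (the hostnames and core
counts are placeholders for what I actually use):

    node01 slots=4
    node02 slots=4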
I re-checked the high/low parameter, but it seems alright. It would be
kind of odd for this to be the reason, as the problem seems to depend
on the host and the order.
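In case it helps, this is the pattern I am referring to (only a
sketch, assuming the "high/low" parameter means the high argument of
MPI_Intercomm_merge after MPI_Comm_spawn; the program names are
placeholders, not my actual code):

    /* parent side: spawn one child and merge the intercommunicator */
    MPI_Comm inter, merged;
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
    MPI_Intercomm_merge(inter, 0 /* high = 0 on the parent side */, &merged);

    /* child side: obtain the parent intercommunicator and merge */
    MPI_Comm parent, merged;
    MPI_Comm_get_parent(&parent);
    MPI_Intercomm_merge(parent, 1 /* high = 1 on the child side */, &merged);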
Thanks,
Murat
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Cisco Systems