I would try attaching to the processes to see where things are getting stuck.
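If it helps, here is a minimal sketch of one common way to do that, assuming you can rebuild: have each process print its PID and host name and then spin until a debugger is attached and clears a flag. The variable name and overall structure are only illustrative, not anything OMPI-specific.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Print enough information to find each process, then attach with
       e.g. "gdb --pid <pid>" on the host shown and look at the backtrace. */
    char host[256];
    gethostname(host, sizeof(host));
    printf("PID %d on %s waiting for debugger\n", (int) getpid(), host);
    fflush(stdout);

    /* Spin until a debugger attaches and does "set var wait_for_debugger = 0". */
    volatile int wait_for_debugger = 1;
    while (wait_for_debugger) {
        sleep(1);
    }

    /* ... rest of the application (MPI_Comm_spawn, communication, etc.);
       the backtrace of whichever process hangs shows where it is stuck. */

    MPI_Finalize();
    return 0;
}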

On Oct 31, 2007, at 5:51 AM, Murat Knecht wrote:



Jeff Squyres wrote:
On Oct 31, 2007, at 1:18 AM, Murat Knecht wrote:
Yes I am (master and child 1 are running on the same machine). But knowing about the oversubscription issue, I am using mpi_yield_when_idle, which should fix precisely this problem, right?
It won't *fix* the problem -- you're still oversubscribing the nodes, so things will run slowly. But it should help, in that the processes will yield regularly.
Yes. By "fix" I meant "solving the blocking problem by letting other processes get some CPU time".

What version of OMPI are you using?
I am using 1.2.4.

I did give both machines multiple slots, so Open MPI "knows" that more oversubscription may arise.
I'm not sure what you mean by this -- you should not "lie" to OMPI and tell it that it has more slots than it physically does. But keep in mind that, as I described in my first mail, OMPI does not currently re-compute the number of processes on a host as you spawn (which can lead to the oversubscription problem). If you're explicitly setting yield_when_idle, that *may* help, but we may or may not be explicitly propagating that value to spawned processes... I'll have to check.
In the hostfile I specified, for each host, the number of physically available cores. Together with the "yield" setting, I hoped the oversubscription would be recognised even when the "oversubscribing" processes are started dynamically. I re-checked the high/low parameter, but it does seem alright. It would be rather awkward for this to be the cause, as the problem seems to depend on the host and the order.
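For concreteness, a minimal sketch of the kind of dynamic spawn being discussed, assuming a single master that spawns one child. The program names, the mpirun line, and the hostfile contents in the comments are only illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hypothetical setup: started with something like
       "mpirun --hostfile hosts --mca mpi_yield_when_idle 1 -np 1 ./master",
       where the hostfile lists the physical core count per host,
       e.g. "hostA slots=2".  The spawned child may land on the same host
       as the master, which is where the oversubscription question arises. */
    MPI_Comm child;
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);

    /* Simple handshake: the child would call MPI_Comm_get_parent() and
       MPI_Send() a value to rank 0 of the parent. */
    if (rank == 0) {
        int answer = 0;
        MPI_Recv(&answer, 1, MPI_INT, 0, 0, child, MPI_STATUS_IGNORE);
        printf("master got %d from spawned child\n", answer);
    }

    MPI_Comm_disconnect(&child);
    MPI_Finalize();
    return 0;
}

Whether the yield_when_idle setting from the initial mpirun line carries over to the process spawned here is exactly the propagation question raised above.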

Thanks,
Murat
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
