On Mar 5, 2010, at 2:38 PM, Ralph Castain wrote:

>> CALL SYSTEM("cd " // TRIM(dir) // " ; mpirun -machinefile ./machinefile -np 
>> 1 /home01/group/Execute/DLPOLY.X > job.out 2> job.err ; cd - > /dev/null")
> 
> That is guaranteed not to work. The problem is that mpirun sets environment 
> variables for the original launch. Your system call carries over those 
> envars, causing mpirun to become confused.

You should be able to use MPI_COMM_SPAWN to launch this MPI job.  Check the man 
page for MPI_COMM_SPAWN; I believe we have info keys to specify things like 
which hosts to launch on, etc.
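
For example, a rough single-parent sketch in Fortran -- the host name, the 
working directory, and the exact info keys are placeholders/assumptions on my 
part ("host" and "wdir" are the keys the MPI standard reserves for spawn), so 
check the man page for what your Open MPI version actually honors:

    program spawn_dlpoly
       use mpi
       implicit none
       integer :: info, intercomm, ierr
       integer :: errcodes(1)

       call MPI_INIT(ierr)

       ! "host" picks the node to launch on; "wdir" sets the child's
       ! working directory, replacing the "cd" in the CALL SYSTEM line.
       ! "node01" and the directory below are placeholders.
       call MPI_INFO_CREATE(info, ierr)
       call MPI_INFO_SET(info, "host", "node01", ierr)
       call MPI_INFO_SET(info, "wdir", "/home01/group/rundir", ierr)

       ! Launch one copy of DLPOLY.X; intercomm connects us to it.
       call MPI_COMM_SPAWN("/home01/group/Execute/DLPOLY.X", &
                           MPI_ARGV_NULL, 1, info, 0, MPI_COMM_SELF, &
                           intercomm, errcodes, ierr)
       call MPI_INFO_FREE(info, ierr)

       ! ... wait for the child here (see the sketch further down),
       ! since MPI_COMM_DISCONNECT needs the child to disconnect too ...

       call MPI_COMM_DISCONNECT(intercomm, ierr)
       call MPI_FINALIZE(ierr)
    end program spawn_dlpoly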

>> Do you think MPI_COMM_SPAWN can help?
> 
> It's the only method supported by the MPI standard. If you need it to block 
> until this new executable completes, you could use a barrier or other MPI 
> mechanism to detect when it finishes.

I believe the user said they want the new job to run on the same cores that 
their original MPI job occupies -- they basically want the old job to block 
until the new job completes.  Keep in mind that OMPI busy-polls while waiting 
for progress, so you might actually get hosed here (two procs competing for 
time on the same core).

I can't immediately think of a good way to avoid this issue -- perhaps you 
could kludge something up such that the parent job sleeps in a loop, checking 
each time it wakes whether a message has arrived from the child (i.e., the 
last thing the child does before it calls MPI_FINALIZE is send a message to 
its parent and then MPI_COMM_DISCONNECT from the parent).  Once the parent 
sees a message from the child(ren), it can MPI_COMM_DISCONNECT and continue 
processing.
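
Roughly, on the parent side -- a sketch only, picking up the intercomm from 
the spawn example above; note that sleep() is a GNU Fortran extension, and 
the zero-length "done" message is just one way to wire up the protocol:

    ! Wait for the child's "done" message without busy-polling:
    ! sleep between MPI_IPROBE calls so the child gets the core.
    logical :: flag
    integer :: status(MPI_STATUS_SIZE), dummy, ierr

    flag = .false.
    do while (.not. flag)
       call sleep(1)   ! GNU extension; substitute your system's sleep
       call MPI_IPROBE(MPI_ANY_SOURCE, MPI_ANY_TAG, intercomm, flag, &
                       status, ierr)
    end do

    ! Drain the zero-length "done" message, then disconnect and go on.
    call MPI_RECV(dummy, 0, MPI_INTEGER, status(MPI_SOURCE), &
                  status(MPI_TAG), intercomm, MPI_STATUS_IGNORE, ierr)
    call MPI_COMM_DISCONNECT(intercomm, ierr)

The child would do the mirror image just before MPI_FINALIZE: 
MPI_COMM_GET_PARENT, a zero-length MPI_SEND on that intercommunicator, then 
MPI_COMM_DISCONNECT.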

Kinda hacky, but it might work...?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

