Hi Ralph, I found another corner-case hang in openmpi-1.7.5rc3.
Conditions:
1. Allocate several nodes through a resource manager such as TORQUE.
2. Run the job on the head node only, using the -host or
-hostfile option.
Example:
1. Allocate node05 and node06 using TORQUE.
2. Request only node05 with the -host option.
[mishima@manage ~]$ qsub -I -l nodes=node05+node06
qsub: waiting for job 8661.manage.cluster to start
qsub: job 8661.manage.cluster ready
[mishima@node05 ~]$ cat $PBS_NODEFILE
node05
node06
[mishima@node05 ~]$ mpirun -np 1 -host node05 ~/mis/openmpi/demos/myprog
<< hang here >>
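My proposed fix for plm_base_launch_support.c is below. On the "only HNP left" path of setup_vm, it marks the daemons as reported so that the launch can proceed: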
--- plm_base_launch_support.c 2014-03-12 05:51:45.000000000 +0900
+++ plm_base_launch_support.try.c 2014-03-18 08:38:03.000000000 +0900
@@ -1662,7 +1662,11 @@
OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
"%s plm:base:setup_vm only HNP left",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
+ /* cleanup */
OBJ_DESTRUCT(&nodes);
+ /* mark that the daemons have reported so we can proceed */
+ daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED;
+ daemons->updated = false;
return ORTE_SUCCESS;
}
Tetsuya