Hi Ralph, I found another corner-case hang in openmpi-1.7.5rc3.

Conditions:
1. Allocate some nodes using a resource manager (RM) such as TORQUE.
2. When executing the job, request only the head node with the
   -host or -hostfile option.

Example:
1. Allocate node05 and node06 using TORQUE.
2. Request only node05 with the -host option.

[mishima@manage ~]$ qsub -I -l nodes=node05+node06
qsub: waiting for job 8661.manage.cluster to start
qsub: job 8661.manage.cluster ready

[mishima@node05 ~]$ cat $PBS_NODEFILE
node05
node06
[mishima@node05 ~]$ mpirun -np 1 -host node05 ~/mis/openmpi/demos/myprog
<< hang here >>

My fix for plm_base_launch_support.c is as follows:
--- plm_base_launch_support.c   2014-03-12 05:51:45.000000000 +0900
+++ plm_base_launch_support.try.c       2014-03-18 08:38:03.000000000 +0900
@@ -1662,7 +1662,11 @@
         OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
                              "%s plm:base:setup_vm only HNP left",
                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
+        /* cleanup */
         OBJ_DESTRUCT(&nodes);
+        /* mark that the daemons have reported so we can proceed */
+        daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED;
+        daemons->updated = false;
         return ORTE_SUCCESS;
     }
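
To illustrate why I think the missing state update leads to the hang, here is a toy model. It is not ORTE code: every toy_* name is a hypothetical stand-in, and it assumes the launch continues only once the daemon job's state says the daemons have reported.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for ORTE's job states; not the real definitions. */
typedef enum {
    TOY_STATE_INIT,
    TOY_STATE_DAEMONS_REPORTED
} toy_state_t;

/* Hypothetical stand-in for the daemon job object. */
typedef struct {
    toy_state_t state;
    bool updated;
} toy_job_t;

/* Models the "only HNP left" early return. With apply_fix == false the
 * function returns success without touching the job state (the old
 * behavior); with apply_fix == true it marks the daemons as reported
 * first (the patched behavior). */
static int toy_setup_vm(toy_job_t *daemons, bool apply_fix)
{
    if (apply_fix) {
        daemons->state = TOY_STATE_DAEMONS_REPORTED;
        daemons->updated = false;
    }
    return 0;
}

/* Assumption: the caller continues the launch only once the daemon job
 * is in the "reported" state; otherwise it keeps waiting. */
static bool toy_launch_would_wait(bool apply_fix)
{
    toy_job_t daemons = { TOY_STATE_INIT, true };
    (void)toy_setup_vm(&daemons, apply_fix);
    return daemons.state != TOY_STATE_DAEMONS_REPORTED;
}
```

In this model, toy_launch_would_wait(false) reports that the launch keeps waiting even though setup returned success, while toy_launch_would_wait(true) lets it proceed, which is the behavior the patch above aims for.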

Tetsuya
