Re: [OMPI users] another corner case hangup in openmpi-1.7.5rc3

2014-03-24 Thread Ralph Castain
The "updated" field in the orte_job_t structure is only used to help reduce the size of the launch message sent to all the daemons. Basically, we only include info on jobs that have been changed - thus, it only gets used when the app calls comm_spawn. After every launch, we automatically change i

Re: [OMPI users] another corner case hangup in openmpi-1.7.5rc3

2014-03-18 Thread tmishima
I confirmed that your fix works well for me. However, I think we should at least add the line "daemons->updated = false;" in the last if-clause, although I'm not sure how the variable is used. Is that okay, Ralph? Tetsuya > Understood, and your logic is correct. It's just that I'd rather each launcher de

Re: [OMPI users] another corner case hangup in openmpi-1.7.5rc3

2014-03-17 Thread Ralph Castain
Understood, and your logic is correct. It's just that I'd rather each launcher decide to declare the daemons as reported rather than doing it in the common code, just in case someone writes a launcher where they choose to respond differently to the case where no new daemons need to be launched.
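
The division of responsibility described above can be sketched roughly as follows. This is a minimal illustration and not the actual ORTE code: the names daemon_state_t, count_nodes_needing_daemon, and example_launcher_launch are hypothetical, and a plain array of flags stands in for the real node list. The common helper only reports how many nodes still need a daemon; the launcher alone decides that "nothing to launch" means the daemons can be declared as reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical daemon states; not Open MPI's actual job-state values. */
typedef enum {
    DAEMONS_PENDING,   /* waiting for newly launched daemons to report */
    DAEMONS_REPORTED   /* every required daemon is already running */
} daemon_state_t;

/* Common code (assumed shape): count the nodes that still need a daemon.
 * It deliberately makes no decision about what an empty result means. */
size_t count_nodes_needing_daemon(const bool *has_daemon, size_t num_nodes)
{
    assert(has_daemon != NULL || num_nodes == 0);
    size_t needed = 0;
    for (size_t i = 0; i < num_nodes; i++) {
        if (!has_daemon[i]) {
            needed++;
        }
    }
    return needed;
}

/* Launcher-specific policy (hypothetical launcher): this launcher chooses
 * to declare the daemons as reported when no new daemons are required.
 * Another launcher could respond differently to the same situation. */
daemon_state_t example_launcher_launch(const bool *has_daemon, size_t num_nodes)
{
    size_t needed = count_nodes_needing_daemon(has_daemon, num_nodes);
    if (0 == needed) {
        return DAEMONS_REPORTED;
    }
    /* A real launcher would start the missing daemons here. */
    return DAEMONS_PENDING;
}
```

In this sketch, a job that requests only the head node (where a daemon already runs) takes the DAEMONS_REPORTED branch instead of waiting for daemons that will never be launched.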

Re: [OMPI users] another corner case hangup in openmpi-1.7.5rc3

2014-03-17 Thread tmishima
I do not understand your fix yet, but it would be better, I guess. I'll check it later, but now please let me explain what I thought: If some nodes are allocated, it doesn't go through this part because opal_list_get_size(&nodes) > 0 at this location. 1590 if (0 == opal_list_get_size(&nodes)

Re: [OMPI users] another corner case hangup in openmpi-1.7.5rc3

2014-03-17 Thread Ralph Castain
Hmm...no, I don't think that's the correct patch. We want that function to remain "clean" as its job is simply to construct the list of nodes for the VM. It's the responsibility of the launcher to decide what to do with it. Please see https://svn.open-mpi.org/trac/ompi/ticket/4408 for a fix Ra

[OMPI users] another corner case hangup in openmpi-1.7.5rc3

2014-03-17 Thread tmishima
Hi Ralph, I found another corner case hangup in openmpi-1.7.5rc3. Condition: 1. allocate some nodes using an RM such as TORQUE. 2. request only the head node when executing the job with the -host or -hostfile option. Example: 1. allocate node05,node06 using TORQUE. 2. request node05 only with -host op