FWIW: I have fixed this on the development trunk, and Jeff has scheduled it for 
the upcoming 1.6 release (when the 1.5 series rolls over). I don't expect we'll 
backport it to 1.4 unless someone really needs it there.

Thanks!
Ralph

On Feb 1, 2012, at 9:31 AM, Ralph Castain wrote:

> Ah - crud. Looks like the default-hostfile mca param isn't getting set to the 
> default value. Will resolve - thanks!
> 
> On Feb 1, 2012, at 9:28 AM, Reuti wrote:
> 
>> Am 01.02.2012 um 17:16 schrieb Ralph Castain:
>> 
>>> Could you add --display-allocation to your cmd line? This will tell us if 
>>> it found/read the default hostfile, or if the problem is with the mapper.
>> 
>> Sure:
>> 
>> reuti@pc15370:~> mpiexec --display-allocation -np 4 ./mpihello
>> 
>> ======================   ALLOCATED NODES   ======================
>> 
>> Data for node: Name: pc15370 Num slots: 1    Max slots: 0
>> 
>> =================================================================
>> Hello World from Node 0.
>> Hello World from Node 1.
>> Hello World from Node 2.
>> Hello World from Node 3.
>> 
>> (Nothing in `strace` about accessing something with "default" in the name.)
>> 
>> 
>> reuti@pc15370:~> mpiexec --default-hostfile 
>> local/openmpi-1.4.4-thread/etc/openmpi-default-hostfile --display-allocation 
>> -np 4 ./mpihello
>> 
>> ======================   ALLOCATED NODES   ======================
>> 
>> Data for node: Name: pc15370 Num slots: 2    Max slots: 0
>> Data for node: Name: pc15381 Num slots: 2    Max slots: 0
>> 
>> =================================================================
>> Hello World from Node 0.
>> Hello World from Node 3.
>> Hello World from Node 2.
>> Hello World from Node 1.
>> 
>> Specifying it explicitly works fine, with the correct distribution visible in `ps`.
>> 
>> -- Reuti
>> 
>> 
>>> On Feb 1, 2012, at 7:58 AM, Reuti wrote:
>>> 
>>>> Am 01.02.2012 um 15:38 schrieb Ralph Castain:
>>>> 
>>>>> On Feb 1, 2012, at 3:49 AM, Reuti wrote:
>>>>> 
>>>>>> Am 31.01.2012 um 21:25 schrieb Ralph Castain:
>>>>>> 
>>>>>>> On Jan 31, 2012, at 12:58 PM, Reuti wrote:
>>>>>> 
>>>>>> BTW: is there any default hostfile for Open MPI - I mean one in my home 
>>>>>> directory or /etc? When I check `man orte_hosts` and all possible 
>>>>>> options are unset (like in a singleton run), it will only run locally 
>>>>>> (the job is co-located with mpirun).
>>>>> 
>>>>> Yep - it is <prefix>/etc/openmpi-default-hostfile
>>>> 
>>>> Thanks for replying, Ralph.
>>>> 
>>>> I spotted it too, but it is not working for me - neither with mpiexec from 
>>>> the command line nor as a singleton. I also tried a plain /etc as the 
>>>> location of this file.
>>>> 
>>>> reuti@pc15370:~> which mpicc
>>>> /home/reuti/local/openmpi-1.4.4-thread/bin/mpicc
>>>> reuti@pc15370:~> cat 
>>>> /home/reuti/local/openmpi-1.4.4-thread/etc/openmpi-default-hostfile
>>>> pc15370 slots=2
>>>> pc15381 slots=2
>>>> reuti@pc15370:~> mpicc -o mpihello mpihello.c
>>>> reuti@pc15370:~> mpiexec -np 4 ./mpihello
>>>> Hello World from Node 0.
>>>> Hello World from Node 1.
>>>> Hello World from Node 2.
>>>> Hello World from Node 3.
>>>> 
>>>> But everything runs locally (no spawn here, just the traditional mpihello):
>>>> 
>>>> 19503 ?        Ss     0:00 /usr/sbin/sshd -o PidFile=/var/run/sshd.init.pid
>>>> 11583 ?        Ss     0:00  \_ sshd: reuti [priv]
>>>> 11585 ?        S      0:00  |   \_ sshd: reuti@pts/6
>>>> 11587 pts/6    Ss     0:00  |       \_ -bash
>>>> 13470 pts/6    S+     0:00  |           \_ mpiexec -np 4 ./mpihello
>>>> 13471 pts/6    R+     0:00  |               \_ ./mpihello
>>>> 13472 pts/6    R+     0:00  |               \_ ./mpihello
>>>> 13473 pts/6    R+     0:00  |               \_ ./mpihello
>>>> 13474 pts/6    R+     0:00  |               \_ ./mpihello
>>>> 
>>>> -- Reuti
>>>> 
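For reference, a minimal sketch of what the "traditional mpihello" compiled and 
run above presumably looks like. The exact source is not included in this 
thread; the output format is inferred from the "Hello World from Node N." 
lines, so treat this as an assumption:

/* mpihello.c - minimal MPI hello world (assumed reconstruction, not the
 * original source from this thread).
 * Build: mpicc -o mpihello mpihello.c
 * Run:   mpiexec -np 4 ./mpihello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank within MPI_COMM_WORLD */
    printf("Hello World from Node %d.\n", rank);
    MPI_Finalize();
    return 0;
}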
>>>> 
>>>>>>> We probably aren't correctly marking the original singleton on that 
>>>>>>> node, and so the mapper thinks there are still two slots available on 
>>>>>>> the original node.
>>>>>> 
>>>>>> Okay. There is something to discuss/fix. BTW: if started as a singleton, 
>>>>>> I get an error at the end with the program the OP provided:
>>>>>> 
>>>>>> [pc15381:25502] [[12435,0],1] routed:binomial: Connection to lifeline 
>>>>>> [[12435,0],0] lost
>>>>> 
>>>>> Okay, I'll take a look at it - but it may take a while before I can 
>>>>> address either issue, as other priorities loom.
>>>>> 
>>>>>> 
>>>>>> It's not the case if run by mpiexec.
>>>>>> 
>>>>>> -- Reuti
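
Since the lifeline error above only shows up when the OP's program is started 
as a singleton (i.e. launched directly, without mpiexec) and then spawns 
additional processes, a minimal singleton-spawn test along those lines might 
look like the sketch below. This is only an assumption about what the OP's 
program does - its actual source is not part of this thread, and the file name 
spawn_test.c is made up for illustration:

/* spawn_test.c - hypothetical singleton-spawn test (the OP's real program
 * is not shown in this thread; this only illustrates the scenario).
 * Build: mpicc -o spawn_test spawn_test.c
 * Run as a singleton:  ./spawn_test        (no mpiexec in front)
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, children;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (parent == MPI_COMM_NULL) {
        /* We were started as a singleton: spawn two copies of ourselves.
         * Where they end up depends on the resolved hostfile/allocation. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
        printf("Parent (rank %d) spawned 2 children.\n", rank);
    } else {
        printf("Child rank %d started via MPI_Comm_spawn.\n", rank);
    }

    MPI_Finalize();
    return 0;
}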