Let me take a look later today on my Linux box.

On Feb 9, 2013, at 10:42 AM, Eugene Loh <eugene....@oracle.com> wrote:

> On 02/09/13 00:32, Ralph Castain wrote:
>> On Feb 6, 2013, at 2:59 PM, Eugene Loh <eugene....@oracle.com> wrote:
>>> On 02/06/13 04:29, Siegmar Gross wrote:
>>>> thank you very much for your answer. I have compiled your program
>>>> and get different behaviours for openmpi-1.6.4rc3 and openmpi-1.9.
>>> I think what's happening is that although you specified "0:0" or "0:1" in 
>>> the rankfile, the string "0,0" or "0,1" is getting passed in (at least in 
>>> the runs I looked at).  That colon became a comma.  So, it's just by 
>>> accident that myrankfile_0 is working out all right.
>>> 
>>> Could someone who knows the code better than I do help me narrow this down? 
>>> E.g., where is the rankfile parsed?  For what it's worth, by the time 
>>> mpirun reaches orte_odls_base_default_get_add_procs_data(), orte_job_data 
>>> already contains the corrupted cpu_bitmap string.
>> You'll want to look at orte/mca/rmaps/rank_file/rmaps_rank_file.c. The 
>> bitmap is now computed in mpirun and then sent to the daemons.
> 
> Actually, I'm getting lost in this code.  Anyhow, I don't think the problem 
> is related to Solaris.  I think it also occurs on Linux.  E.g., I can 
> reproduce the problem with 1.9a1r28035 on Linux using GCC compilers.
> 
> Siegmar: can you confirm this is also a problem on Linux?  E.g., with OMPI 
> 1.9, on one of your Linux nodes (linpc0?) try
> 
>    % cat myrankfile
>    rank 0=linpc0 slot=0:1
>    % mpirun --report-bindings --rankfile myrankfile numactl --show
> 
> For me, the binding I get is not 0:1 but 0,1.
> 
> Could someone else take a look at this?
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

