Okay, so this is a Sparc issue, not a rankfile one. I'm afraid my lack of time 
and access to that platform will mean this won't get fixed for 1.7.4, but I'll 
try to take a look at it when time permits.


On Jan 22, 2014, at 10:52 PM, Siegmar Gross 
<siegmar.gr...@informatik.hs-fulda.de> wrote:

> Dear Ralph,
> 
> The same problems occur without rankfiles.
> 
> tyr fd1026 102 which mpicc
> /usr/local/openmpi-1.7.4_64_cc/bin/mpicc
> 
> tyr fd1026 103 mpiexec --report-bindings -np 2 \
>  -host tyr,sunpc1 hostname
> 
> tyr fd1026 104 /opt/solstudio12.3/bin/sparcv9/dbx \
>  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec 
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message
>  7.9' in your .dbxrc
> Reading mpiexec
> Reading ld.so.1
> Reading libopen-rte.so.7.0.0
> Reading libopen-pal.so.6.1.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) check -all
> access checking - ON
> memuse checking - ON
> (dbx) run --report-bindings -np 2 -host tyr,sunpc1 hostname
> Running: mpiexec --report-bindings -np 2 -host tyr,sunpc1 hostname 
> (process id 26792)
> Reading rtcapihook.so
> Reading libdl.so.1
> Reading rtcaudit.so
> Reading libmapmalloc.so.1
> Reading libgen.so.1
> Reading libc_psr.so.1
> Reading rtcboot.so
> Reading librtc.so
> Reading libmd_psr.so.1
> RTC: Enabling Error Checking...
> RTC: Using UltraSparc trap mechanism
> RTC: See `help rtc showmap' and `help rtc limitations' for details.
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0xffffffff7fffc85b
>    which is 459 bytes above the current stack pointer
> Variable is 'cwd'
> t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
>   65           if (0 != strcmp(pwd, cwd)) {
> (dbx) quit
> 
> 
> tyr fd1026 105 ssh sunpc1
> ...
> sunpc1 fd1026 102 mpiexec --report-bindings -np 2 \
>  -host tyr,sunpc1 hostname
> 
> sunpc1 fd1026 103 /opt/solstudio12.3/bin/amd64/dbx \
>  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message
>  7.9' in your .dbxrc
> Reading mpiexec
> Reading ld.so.1
> Reading libopen-rte.so.7.0.0
> Reading libopen-pal.so.6.1.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) check -all
> access checking - ON
> memuse checking - ON
> (dbx) run --report-bindings -np 2 -host tyr,sunpc1 hostname
> Running: mpiexec --report-bindings -np 2 -host tyr,sunpc1 hostname 
> (process id 18806)
> Reading rtcapihook.so
> Reading libdl.so.1
> Reading rtcaudit.so
> Reading libmapmalloc.so.1
> Reading libgen.so.1
> Reading rtcboot.so
> Reading librtc.so
> RTC: Enabling Error Checking...
> RTC: Running program...
> Reading disasm.so
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0x436d57
>    which is 15 bytes into a heap block of size 16 bytes at 0x436d48
> This block was allocated from:
>        [1] vasprintf() at 0xfffffd7fdc9b335a 
>        [2] asprintf() at 0xfffffd7fdc9b3452 
>        [3] opal_output_init() at line 184 in "output.c"
>        [4] do_open() at line 548 in "output.c"
>        [5] opal_output_open() at line 219 in "output.c"
>        [6] opal_malloc_init() at line 68 in "malloc.c"
>        [7] opal_init_util() at line 250 in "opal_init.c"
>        [8] orterun() at line 658 in "orterun.c"
> 
> t@1 (l@1) stopped in do_open at line 638 in file "output.c"
>  638           info[i].ldi_prefix = strdup(lds->lds_prefix);
> (dbx) run --report-bindings -np 2 -host sunpc0,sunpc1 hostname
> Running: mpiexec --report-bindings -np 2 -host sunpc0,sunpc1 hostname 
> (process id 18857)
> RTC: Enabling Error Checking...
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0x436d57
>    which is 15 bytes into a heap block of size 16 bytes at 0x436d48
> This block was allocated from:
>        [1] vasprintf() at 0xfffffd7fdc9b335a 
>        [2] asprintf() at 0xfffffd7fdc9b3452 
>        [3] opal_output_init() at line 184 in "output.c"
>        [4] do_open() at line 548 in "output.c"
>        [5] opal_output_open() at line 219 in "output.c"
>        [6] opal_malloc_init() at line 68 in "malloc.c"
>        [7] opal_init_util() at line 250 in "opal_init.c"
>        [8] orterun() at line 658 in "orterun.c"
> 
> t@1 (l@1) stopped in do_open at line 638 in file "output.c"
>  638           info[i].ldi_prefix = strdup(lds->lds_prefix);
> (dbx) run --report-bindings -np 2 -host linpc1,sunpc1 hostname
> Running: mpiexec --report-bindings -np 2 -host linpc1,sunpc1 hostname 
> (process id 18868)
> RTC: Enabling Error Checking...
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0x436d57
>    which is 15 bytes into a heap block of size 16 bytes at 0x436d48
> This block was allocated from:
>        [1] vasprintf() at 0xfffffd7fdc9b335a 
>        [2] asprintf() at 0xfffffd7fdc9b3452 
>        [3] opal_output_init() at line 184 in "output.c"
>        [4] do_open() at line 548 in "output.c"
>        [5] opal_output_open() at line 219 in "output.c"
>        [6] opal_malloc_init() at line 68 in "malloc.c"
>        [7] opal_init_util() at line 250 in "opal_init.c"
>        [8] orterun() at line 658 in "orterun.c"
> 
> t@1 (l@1) stopped in do_open at line 638 in file "output.c"
>  638           info[i].ldi_prefix = strdup(lds->lds_prefix);
> (dbx) quit
> sunpc1 fd1026 104 exit
> logout
> tyr fd1026 106 
> 
> 
> Do you need anything else?
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> 
>> Hard to know how to address all that, Siegmar, but I'll give it
>> a shot. See below.
>> 
>> On Jan 22, 2014, at 5:34 AM, Siegmar Gross 
>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>> 
>>> Hi,
>>> 
>>> Yesterday I installed openmpi-1.7.4rc2r30323 on our machines
>>> ("Solaris 10 x86_64", "Solaris 10 Sparc", and "openSUSE Linux
>>> 12.1 x86_64" with Sun C 5.12). My rankfile "rf_linpc_sunpc_tyr"
>>> contains the following lines.
>>> 
>>> rank 0=linpc0 slot=0:0-1;1:0-1
>>> rank 1=linpc1 slot=0:0-1
>>> rank 2=sunpc1 slot=1:0
>>> rank 3=tyr slot=1:0
>>> 
>>> I get no output when I run the following command.
>>> 
>>> mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
>>> 
>>> "dbx" reports the following problem.
>>> 
>>> /opt/solstudio12.3/bin/sparcv9/dbx \
>>> /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
>>> For information about new features see `help changes'
>>> To remove this message, put `dbxenv suppress_startup_message
>>> 7.9' in your .dbxrc
>>> Reading mpiexec
>>> Reading ld.so.1
>>> ...
>>> Reading libmd.so.1
>>> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
>>> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname 
>>> (process id 22337)
>>> Reading libc_psr.so.1
>>> ...
>>> Reading mca_dfs_test.so
>>> 
>>> execution completed, exit code is 1
>>> (dbx) check -all
>>> access checking - ON
>>> memuse checking - ON
>>> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
>>> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname 
>>> (process id 22344)
>>> Reading rtcapihook.so
>>> ...
>>> RTC: Running program...
>>> Read from uninitialized (rui) on thread 1:
>>> Attempting to read 1 byte at address 0xffffffff7fffbf8b
>>>   which is 459 bytes above the current stack pointer
>>> Variable is 'cwd'
>>> t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
>>>  65           if (0 != strcmp(pwd, cwd)) {
>>> (dbx) quit
>>> 
>> 
>> This looks like a bogus issue to me. Are you able to run something
>> *without* a rankfile? In other words, is it the rankfile operation that
>> is causing the problem, or are you unable to run anything on Sparc?
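>> 
>> For what it's worth, the report is consistent with a false positive
>> from an optimized string routine: getcwd() only writes the path and
>> its terminating NUL into the stack buffer, and a strcmp() that loads
>> aligned words can touch the uninitialized tail of that buffer without
>> affecting the result. A minimal standalone sketch of the pattern (not
>> the actual opal_getcwd.c code; where 'pwd' comes from here is an
>> assumption for the sketch):
>> 
>> #include <limits.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <unistd.h>
>> 
>> int main(void)
>> {
>>     char cwd[PATH_MAX];              /* mostly uninitialized stack buffer */
>>     const char *pwd = getenv("PWD"); /* assumed source of 'pwd' */
>> 
>>     if (NULL == getcwd(cwd, sizeof(cwd))) {
>>         perror("getcwd");
>>         return 1;
>>     }
>>     /* a word-at-a-time strcmp() may read bytes of 'cwd' beyond the NUL
>>      * that getcwd() wrote; a memory checker can flag that read even
>>      * though the comparison result is unaffected */
>>     if (NULL != pwd && 0 != strcmp(pwd, cwd)) {
>>         printf("PWD and getcwd() disagree: %s vs %s\n", pwd, cwd);
>>     }
>>     return 0;
>> }
>> 
>> Running that under dbx with "check -all" on the same Sparc box would
>> show whether the same rui report appears outside of Open MPI.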
>> 
>>> 
>>> 
>>> 
>>> Rankfiles work "fine" on x86_64 architectures. Here are the contents of my rankfile.
>>> 
>>> rank 0=linpc1 slot=0:0-1;1:0-1
>>> rank 1=sunpc1 slot=0:0-1
>>> rank 2=sunpc1 slot=1:0
>>> rank 3=sunpc1 slot=1:1
>>> 
>>> 
>>> mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
>>> [sunpc1:13489] MCW rank 1 bound to socket 0[core 0[hwt 0]],
>>> socket 0[core 1[hwt 0]]: [B/B][./.]
>>> [sunpc1:13489] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
>>> [sunpc1:13489] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
>>> sunpc1
>>> sunpc1
>>> sunpc1
>>> [linpc1:29997] MCW rank 0 is not bound (or bound to all available
>>> processors)
>>> linpc1
>>> 
>>> 
>>> Unfortunately, "dbx" nevertheless reports a problem.
>>> 
>>> /opt/solstudio12.3/bin/amd64/dbx \
>>> /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
>>> For information about new features see `help changes'
>>> To remove this message, put `dbxenv suppress_startup_message 7.9'
>>> in your .dbxrc
>>> Reading mpiexec
>>> Reading ld.so.1
>>> ...
>>> Reading libmd.so.1
>>> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
>>> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname 
>>> (process id 18330)
>>> Reading mca_shmem_mmap.so
>>> ...
>>> Reading mca_dfs_test.so
>>> [sunpc1:18330] MCW rank 1 bound to socket 0[core 0[hwt 0]],
>>> socket 0[core 1[hwt 0]]: [B/B][./.]
>>> [sunpc1:18330] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
>>> [sunpc1:18330] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
>>> sunpc1
>>> sunpc1
>>> sunpc1
>>> [linpc1:30148] MCW rank 0 is not bound (or bound to all available
>>> processors)
>>> linpc1
>>> 
>>> execution completed, exit code is 0
>>> (dbx) check -all
>>> access checking - ON
>>> memuse checking - ON
>>> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
>>> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname 
>>> (process id 18340)
>>> Reading rtcapihook.so
>>> ...
>>> 
>>> RTC: Running program...
>>> Reading disasm.so
>>> Read from uninitialized (rui) on thread 1:
>>> Attempting to read 1 byte at address 0x436d57
>>>   which is 15 bytes into a heap block of size 16 bytes at 0x436d48
>>> This block was allocated from:
>>>       [1] vasprintf() at 0xfffffd7fdc9b335a 
>>>       [2] asprintf() at 0xfffffd7fdc9b3452 
>>>       [3] opal_output_init() at line 184 in "output.c"
>>>       [4] do_open() at line 548 in "output.c"
>>>       [5] opal_output_open() at line 219 in "output.c"
>>>       [6] opal_malloc_init() at line 68 in "malloc.c"
>>>       [7] opal_init_util() at line 250 in "opal_init.c"
>>>       [8] orterun() at line 658 in "orterun.c"
>>> 
>>> t@1 (l@1) stopped in do_open at line 638 in file "output.c"
>>> 638           info[i].ldi_prefix = strdup(lds->lds_prefix);
>>> (dbx) 
>>> 
>>> 
>> 
>> Again, I think dbx is just getting lost
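>> 
>> The heap report looks like the same class of thing: the read lands
>> inside the 16-byte block that asprintf() allocated, but apparently
>> past the end of the written string. That is the classic pattern when
>> the strlen() inside strdup() scans in aligned word-sized chunks and
>> touches allocated-but-never-written tail bytes. A minimal sketch of
>> the pattern (not the Open MPI code; the prefix format here is just a
>> hypothetical stand-in for whatever opal_output_init() builds):
>> 
>> /* asprintf() is a common extension; glibc wants _GNU_SOURCE for the
>>  * declaration, and the Solaris libc in the trace above provides it */
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>> 
>> int main(void)
>> {
>>     char *prefix = NULL;
>>     char *copy = NULL;
>> 
>>     /* hypothetical "[host:pid] " style prefix; if the formatted string
>>      * plus its NUL is shorter than the block malloc hands back, the
>>      * tail bytes of the block stay uninitialized */
>>     if (asprintf(&prefix, "[%s:%d] ", "sunpc1", 18806) < 0) {
>>         return 1;
>>     }
>> 
>>     /* strdup() calls strlen(), which may read aligned words and touch
>>      * those uninitialized tail bytes; the copy is still correct */
>>     copy = strdup(prefix);
>>     if (NULL != copy) {
>>         puts(copy);
>>         free(copy);
>>     }
>>     free(prefix);
>>     return 0;
>> }
>> 
>> Unless valgrind on the Linux boxes complains about the same read, I
>> would treat this one as noise as well.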
>> 
>>> 
>>> 
>>> 
>>> I can also manually bind threads on our Sun M4000 server (two quad-core
>>> Sparc VII processors with two hardware threads per core).
>>> 
>>> mpiexec --report-bindings -np 4 --bind-to hwthread hostname
>>> [rs0.informatik.hs-fulda.de:09531] MCW rank 1 bound to 
>>> socket 0[core 1[hwt 0]]: [../B./../..][../../../..]
>>> [rs0.informatik.hs-fulda.de:09531] MCW rank 2 bound to 
>>> socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
>>> [rs0.informatik.hs-fulda.de:09531] MCW rank 3 bound to 
>>> socket 1[core 5[hwt 0]]: [../../../..][../B./../..]
>>> [rs0.informatik.hs-fulda.de:09531] MCW rank 0 bound to 
>>> socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>> 
>>> 
>>> Binding to cores doesn't work. I know that it wasn't possible last
>>> summer, and it seems that it is still not possible now.
>>> 
>>> mpiexec --report-bindings -np 4 --bind-to core hostname
>>> -----------------------------------------------------------------------
>>> Open MPI tried to bind a new process, but something went wrong.  The
>>> process was killed without launching the target application.  Your job
>>> will now abort.
>>> 
>>> Local host:        rs0
>>> Application name:  /usr/local/bin/hostname
>>> Error message:     hwloc indicates cpu binding cannot be enforced
>>> Location:          
>>> ../../../../../openmpi-1.9a1r30290/orte/mca/odls/default/odls_default_module.c:500
>>> -----------------------------------------------------------------------
>>> 4 total processes failed to start
>>> 
>>> 
>>> 
>>> Is it possible to specify hwthreads in a rankfile, so that I can
>>> use a rankfile for these machines?
>> 
>> Possible - yes. Will it happen in the immediate future - no, I'm afraid
>> I'm swamped right now. However, I'll make a note of it for the future.
>> 
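>> As for the "hwloc indicates cpu binding cannot be enforced" message
>> from the --bind-to core run above: judging by the Location shown
>> there, it appears when the local launcher tries to apply the binding,
>> so it may be worth seeing what hwloc itself claims the OS supports on
>> that node. A small standalone diagnostic sketch (not part of Open
>> MPI; build it against the same hwloc your 1.7.4 install uses):
>> 
>> #include <stdio.h>
>> #include <hwloc.h>
>> 
>> int main(void)
>> {
>>     hwloc_topology_t topo;
>>     const struct hwloc_topology_support *support;
>> 
>>     if (0 != hwloc_topology_init(&topo) || 0 != hwloc_topology_load(topo)) {
>>         fprintf(stderr, "hwloc topology setup failed\n");
>>         return 1;
>>     }
>> 
>>     /* hwloc's own view of which CPU-binding operations the OS allows */
>>     support = hwloc_topology_get_support(topo);
>>     printf("set_thisproc_cpubind:   %u\n",
>>            (unsigned) support->cpubind->set_thisproc_cpubind);
>>     printf("set_proc_cpubind:       %u\n",
>>            (unsigned) support->cpubind->set_proc_cpubind);
>>     printf("set_thisthread_cpubind: %u\n",
>>            (unsigned) support->cpubind->set_thisthread_cpubind);
>> 
>>     hwloc_topology_destroy(topo);
>>     return 0;
>> }
>> 
>> If those flags come back 0 on the M4000, the core-binding failure is
>> coming from hwloc/Solaris rather than from the mapping or rankfile
>> code.
>> 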
>>> 
>>> I get the expected output if I use two M4000 servers, although the
>>> above-mentioned error still occurs.
>>> 
>>> 
>>> /opt/solstudio12.3/bin/sparcv9/dbx \
>>> /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
>>> For information about new features see `help changes'
>>> To remove this message, put `dbxenv suppress_startup_message 7.9'
>>> in your .dbxrc
>>> Reading mpiexec
>>> Reading ld.so.1
>>> ...
>>> Reading libmd.so.1
>>> (dbx) run --report-bindings --host rs0,rs1 -np 4 \
>>> --bind-to hwthread hostname
>>> Running: mpiexec --report-bindings --host rs0,rs1 -np 4
>>> --bind-to hwthread hostname 
>>> (process id 9599)
>>> Reading libc_psr.so.1
>>> ...
>>> Reading mca_dfs_test.so
>>> [rs0.informatik.hs-fulda.de:09599] MCW rank 1 bound to
>>> socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
>>> [rs0.informatik.hs-fulda.de:09599] MCW rank 0 bound to
>>> socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>> rs1.informatik.hs-fulda.de
>>> [rs1.informatik.hs-fulda.de:13398] MCW rank 2 bound to
>>> socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
>>> [rs1.informatik.hs-fulda.de:13398] MCW rank 3 bound to
>>> socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
>>> rs1.informatik.hs-fulda.de
>>> 
>>> execution completed, exit code is 0
>>> (dbx) check -all
>>> access checking - ON
>>> memuse checking - ON
>>> (dbx) run --report-bindings --host rs0,rs1 -np 4 \
>>> --bind-to hwthread hostname
>>> Running: mpiexec --report-bindings --host rs0,rs1 -np 4
>>> --bind-to hwthread hostname 
>>> (process id 9607)
>>> Reading rtcapihook.so
>>> ...
>>> RTC: Running program...
>>> Read from uninitialized (rui) on thread 1:
>>> Attempting to read 1 byte at address 0xffffffff7fffc80b
>>>   which is 459 bytes above the current stack pointer
>>> Variable is 'cwd'
>>> dbx: warning: can't find file
>>> ".../openmpi-1.7.4rc2r30323-SunOS.sparc.64_cc/opal/util/../../../
>>> openmpi-1.7.4rc2r30323/opal/util/opal_getcwd.c"
>>> dbx: warning: see `help finding-files'
>>> t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
>>> (dbx) 
>>> 
>>> 
>>> Our M4000 server has no access to the source code, which is why dbx
>>> couldn't find the file. Nevertheless, it is the same error message as
>>> above. Is it possible for someone to solve this problem? Thank you very
>>> much in advance for any help.
>>> 
>>> 
>>> Kind regards
>>> 
>>> Siegmar
>>> 
>> 
>> 
> 
