Okay, so this is a Sparc issue, not a rankfile one. I'm afraid my lack of time and access to that platform will mean this won't get fixed for 1.7.4, but I'll try to take a look at it when time permits.
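FWIW, the opal_getcwd() report quoted below looks like the classic pattern a memory checker flags on a partially initialized stack buffer: getcwd() writes only the path and its terminating NUL into the 'cwd' stack buffer, so the tail of the buffer stays uninitialized, and if the platform's strcmp() compares word-sized chunks it can touch bytes beyond the NUL. Here is a minimal sketch of that pattern in plain C (not the Open MPI code itself; the buffer size and names are illustrative), which could be run under dbx with "check -all" to see whether the same rui report shows up outside of Open MPI:

/* rui_stack_sketch.c -- a minimal sketch, not the Open MPI code itself:
 * getcwd() writes only the path plus its NUL terminator into a large
 * stack buffer, so the tail of 'cwd' stays uninitialized.  If the
 * platform's strcmp() compares word-sized chunks, a byte-granular
 * checker such as dbx RTC may report a "read from uninitialized"
 * inside 'cwd' even though the comparison result is unaffected.
 * Buffer size and variable names are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char cwd[1024];                     /* stands in for OPAL_PATH_MAX */
    const char *pwd = getenv("PWD");    /* a second path string to compare against */

    if (NULL == getcwd(cwd, sizeof(cwd))) {
        perror("getcwd");
        return EXIT_FAILURE;
    }
    if (NULL == pwd) {
        pwd = cwd;
    }

    /* Only the bytes up to the NUL written by getcwd() are initialized;
     * everything after that in 'cwd' is not. */
    if (0 != strcmp(pwd, cwd)) {
        printf("PWD and getcwd() disagree: %s vs. %s\n", pwd, cwd);
    } else {
        printf("cwd: %s\n", cwd);
    }
    return EXIT_SUCCESS;
}

If dbx flags the same read in this standalone program, that would support treating the opal_getcwd report as a checker artifact rather than a real bug.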
On Jan 22, 2014, at 10:52 PM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

> Dear Ralph,
>
> the same problems occur without rankfiles.
>
> tyr fd1026 102 which mpicc
> /usr/local/openmpi-1.7.4_64_cc/bin/mpicc
>
> tyr fd1026 103 mpiexec --report-bindings -np 2 \
> -host tyr,sunpc1 hostname
>
> tyr fd1026 104 /opt/solstudio12.3/bin/sparcv9/dbx \
> /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message
> 7.9' in your .dbxrc
> Reading mpiexec
> Reading ld.so.1
> Reading libopen-rte.so.7.0.0
> Reading libopen-pal.so.6.1.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) check -all
> access checking - ON
> memuse checking - ON
> (dbx) run --report-bindings -np 2 -host tyr,sunpc1 hostname
> Running: mpiexec --report-bindings -np 2 -host tyr,sunpc1 hostname
> (process id 26792)
> Reading rtcapihook.so
> Reading libdl.so.1
> Reading rtcaudit.so
> Reading libmapmalloc.so.1
> Reading libgen.so.1
> Reading libc_psr.so.1
> Reading rtcboot.so
> Reading librtc.so
> Reading libmd_psr.so.1
> RTC: Enabling Error Checking...
> RTC: Using UltraSparc trap mechanism
> RTC: See `help rtc showmap' and `help rtc limitations' for details.
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0xffffffff7fffc85b
> which is 459 bytes above the current stack pointer
> Variable is 'cwd'
> t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
> 65 if (0 != strcmp(pwd, cwd)) {
> (dbx) quit
>
>
> tyr fd1026 105 ssh sunpc1
> ...
> sunpc1 fd1026 102 mpiexec --report-bindings -np 2 \
> -host tyr,sunpc1 hostname
>
> sunpc1 fd1026 103 /opt/solstudio12.3/bin/amd64/dbx \
> /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message
> 7.9' in your .dbxrc
> Reading mpiexec
> Reading ld.so.1
> Reading libopen-rte.so.7.0.0
> Reading libopen-pal.so.6.1.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) check -all
> access checking - ON
> memuse checking - ON
> (dbx) run --report-bindings -np 2 -host tyr,sunpc1 hostname
> Running: mpiexec --report-bindings -np 2 -host tyr,sunpc1 hostname
> (process id 18806)
> Reading rtcapihook.so
> Reading libdl.so.1
> Reading rtcaudit.so
> Reading libmapmalloc.so.1
> Reading libgen.so.1
> Reading rtcboot.so
> Reading librtc.so
> RTC: Enabling Error Checking...
> RTC: Running program...
> Reading disasm.so
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0x436d57
> which is 15 bytes into a heap block of size 16 bytes at 0x436d48
> This block was allocated from:
> [1] vasprintf() at 0xfffffd7fdc9b335a
> [2] asprintf() at 0xfffffd7fdc9b3452
> [3] opal_output_init() at line 184 in "output.c"
> [4] do_open() at line 548 in "output.c"
> [5] opal_output_open() at line 219 in "output.c"
> [6] opal_malloc_init() at line 68 in "malloc.c"
> [7] opal_init_util() at line 250 in "opal_init.c"
> [8] orterun() at line 658 in "orterun.c"
>
> t@1 (l@1) stopped in do_open at line 638 in file "output.c"
> 638 info[i].ldi_prefix = strdup(lds->lds_prefix);
> (dbx) run --report-bindings -np 2 -host sunpc0,sunpc1 hostname
> Running: mpiexec --report-bindings -np 2 -host sunpc0,sunpc1 hostname
> (process id 18857)
> RTC: Enabling Error Checking...
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0x436d57
> which is 15 bytes into a heap block of size 16 bytes at 0x436d48
> This block was allocated from:
> [1] vasprintf() at 0xfffffd7fdc9b335a
> [2] asprintf() at 0xfffffd7fdc9b3452
> [3] opal_output_init() at line 184 in "output.c"
> [4] do_open() at line 548 in "output.c"
> [5] opal_output_open() at line 219 in "output.c"
> [6] opal_malloc_init() at line 68 in "malloc.c"
> [7] opal_init_util() at line 250 in "opal_init.c"
> [8] orterun() at line 658 in "orterun.c"
>
> t@1 (l@1) stopped in do_open at line 638 in file "output.c"
> 638 info[i].ldi_prefix = strdup(lds->lds_prefix);
> (dbx) run --report-bindings -np 2 -host linpc1,sunpc1 hostname
> Running: mpiexec --report-bindings -np 2 -host linpc1,sunpc1 hostname
> (process id 18868)
> RTC: Enabling Error Checking...
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0x436d57
> which is 15 bytes into a heap block of size 16 bytes at 0x436d48
> This block was allocated from:
> [1] vasprintf() at 0xfffffd7fdc9b335a
> [2] asprintf() at 0xfffffd7fdc9b3452
> [3] opal_output_init() at line 184 in "output.c"
> [4] do_open() at line 548 in "output.c"
> [5] opal_output_open() at line 219 in "output.c"
> [6] opal_malloc_init() at line 68 in "malloc.c"
> [7] opal_init_util() at line 250 in "opal_init.c"
> [8] orterun() at line 658 in "orterun.c"
>
> t@1 (l@1) stopped in do_open at line 638 in file "output.c"
> 638 info[i].ldi_prefix = strdup(lds->lds_prefix);
> (dbx) quit
> sunpc1 fd1026 104 exit
> logout
> tyr fd1026 106
>
>
> Do you need anything else?
>
>
> Kind regards
>
> Siegmar
>
>
>> Hard to know how to address all that, Siegmar, but I'll give it
>> a shot. See below.
>>
>> On Jan 22, 2014, at 5:34 AM, Siegmar Gross
>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>
>>> Hi,
>>>
>>> yesterday I installed openmpi-1.7.4rc2r30323 on our machines
>>> ("Solaris 10 x86_64", "Solaris 10 Sparc", and "openSUSE Linux
>>> 12.1 x86_64" with Sun C 5.12). My rankfile "rf_linpc_sunpc_tyr"
>>> contains the following lines.
>>>
>>> rank 0=linpc0 slot=0:0-1;1:0-1
>>> rank 1=linpc1 slot=0:0-1
>>> rank 2=sunpc1 slot=1:0
>>> rank 3=tyr slot=1:0
>>>
>>> I get no output when I run the following command.
>>>
>>> mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
>>>
>>> "dbx" reports the following problem.
>>>
>>> /opt/solstudio12.3/bin/sparcv9/dbx \
>>> /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
>>> For information about new features see `help changes'
>>> To remove this message, put `dbxenv suppress_startup_message
>>> 7.9' in your .dbxrc
>>> Reading mpiexec
>>> Reading ld.so.1
>>> ...
>>> Reading libmd.so.1
>>> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
>>> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
>>> (process id 22337)
>>> Reading libc_psr.so.1
>>> ...
>>> Reading mca_dfs_test.so
>>>
>>> execution completed, exit code is 1
>>> (dbx) check -all
>>> access checking - ON
>>> memuse checking - ON
>>> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
>>> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
>>> (process id 22344)
>>> Reading rtcapihook.so
>>> ...
>>> RTC: Running program...
>>> Read from uninitialized (rui) on thread 1:
>>> Attempting to read 1 byte at address 0xffffffff7fffbf8b
>>> which is 459 bytes above the current stack pointer
>>> Variable is 'cwd'
>>> t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
>>> 65 if (0 != strcmp(pwd, cwd)) {
>>> (dbx) quit
>>>
>>
>> This looks like a bogus issue to me. Are you able to run something
>> *without* a rankfile? In other words, is it the rankfile operation that
>> is causing a problem, or are you unable to run anything on Sparc?
>>
>>>
>>>
>>>
>>> Rankfiles work "fine" on x86_64 architectures. Contents of my rankfile:
>>>
>>> rank 0=linpc1 slot=0:0-1;1:0-1
>>> rank 1=sunpc1 slot=0:0-1
>>> rank 2=sunpc1 slot=1:0
>>> rank 3=sunpc1 slot=1:1
>>>
>>>
>>> mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
>>> [sunpc1:13489] MCW rank 1 bound to socket 0[core 0[hwt 0]],
>>> socket 0[core 1[hwt 0]]: [B/B][./.]
>>> [sunpc1:13489] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
>>> [sunpc1:13489] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
>>> sunpc1
>>> sunpc1
>>> sunpc1
>>> [linpc1:29997] MCW rank 0 is not bound (or bound to all available
>>> processors)
>>> linpc1
>>>
>>>
>>> Unfortunately, "dbx" nevertheless reports a problem.
>>>
>>> /opt/solstudio12.3/bin/amd64/dbx \
>>> /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
>>> For information about new features see `help changes'
>>> To remove this message, put `dbxenv suppress_startup_message 7.9'
>>> in your .dbxrc
>>> Reading mpiexec
>>> Reading ld.so.1
>>> ...
>>> Reading libmd.so.1
>>> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
>>> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
>>> (process id 18330)
>>> Reading mca_shmem_mmap.so
>>> ...
>>> Reading mca_dfs_test.so
>>> [sunpc1:18330] MCW rank 1 bound to socket 0[core 0[hwt 0]],
>>> socket 0[core 1[hwt 0]]: [B/B][./.]
>>> [sunpc1:18330] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
>>> [sunpc1:18330] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
>>> sunpc1
>>> sunpc1
>>> sunpc1
>>> [linpc1:30148] MCW rank 0 is not bound (or bound to all available
>>> processors)
>>> linpc1
>>>
>>> execution completed, exit code is 0
>>> (dbx) check -all
>>> access checking - ON
>>> memuse checking - ON
>>> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
>>> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
>>> (process id 18340)
>>> Reading rtcapihook.so
>>> ...
>>>
>>> RTC: Running program...
>>> Reading disasm.so
>>> Read from uninitialized (rui) on thread 1:
>>> Attempting to read 1 byte at address 0x436d57
>>> which is 15 bytes into a heap block of size 16 bytes at 0x436d48
>>> This block was allocated from:
>>> [1] vasprintf() at 0xfffffd7fdc9b335a
>>> [2] asprintf() at 0xfffffd7fdc9b3452
>>> [3] opal_output_init() at line 184 in "output.c"
>>> [4] do_open() at line 548 in "output.c"
>>> [5] opal_output_open() at line 219 in "output.c"
>>> [6] opal_malloc_init() at line 68 in "malloc.c"
>>> [7] opal_init_util() at line 250 in "opal_init.c"
>>> [8] orterun() at line 658 in "orterun.c"
>>>
>>> t@1 (l@1) stopped in do_open at line 638 in file "output.c"
>>> 638 info[i].ldi_prefix = strdup(lds->lds_prefix);
>>> (dbx)
>>>
>>>
>>
>> Again, I think dbx is just getting lost.
>>
>>>
>>>
>>>
>>> I can also manually bind threads on our Sun M4000 server (two quad-core
>>> Sparc VII processors with two hwthreads each).
>>>
>>> mpiexec --report-bindings -np 4 --bind-to hwthread hostname
>>> [rs0.informatik.hs-fulda.de:09531] MCW rank 1 bound to
>>> socket 0[core 1[hwt 0]]: [../B./../..][../../../..]
>>> [rs0.informatik.hs-fulda.de:09531] MCW rank 2 bound to
>>> socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
>>> [rs0.informatik.hs-fulda.de:09531] MCW rank 3 bound to
>>> socket 1[core 5[hwt 0]]: [../../../..][../B./../..]
>>> [rs0.informatik.hs-fulda.de:09531] MCW rank 0 bound to
>>> socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>>
>>>
>>> It doesn't work with cores. I know that it wasn't possible last
>>> summer and it seems that it is still not possible now.
>>>
>>> mpiexec --report-bindings -np 4 --bind-to core hostname
>>> -----------------------------------------------------------------------
>>> Open MPI tried to bind a new process, but something went wrong. The
>>> process was killed without launching the target application. Your job
>>> will now abort.
>>>
>>> Local host: rs0
>>> Application name: /usr/local/bin/hostname
>>> Error message: hwloc indicates cpu binding cannot be enforced
>>> Location:
>>> ../../../../../openmpi-1.9a1r30290/orte/mca/odls/default/odls_default_module.c:500
>>> -----------------------------------------------------------------------
>>> 4 total processes failed to start
>>>
>>>
>>>
>>> Is it possible to specify hwthreads in a rankfile, so that I can
>>> use a rankfile for these machines?
>>
>> Possible - yes. Will it happen in the immediate future - no, I'm afraid
>> I'm swamped right now. However, I'll make a note of it for the future.
>>
>>>
>>> I get the expected output if I use two M4000 servers, although the
>>> above-mentioned error still exists.
>>>
>>>
>>> /opt/solstudio12.3/bin/sparcv9/dbx \
>>> /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
>>> For information about new features see `help changes'
>>> To remove this message, put `dbxenv suppress_startup_message 7.9'
>>> in your .dbxrc
>>> Reading mpiexec
>>> Reading ld.so.1
>>> ...
>>> Reading libmd.so.1
>>> (dbx) run --report-bindings --host rs0,rs1 -np 4 \
>>> --bind-to hwthread hostname
>>> Running: mpiexec --report-bindings --host rs0,rs1 -np 4
>>> --bind-to hwthread hostname
>>> (process id 9599)
>>> Reading libc_psr.so.1
>>> ...
>>> Reading mca_dfs_test.so
>>> [rs0.informatik.hs-fulda.de:09599] MCW rank 1 bound to
>>> socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
>>> [rs0.informatik.hs-fulda.de:09599] MCW rank 0 bound to
>>> socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>> rs1.informatik.hs-fulda.de
>>> [rs1.informatik.hs-fulda.de:13398] MCW rank 2 bound to
>>> socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
>>> [rs1.informatik.hs-fulda.de:13398] MCW rank 3 bound to
>>> socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
>>> rs1.informatik.hs-fulda.de
>>>
>>> execution completed, exit code is 0
>>> (dbx) check -all
>>> access checking - ON
>>> memuse checking - ON
>>> (dbx) run --report-bindings --host rs0,rs1 -np 4 \
>>> --bind-to hwthread hostname
>>> Running: mpiexec --report-bindings --host rs0,rs1 -np 4
>>> --bind-to hwthread hostname
>>> (process id 9607)
>>> Reading rtcapihook.so
>>> ...
>>> RTC: Running program...
>>> Read from uninitialized (rui) on thread 1:
>>> Attempting to read 1 byte at address 0xffffffff7fffc80b
>>> which is 459 bytes above the current stack pointer
>>> Variable is 'cwd'
>>> dbx: warning: can't find file
>>> ".../openmpi-1.7.4rc2r30323-SunOS.sparc.64_cc/opal/util/../../../
>>> openmpi-1.7.4rc2r30323/opal/util/opal_getcwd.c"
>>> dbx: warning: see `help finding-files'
>>> t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
>>> (dbx)
>>>
>>>
>>> Our M4000 server has no access to the source code, so dbx couldn't
>>> find the file. Nevertheless, it is the same error message as above.
>>> Could someone please look into this problem? Thank you very much in
>>> advance for any help.
>>>
>>>
>>> Kind regards
>>>
>>> Siegmar
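A similar note on the do_open() reports above (the 1-byte read at offset 15 of a 16-byte heap block allocated by vasprintf()/asprintf()): asprintf() initializes only strlen()+1 bytes of the block it returns, and if the allocator rounds the block up while strdup()'s internal strlen()/copy reads in word-sized chunks, a checker can flag a trailing never-written byte even though the duplicated string is correct. A minimal sketch of that pattern (again plain C, not the Open MPI sources; the format string and values are made up for illustration):

/* rui_heap_sketch.c -- a minimal sketch, not the Open MPI sources:
 * a short prefix string is built with asprintf() and later duplicated
 * with strdup(), as in the do_open() frame of the dbx traces.  asprintf()
 * initializes only strlen(prefix)+1 bytes of the heap block it returns;
 * if the allocator rounds the block up (16 bytes in the traces) and
 * strdup()'s internal strlen()/copy reads word-sized chunks, a checker
 * like dbx RTC can flag a 1-byte read near the end of the block even
 * though the duplicated string is correct.  The format string and the
 * values below are made up for illustration. */
#define _GNU_SOURCE             /* asprintf() is an extension; this is for glibc */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *prefix = NULL;
    char *copy = NULL;

    /* Writes 12 bytes ("[host:123] " plus NUL) into a block that the
     * allocator is free to make larger, leaving trailing bytes unwritten. */
    if (asprintf(&prefix, "[%s:%d] ", "host", 123) < 0) {
        return EXIT_FAILURE;
    }

    /* The read that dbx reports happens inside strdup(). */
    copy = strdup(prefix);
    if (NULL != copy) {
        printf("%sdemo message\n", copy);
        free(copy);
    }
    free(prefix);
    return EXIT_SUCCESS;
}

Seeing the same rui report from a standalone program like this under "check -all" would suggest the output.c reports are benign checker noise rather than a bug in opal_output.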