[OMPI users] problem with rankfile in openmpi-1.7.4rc2r30323

2014-01-22 Thread Siegmar Gross
Hi,

yesterday I installed openmpi-1.7.4rc2r30323 on our machines
("Solaris 10 x86_64", "Solaris 10 Sparc", and "openSUSE Linux
12.1 x86_64" with Sun C 5.12). My rankfile "rf_linpc_sunpc_tyr"
contains the following lines.

rank 0=linpc0 slot=0:0-1;1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=tyr slot=1:0
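(For clarity, each slot field is a socket:core range, as I read the rankfile syntax; annotated, the format is:)

```
# rank <N>=<host> slot=<socket>:<core-range>[;<socket>:<core-range>]
rank 0=linpc0 slot=0:0-1;1:0-1   # socket 0 cores 0-1, plus socket 1 cores 0-1
```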

I get no output when I run the following command.

mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname

"dbx" reports the following problem.

/opt/solstudio12.3/bin/sparcv9/dbx \
  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message
  7.9' in your .dbxrc
Reading mpiexec
Reading ld.so.1
...
Reading libmd.so.1
(dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname 
(process id 22337)
Reading libc_psr.so.1
...
Reading mca_dfs_test.so

execution completed, exit code is 1
(dbx) check -all
access checking - ON
memuse checking - ON
(dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname 
(process id 22344)
Reading rtcapihook.so
...
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0x7fffbf8b
which is 459 bytes above the current stack pointer
Variable is 'cwd'
t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
   65   if (0 != strcmp(pwd, cwd)) {
(dbx) quit




Rankfiles work "fine" on x86_64 architectures. Here are the contents of my rankfile:

rank 0=linpc1 slot=0:0-1;1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1


mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
[sunpc1:13489] MCW rank 1 bound to socket 0[core 0[hwt 0]],
  socket 0[core 1[hwt 0]]: [B/B][./.]
[sunpc1:13489] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
[sunpc1:13489] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
sunpc1
sunpc1
sunpc1
[linpc1:29997] MCW rank 0 is not bound (or bound to all available
  processors)
linpc1


Unfortunately, "dbx" nevertheless reports a problem.

/opt/solstudio12.3/bin/amd64/dbx \
  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9'
  in your .dbxrc
Reading mpiexec
Reading ld.so.1
...
Reading libmd.so.1
(dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname 
(process id 18330)
Reading mca_shmem_mmap.so
...
Reading mca_dfs_test.so
[sunpc1:18330] MCW rank 1 bound to socket 0[core 0[hwt 0]],
  socket 0[core 1[hwt 0]]: [B/B][./.]
[sunpc1:18330] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
[sunpc1:18330] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
sunpc1
sunpc1
sunpc1
[linpc1:30148] MCW rank 0 is not bound (or bound to all available
  processors)
linpc1

execution completed, exit code is 0
(dbx) check -all
access checking - ON
memuse checking - ON
(dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname 
(process id 18340)
Reading rtcapihook.so
...

RTC: Running program...
Reading disasm.so
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0x436d57
which is 15 bytes into a heap block of size 16 bytes at 0x436d48
This block was allocated from:
[1] vasprintf() at 0xfd7fdc9b335a 
[2] asprintf() at 0xfd7fdc9b3452 
[3] opal_output_init() at line 184 in "output.c"
[4] do_open() at line 548 in "output.c"
[5] opal_output_open() at line 219 in "output.c"
[6] opal_malloc_init() at line 68 in "malloc.c"
[7] opal_init_util() at line 250 in "opal_init.c"
[8] orterun() at line 658 in "orterun.c"

t@1 (l@1) stopped in do_open at line 638 in file "output.c"
  638   info[i].ldi_prefix = strdup(lds->lds_prefix);
(dbx) 





I can also manually bind threads on our Sun M4000 server (two quad-core
Sparc VII processors with two hwthreads each).

mpiexec --report-bindings -np 4 --bind-to hwthread hostname
[rs0.informatik.hs-fulda.de:09531] MCW rank 1 bound to 
  socket 0[core 1[hwt 0]]: [../B./../..][../../../..]
[rs0.informatik.hs-fulda.de:09531] MCW rank 2 bound to 
  socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
[rs0.informatik.hs-fulda.de:09531] MCW rank 3 bound to 
  socket 1[core 5[hwt 0]]: [../../../..][../B./../..]
[rs0.informatik.hs-fulda.de:09531] MCW rank 0 bound to 
  socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
rs0.informatik.hs-fulda.de
rs0.informatik.hs-fulda.de
rs0.informatik.hs-fulda.de
rs0.informatik.hs-fulda.de


Binding to cores doesn't work, though. I know it wasn't possible last
summer, and it seems it is still not possible now.

mpiexec --report-bindings -np 4 --bind-to core hostname
--

Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

2014-01-22 Thread Fischer, Greg A.
Well, this is a little strange. The hanging behavior is gone, but I'm getting a 
segfault now. The outputs of "hello_c.c" and "ring_c.c" are attached. 

I'm getting a segfault with the Fortran test, also. I'm afraid I may have 
polluted the experiment by removing the target openmpi-1.6.5 installation 
directory yesterday. To produce the attached outputs, I just went back and did 
"make install" in the openmpi-1.6.5 build directory. I've re-set the 
environment variables as they were a few days ago by sourcing the same bash 
script. Perhaps I forgot something, or something on the system changed? 
Regardless, LD_LIBRARY_PATH and PATH are set correctly, and the aberrant 
behavior persists.

The reason for deleting the openmpi-1.6.5 installation was that I went back and 
installed openmpi-1.4.3 and the problem (mostly) went away. Openmpi-1.4.3 can 
run the simple tests without issue, but on my "real" program, I'm getting 
symbol lookup errors: 

mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int

Perhaps that's a separate thread.

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>Squyres (jsquyres)
>Sent: Tuesday, January 21, 2014 3:57 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
>consumes all system resources
>
>Just for giggles, can you repeat the same test but with hello_c.c and ring_c.c?
>I.e., let's get the Fortran out of the way and use just the base C bindings, 
>and
>see what happens.
>
>
>On Jan 19, 2014, at 6:18 PM, "Fischer, Greg A." 
>wrote:
>
>> I just tried running "hello_f90.f90" and see the same behavior: 100% CPU
>usage, gradually increasing memory consumption, and failure to get past
>mpi_finalize. LD_LIBRARY_PATH is set as:
>>
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib
>>
>> The installation target for this version of OpenMPI is:
>>
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5
>>
>> 1045
>> fischega@lxlogin2[/data/fischega/petsc_configure/mpi_test/simple]>
>> which mpirun
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin/mpir
>> un
>>
>> Perhaps something strange is happening with GCC? I've tried simple hello
>world C and Fortran programs, and they work normally.
>>
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph
>> Castain
>> Sent: Sunday, January 19, 2014 11:36 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize
>> and consumes all system resources
>>
>> The OFED warning about registration is something OMPI added at one point
>when we isolated the cause of jobs occasionally hanging, so you won't see
>that warning from other MPIs or earlier versions of OMPI (I forget exactly
>when we added it).
>>
>> The problem you describe doesn't sound like an OMPI issue - it sounds like
>you've got a memory corruption problem in the code. Have you tried running
>the examples in our example directory to confirm that the installation is
>good?
>>
>> Also, check to ensure that your LD_LIBRARY_PATH is correctly set to pickup
>the OMPI libs you installed - most Linux distros come with an older version,
>and that can cause problems if you inadvertently pick them up.
>>
>>
>> On Jan 19, 2014, at 5:51 AM, Fischer, Greg A. 
>wrote:
>>
>>
>> Hello,
>>
>> I have a simple, 1-process test case that gets stuck on the mpi_finalize 
>> call.
>The test case is a dead-simple calculation of pi - 50 lines of Fortran. The
>process gradually consumes more and more memory until the system
>becomes unresponsive and needs to be rebooted, unless the job is killed
>first.
>>
>> In the output, attached, I see the warning message about OpenFabrics
>being configured to only allow registering part of physical memory. I've tried
>to chase this down with my administrator to no avail yet. (I am aware of the
>relevant FAQ entry.)  A different installation of MPI on the same system,
>made with a different compiler, does not produce the OpenFabrics memory
>registration warning - which seems strange because I thought it was a system
>configuration issue independent of MPI. Also curious in the output is that LSF
>seems to think there are 7 processes and 11 threads associated with this job.
>>
>> The particulars of my configuration are attached and detailed below. Does
>anyone see anything potentially problematic?
>>
>> Thanks,
>> Greg
>>
>> OpenMPI Version: 1.6.5
>> Compiler: GCC 4.6.1
>> OS: SuSE Linux Enterprise Server 10, Patchlevel 2
>>
>> uname -a : Linux lxlogin2 2.6.16.60-0.21-smp #1 SMP Tue May 6 12:41:02
>> UTC 2008 x86_64 x86_64 x86_64 GNU/Linux
>>
>> LD_LIBRARY_PATH=/tools/casl_sles10/vera_clean/gcc-
>4.6.1/toolset/openmp
>> i-1.6.5/lib:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/
>> lib64:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/lib
>>
>> PATH=
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/python-2.7.6/bin:/tool
>> s/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi

Re: [OMPI users] problem with rankfile in openmpi-1.7.4rc2r30323

2014-01-22 Thread Ralph Castain
Hard to know how to address all that, Siegmar, but I'll give it a shot. See 
below.

On Jan 22, 2014, at 5:34 AM, Siegmar Gross 
 wrote:

> Hi,
> 
> yesterday I installed openmpi-1.7.4rc2r30323 on our machines
> ("Solaris 10 x86_64", "Solaris 10 Sparc", and "openSUSE Linux
> 12.1 x86_64" with Sun C 5.12). My rankfile "rf_linpc_sunpc_tyr"
> contains the following lines.
> 
> rank 0=linpc0 slot=0:0-1;1:0-1
> rank 1=linpc1 slot=0:0-1
> rank 2=sunpc1 slot=1:0
> rank 3=tyr slot=1:0
> 
> I get no output, when I run the following command.
> 
> mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> 
> "dbx" reports the following problem.
> 
> /opt/solstudio12.3/bin/sparcv9/dbx \
>  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message
>  7.9' in your .dbxrc
> Reading mpiexec
> Reading ld.so.1
> ...
> Reading libmd.so.1
> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname 
> (process id 22337)
> Reading libc_psr.so.1
> ...
> Reading mca_dfs_test.so
> 
> execution completed, exit code is 1
> (dbx) check -all
> access checking - ON
> memuse checking - ON
> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname 
> (process id 22344)
> Reading rtcapihook.so
> ...
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0x7fffbf8b
>which is 459 bytes above the current stack pointer
> Variable is 'cwd'
> t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
>   65   if (0 != strcmp(pwd, cwd)) {
> (dbx) quit
> 

This looks like a bogus issue to me. Are you able to run something *without* a 
rankfile? In other words, is it the rankfile operation that is causing the 
problem, or are you unable to run anything on Sparc?

> 
> 
> 
> Rankfiles work "fine" on x86_64 architectures. Contents of my rankfile.
> 
> rank 0=linpc1 slot=0:0-1;1:0-1
> rank 1=sunpc1 slot=0:0-1
> rank 2=sunpc1 slot=1:0
> rank 3=sunpc1 slot=1:1
> 
> 
> mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> [sunpc1:13489] MCW rank 1 bound to socket 0[core 0[hwt 0]],
>  socket 0[core 1[hwt 0]]: [B/B][./.]
> [sunpc1:13489] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> [sunpc1:13489] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
> sunpc1
> sunpc1
> sunpc1
> [linpc1:29997] MCW rank 0 is not bound (or bound to all available
>  processors)
> linpc1
> 
> 
> Unfortunately "dbx" reports nevertheless a problem.
> 
> /opt/solstudio12.3/bin/amd64/dbx \
>  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9'
>  in your .dbxrc
> Reading mpiexec
> Reading ld.so.1
> ...
> Reading libmd.so.1
> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname 
> (process id 18330)
> Reading mca_shmem_mmap.so
> ...
> Reading mca_dfs_test.so
> [sunpc1:18330] MCW rank 1 bound to socket 0[core 0[hwt 0]],
>  socket 0[core 1[hwt 0]]: [B/B][./.]
> [sunpc1:18330] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> [sunpc1:18330] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
> sunpc1
> sunpc1
> sunpc1
> [linpc1:30148] MCW rank 0 is not bound (or bound to all available
>  processors)
> linpc1
> 
> execution completed, exit code is 0
> (dbx) check -all
> access checking - ON
> memuse checking - ON
> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname 
> (process id 18340)
> Reading rtcapihook.so
> ...
> 
> RTC: Running program...
> Reading disasm.so
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0x436d57
>which is 15 bytes into a heap block of size 16 bytes at 0x436d48
> This block was allocated from:
>[1] vasprintf() at 0xfd7fdc9b335a 
>[2] asprintf() at 0xfd7fdc9b3452 
>[3] opal_output_init() at line 184 in "output.c"
>[4] do_open() at line 548 in "output.c"
>[5] opal_output_open() at line 219 in "output.c"
>[6] opal_malloc_init() at line 68 in "malloc.c"
>[7] opal_init_util() at line 250 in "opal_init.c"
>[8] orterun() at line 658 in "orterun.c"
> 
> t@1 (l@1) stopped in do_open at line 638 in file "output.c"
>  638   info[i].ldi_prefix = strdup(lds->lds_prefix);
> (dbx) 
> 
> 

Again, I think dbx is just getting lost.

> 
> 
> 
> I can also manually bind threads on our Sun M4000 server (two quad-core
> Sparc VII processors with two hwthreads each).
> 
> mpiexec --report-bindings -np 4 --bind-to hwthread hostname
> [rs0.informatik.hs-fulda.de:09531] MCW rank 1 bound to 
>  socket 0[core 1[hwt 0]]: [../B./

[OMPI users] default num_procs of round_robin_mapper with cpus-per-proc option

2014-01-22 Thread tmishima

Hi Ralph, I want to ask you one more thing about the default setting of
num_procs when we don't specify the -np option and we set cpus-per-proc > 1.

In this case, the round_robin_mapper sets num_procs = num_slots as below:

rmaps_rr.c:
130    if (0 == app->num_procs) {
131        /* set the num_procs to equal the number of slots on these mapped nodes */
132        app->num_procs = num_slots;
133    }

However, because cpus_per_rank > 1, this num_procs will be rejected by the
check at line 61 of rmaps_rr_mappers.c, shown below, unless we turn on the
oversubscribe directive.

rmaps_rr_mappers.c:
61    if (num_slots < ((int)app->num_procs * orte_rmaps_base.cpus_per_rank)) {
62        if (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(jdata->map->mapping)) {
63            orte_show_help("help-orte-rmaps-base.txt", "orte-rmaps-base:alloc-error",
64                           true, app->num_procs, app->app);
65            return ORTE_ERR_SILENT;
66        }
67    }

Therefore, I think the default num_procs should be equal to num_slots
divided by cpus_per_rank:

   app->num_procs = num_slots / orte_rmaps_base.cpus_per_rank;

This would be more convenient for most people who want to use the
-cpus-per-proc option. I have already confirmed that it works well. Please
consider applying this fix to 1.7.4.

Regards,
Tetsuya Mishima



Re: [OMPI users] default num_procs of round_robin_mapper with cpus-per-proc option

2014-01-22 Thread Ralph Castain
Seems like a reasonable, minimal-risk request - will do.

On Jan 22, 2014, at 4:28 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> Hi Ralph, I want to ask you one more thing about default setting of
> num_procs
> when we don't specify the -np option and we set the cpus-per-proc > 1.
> 
> In this case, the round_robin_mapper sets num_procs = num_slots as below:
> 
> rmaps_rr.c:
> 130if (0 == app->num_procs) {
> 131/* set the num_procs to equal the number of slots on these
> mapped nodes */
> 132app->num_procs = num_slots;
> 133}
> 
> However, because of cpus_per_rank > 1, this num_procs will be refused at
> the
> line 61 in rmaps_rr_mappers.c as below, unless we switch on the
> oversubscribe
> directive.
> 
> rmaps_rr_mappers.c:
> 61if (num_slots < ((int)app->num_procs *
> orte_rmaps_base.cpus_per_rank)) {
> 62if (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE
> (jdata->map->mapping)) {
> 63orte_show_help("help-orte-rmaps-base.txt",
> "orte-rmaps-base:alloc-error",
> 64   true, app->num_procs, app->app);
> 65return ORTE_ERR_SILENT;
> 66}
> 67}
> 
> Therefore, I think the default num_procs should be equal to the number of
> num_slots divided by cpus/rank:
> 
>   app->num_procs = num_slots / orte_rmaps_base.cpus_per_rank;
> 
> This would be more convinient for most of people who want to use the
> -cpus-per-proc option. I already confirmed it worked well. Please consider
> to apply this fix to 1.7.4.
> 
> Regards,
> Tetsuya Mishima
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] default num_procs of round_robin_mapper with cpus-per-proc option

2014-01-22 Thread tmishima


Thanks, Ralph.

I have one more question. I'm sorry to ask you many things ...

Could you tell me the difference between "map-by slot" and "map-by core"?
From my understanding, slot is a synonym of core, but their behaviors
under openmpi-1.7.4rc2 with the cpus-per-proc option are quite different,
as shown below. I tried browsing the source code, but I have not been able
to make it clear so far.

Regards,
Tetsuya Mishima

[un-managed environment] (node05 and node06 have 8 cores each)

[mishima@manage work]$ cat pbs_hosts
node05
node05
node05
node05
node05
node05
node05
node05
node06
node06
node06
node06
node06
node06
node06
node06
[mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by slot ~/mis/openmpi/demos/myprog
[node05.cluster:23949] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
[node05.cluster:23949] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
[node06.cluster:22139] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
[node06.cluster:22139] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 3 of 4
Hello world from process 2 of 4
[mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by core ~/mis/openmpi/demos/myprog
[node05.cluster:23985] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
[node05.cluster:23985] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
[node06.cluster:22175] MCW rank 3 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
[node06.cluster:22175] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
Hello world from process 2 of 4
Hello world from process 3 of 4
Hello world from process 0 of 4
Hello world from process 1 of 4

(Note: I see the same behavior in a managed environment under Torque.)

> Seems like a reasonable, minimal risk request - will do
>
> On Jan 22, 2014, at 4:28 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> > Hi Ralph, I want to ask you one more thing about default setting of
> > num_procs
> > when we don't specify the -np option and we set the cpus-per-proc > 1.
> >
> > In this case, the round_robin_mapper sets num_procs = num_slots as
below:
> >
> > rmaps_rr.c:
> > 130if (0 == app->num_procs) {
> > 131/* set the num_procs to equal the number of slots on
these
> > mapped nodes */
> > 132app->num_procs = num_slots;
> > 133}
> >
> > However, because of cpus_per_rank > 1, this num_procs will be refused
at
> > the
> > line 61 in rmaps_rr_mappers.c as below, unless we switch on the
> > oversubscribe
> > directive.
> >
> > rmaps_rr_mappers.c:
> > 61if (num_slots < ((int)app->num_procs *
> > orte_rmaps_base.cpus_per_rank)) {
> > 62if (ORTE_MAPPING_NO_OVERSUBSCRIBE &
ORTE_GET_MAPPING_DIRECTIVE
> > (jdata->map->mapping)) {
> > 63orte_show_help("help-orte-rmaps-base.txt",
> > "orte-rmaps-base:alloc-error",
> > 64   true, app->num_procs, app->app);
> > 65return ORTE_ERR_SILENT;
> > 66}
> > 67}
> >
> > Therefore, I think the default num_procs should be equal to the number
of
> > num_slots divided by cpus/rank:
> >
> >   app->num_procs = num_slots / orte_rmaps_base.cpus_per_rank;
> >
> > This would be more convinient for most of people who want to use the
> > -cpus-per-proc option. I already confirmed it worked well. Please
consider
> > to apply this fix to 1.7.4.
> >
> > Regards,
> > Tetsuya Mishima
> >