Re: [OMPI users] runtime errors for openmpi-v2.x-dev-1280-gc110ae8

2016-04-29 Thread Ralph Castain
Ouch - thanks for finding this, Gilles! I’ll take care of it on Friday.

> On Apr 28, 2016, at 6:38 PM, Gilles Gouaillardet wrote:
> 
> Siegmar,
> 
> in pmix_bfrop_pack_app,
> app->argc
> must be replaced with
> app[i].argc
> 
> I will PR to pmix, ompi and ompi-release when I am back at work on Monday
> 
> Cheers,
> 
> Gilles
> 
> On Thursday, April 28, 2016, Gilles Gouaillardet wrote:
> Siegmar,
> 
> 
> 
> can you please also post the source of spawn_slave ?
> 
> 
> 
> Cheers,
> 
> Gilles
> 
> 
> On 4/28/2016 1:17 AM, Siegmar Gross wrote:
>> Hi Gilles, 
>> 
>> it is not necessary to have a heterogeneous environment to reproduce 
>> the error as you can see below. All machines are 64 bit. 
>> 
>> tyr spawn 119 ompi_info | grep -e "OPAL repo revision" -e "C compiler 
>> absolute" 
>>   OPAL repo revision: v2.x-dev-1290-gbd0e4e1 
>>  C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc 
>> tyr spawn 120 uname -a 
>> SunOS tyr.informatik.hs-fulda.de 5.10 Generic_150400-11 sun4u sparc SUNW,A70 Solaris
>> tyr spawn 121 mpiexec -np 1 --host tyr,tyr,tyr,tyr spawn_multiple_master 
>> 
>> Parent process 0 running on tyr.informatik.hs-fulda.de
>>   I create 3 slave processes. 
>> 
>> [tyr.informatik.hs-fulda.de:27286] PMIX ERROR: UNPACK-PAST-END in file
>> ../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server_ops.c
>> at line 829
>> [tyr.informatik.hs-fulda.de:27286] PMIX ERROR: UNPACK-PAST-END in file
>> ../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c
>> at line 2176
>> [tyr:27288] *** An error occurred in MPI_Comm_spawn_multiple 
>> [tyr:27288] *** reported by process [3434086401,0] 
>> [tyr:27288] *** on communicator MPI_COMM_WORLD 
>> [tyr:27288] *** MPI_ERR_SPAWN: could not spawn processes 
>> [tyr:27288] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [tyr:27288] *** and potentially your MPI job)
>> tyr spawn 122 
>> 
>> 
>> 
>> 
>> 
>> 
>> sunpc1 fd1026 105 ompi_info | grep -e "OPAL repo revision" -e "C compiler 
>> absolute" 
>>   OPAL repo revision: v2.x-dev-1290-gbd0e4e1 
>>  C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc 
>> sunpc1 fd1026 106 uname -a 
>> SunOS sunpc1 5.10 Generic_147441-21 i86pc i386 i86pc Solaris 
>> sunpc1 fd1026 107 mpiexec -np 1 --host sunpc1,sunpc1,sunpc1,sunpc1 
>> spawn_multiple_master 
>> 
>> Parent process 0 running on sunpc1 
>>   I create 3 slave processes. 
>> 
>> [sunpc1:00368] PMIX ERROR: UNPACK-PAST-END in file 
>> ../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server_ops.c
>>  at line 829 
>> [sunpc1:00368] PMIX ERROR: UNPACK-PAST-END in file 
>> ../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c
>>  at line 2176 
>> [sunpc1:370] *** An error occurred in MPI_Comm_spawn_multiple 
>> [sunpc1:370] *** reported by process [43909121,0] 
>> [sunpc1:370] *** on communicator MPI_COMM_WORLD 
>> [sunpc1:370] *** MPI_ERR_SPAWN: could not spawn processes 
>> [sunpc1:370] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [sunpc1:370] *** and potentially your MPI job)
>> sunpc1 fd1026 108 
>> 
>> 
>> 
>> 
>> 
>> linpc1 fd1026 105 ompi_info | grep -e "OPAL repo revision" -e "C compiler 
>> absolute" 
>>   OPAL repo revision: v2.x-dev-1290-gbd0e4e1 
>>  C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc 
>> linpc1 fd1026 106 uname -a 
>> Linux linpc1 3.1.10-1.29-desktop #1 SMP PREEMPT Fri May 31 20:10:04 UTC 2013 
>> (2529847) x86_64 x86_64 x86_64 GNU/Linux 
>> linpc1 fd1026 107 mpiexec -np 1 --host linpc1,linpc1,linpc1,linpc1 
>> spawn_multiple_master 
>> 
>> Parent process 0 running on linpc1 
>>   I create 3 slave processes. 
>> 
>> [linpc1:21502] PMIX ERROR: UNPACK-PAST-END in file 
>> ../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server_ops.c
>>  at line 829 
>> [linpc1:21502] PMIX ERROR: UNPACK-PAST-END in file 
>> ../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c
>>  at line 2176 
>> [linpc1:21507] *** An error occurred in MPI_Comm_spawn_multiple 
>> [linpc1:21507] *** reported by process [1005518849,0] 
>> [linpc1:21507] *** on communicator MPI_COMM_WORLD 
>> [linpc1:21507] *** MPI_ERR_SPAWN: could not spawn processes 
>> [linpc1:21507] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [linpc1:21507] *** and potentially your MPI job)
>> linpc1 fd1026 108 
>> 
>> 
>> I used the following configure command. 
>> 
>> ../openmpi-v2.x-dev-1290-gbd0e4e1/configure \ 
>>   --prefix=/usr/local/openmpi-2.0.0_64_gcc \ 
>>   --libdir=/usr/local/openmpi-2.0.0_6

Re: [OMPI users] runtime errors for openmpi-v2.x-dev-1280-gc110ae8

2016-04-29 Thread Siegmar Gross

Hi Gilles,

thank you very much for identifying the reason for the problem
and fixing it.

Have a nice weekend

Siegmar

Am 29.04.2016 um 03:38 schrieb Gilles Gouaillardet:

Siegmar,

in pmix_bfrop_pack_app,
app->argc
must be replaced with
app[i].argc

I will PR to pmix, ompi and ompi-release when I am back at work on Monday

Cheers,

Gilles

On Thursday, April 28, 2016, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Siegmar,


can you please also post the source of spawn_slave ?


Cheers,

Gilles


On 4/28/2016 1:17 AM, Siegmar Gross wrote:

Hi Gilles,

it is not necessary to have a heterogeneous environment to reproduce
the error as you can see below. All machines are 64 bit.

tyr spawn 119 ompi_info | grep -e "OPAL repo revision" -e "C compiler
absolute"
  OPAL repo revision: v2.x-dev-1290-gbd0e4e1
 C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc
tyr spawn 120 uname -a
SunOS tyr.informatik.hs-fulda.de 5.10 Generic_150400-11 sun4u sparc SUNW,A70 Solaris
tyr spawn 121 mpiexec -np 1 --host tyr,tyr,tyr,tyr spawn_multiple_master

Parent process 0 running on tyr.informatik.hs-fulda.de

  I create 3 slave processes.

[tyr.informatik.hs-fulda.de:27286] PMIX ERROR: UNPACK-PAST-END in file
../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server_ops.c
at line 829
[tyr.informatik.hs-fulda.de:27286] PMIX ERROR: UNPACK-PAST-END in file
../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c
at line 2176
[tyr:27288] *** An error occurred in MPI_Comm_spawn_multiple
[tyr:27288] *** reported by process [3434086401,0]
[tyr:27288] *** on communicator MPI_COMM_WORLD
[tyr:27288] *** MPI_ERR_SPAWN: could not spawn processes
[tyr:27288] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[tyr:27288] *** and potentially your MPI job)
tyr spawn 122






sunpc1 fd1026 105 ompi_info | grep -e "OPAL repo revision" -e "C
compiler absolute"
  OPAL repo revision: v2.x-dev-1290-gbd0e4e1
 C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc
sunpc1 fd1026 106 uname -a
SunOS sunpc1 5.10 Generic_147441-21 i86pc i386 i86pc Solaris
sunpc1 fd1026 107 mpiexec -np 1 --host sunpc1,sunpc1,sunpc1,sunpc1
spawn_multiple_master

Parent process 0 running on sunpc1
  I create 3 slave processes.

[sunpc1:00368] PMIX ERROR: UNPACK-PAST-END in file

../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server_ops.c
at line 829
[sunpc1:00368] PMIX ERROR: UNPACK-PAST-END in file

../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c
at line 2176
[sunpc1:370] *** An error occurred in MPI_Comm_spawn_multiple
[sunpc1:370] *** reported by process [43909121,0]
[sunpc1:370] *** on communicator MPI_COMM_WORLD
[sunpc1:370] *** MPI_ERR_SPAWN: could not spawn processes
[sunpc1:370] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[sunpc1:370] *** and potentially your MPI job)
sunpc1 fd1026 108





linpc1 fd1026 105 ompi_info | grep -e "OPAL repo revision" -e "C
compiler absolute"
  OPAL repo revision: v2.x-dev-1290-gbd0e4e1
 C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc
linpc1 fd1026 106 uname -a
Linux linpc1 3.1.10-1.29-desktop #1 SMP PREEMPT Fri May 31 20:10:04 UTC
2013 (2529847) x86_64 x86_64 x86_64 GNU/Linux
linpc1 fd1026 107 mpiexec -np 1 --host linpc1,linpc1,linpc1,linpc1
spawn_multiple_master

Parent process 0 running on linpc1
  I create 3 slave processes.

[linpc1:21502] PMIX ERROR: UNPACK-PAST-END in file

../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server_ops.c
at line 829
[linpc1:21502] PMIX ERROR: UNPACK-PAST-END in file

../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c
at line 2176
[linpc1:21507] *** An error occurred in MPI_Comm_spawn_multiple
[linpc1:21507] *** reported by process [1005518849,0]
[linpc1:21507] *** on communicator MPI_COMM_WORLD
[linpc1:21507] *** MPI_ERR_SPAWN: could not spawn processes
[linpc1:21507] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[linpc1:21507] *** and potentially your MPI job)
linpc1 fd1026 108


I used the following configure command.

../openmpi-v2.x-dev-1290-gbd0e4e1/configure \
  --prefix=/usr/local/openmpi-2.0.0_64_gcc \
  --libdir=/usr/local/openmpi-2.0.0_64_gcc/lib64 \
  --with-jdk-bindir=/usr/local/jdk1.8.0/bin \

Re: [OMPI users] runtime errors for openmpi-v2.x-dev-1280-gc110ae8

2016-04-29 Thread Ralph Castain
Hmmm…well, I may have to wait and let Gilles fix this. So far as I can see, the 
code in the current OMPI 2.x tarball (and upstream) is correct:

int pmix_bfrop_pack_app(pmix_buffer_t *buffer, const void *src,
                        int32_t num_vals, pmix_data_type_t type)
{
    pmix_app_t *app;
    int32_t i, j, nvals;
    int ret;

    app = (pmix_app_t *) src;

    for (i = 0; i < num_vals; ++i) {
        if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer, &app[i].cmd, 1, PMIX_STRING))) {
            return ret;
        }
        /* argv */
        if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_int(buffer, &app[i].argc, 1, PMIX_INT))) {
            return ret;
        }
        for (j=0; j < app->argc; j++) {
            if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer, &app[i].argv[j], 1, PMIX_STRING))) {
                return ret;
            }
        }
        /* env */
        nvals = pmix_argv_count(app[i].env);
        if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_int32(buffer, &nvals, 1, PMIX_INT32))) {
            return ret;
        }
        for (j=0; j < nvals; j++) {
            if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer, &app[i].env[j], 1, PMIX_STRING))) {
                return ret;
            }
        }
        /* maxprocs */
        if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_int(buffer, &app[i].maxprocs, 1, PMIX_INT))) {
            return ret;
        }
        /* info array */
        if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_sizet(buffer, &app[i].ninfo, 1, PMIX_SIZE))) {
            return ret;
        }
        if (0 < app[i].ninfo) {
            if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_info(buffer, app[i].info, app[i].ninfo, PMIX_INFO))) {
                return ret;
            }
        }
    }
    return PMIX_SUCCESS;
}

Siegmar: have you tried the latest release candidate?


> On Apr 28, 2016, at 11:08 PM, Siegmar Gross wrote:
> 
> Hi Gilles,
> 
> thank you very much for identifying the reason for the problem
> and fixing it.
> 
> Have a nice weekend
> 
> Siegmar
> 
> Am 29.04.2016 um 03:38 schrieb Gilles Gouaillardet:
>> Siegmar,
>> 
>> in pmix_bfrop_pack_app,
>> app->argc
>> must be replaced with
>> app[i].argc
>> 
>> I will PR to pmix, ompi and ompi-release when I am back at work on Monday
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Thursday, April 28, 2016, Gilles Gouaillardet wrote:
>> 
>>Siegmar,
>> 
>> 
>>can you please also post the source of spawn_slave ?
>> 
>> 
>>Cheers,
>> 
>>Gilles
>> 
>> 
>>On 4/28/2016 1:17 AM, Siegmar Gross wrote:
>>>Hi Gilles,
>>> 
>>>it is not necessary to have a heterogeneous environment to reproduce
>>>the error as you can see below. All machines are 64 bit.
>>> 
>>>tyr spawn 119 ompi_info | grep -e "OPAL repo revision" -e "C compiler
>>>absolute"
>>>  OPAL repo revision: v2.x-dev-1290-gbd0e4e1
>>> C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc
>>>tyr spawn 120 uname -a
>>>SunOS tyr.informatik.hs-fulda.de 5.10 Generic_150400-11 sun4u sparc SUNW,A70 Solaris
>>>tyr spawn 121 mpiexec -np 1 --host tyr,tyr,tyr,tyr spawn_multiple_master
>>> 
>>>Parent process 0 running on tyr.informatik.hs-fulda.de
>>>  I create 3 slave processes.
>>> 
>>>[tyr.informatik.hs-fulda.de:27286] PMIX ERROR: UNPACK-PAST-END in file
>>> ../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server_ops.c
>>>at line 829
>>>[tyr.informatik.hs-fulda.de:27286] PMIX ERROR: UNPACK-PAST-END in file
>>> ../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c
>>>at line 2176
>>>[tyr:27288] *** An error occurred in MPI_Comm_spawn_multiple
>>>[tyr:27288] *** reported by process [3434086401,0]
>>>[tyr:27288] *** on communicator MPI_COMM_WORLD
>>>[tyr:27288] *** MPI_ERR_SPAWN: could not spawn processes
>>>[tyr:27288] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>>>will now abort,
>>>[tyr:27288] *** and potentially your MPI job)
>>>tyr spawn 122
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>sunpc1 fd1026 105 ompi_info | grep -e "OPAL repo revision" -e "C
>>>compiler absolute"
>>>  OPAL repo revision: v2.x-dev-1290-gbd0e4e1
>>> C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc
>>>sunpc1 fd1026 106 uname -a
>>>SunOS sunpc1 5.10 

Re: [OMPI users] runtime errors for openmpi-v2.x-dev-1280-gc110ae8

2016-04-29 Thread Gilles Gouaillardet
the second for loop is incorrect

it reads
for (j=0; j < app->argc; j++)
but should be
for (j=0; j < app[i].argc; j++)

as a matter of taste, I'd rather replace all app[i]. with app->
and
app++;
at the end (or in the for) of the outermost loop
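
As a minimal illustration (a sketch of the intent only, not the committed patch),
the corrected inner loop would read:

    /* iterate over the argv of the i-th app, not always the first one */
    for (j = 0; j < app[i].argc; j++) {
        if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer,
                                                          &app[i].argv[j],
                                                          1, PMIX_STRING))) {
            return ret;
        }
    }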

Cheers,

Gilles


On Friday, April 29, 2016, Ralph Castain wrote:

> Hmmm…well, I may have to wait and let Gilles fix this. So far as I can
> see, the code in the current OMPI 2.x tarball (and upstream) is correct:
>
> int pmix_bfrop_pack_app(pmix_buffer_t *buffer, const void *src,
> int32_t num_vals, pmix_data_type_t type)
> {
> pmix_app_t *app;
> int32_t i, j, nvals;
> int ret;
>
> app = (pmix_app_t *) src;
>
> for (i = 0; i < num_vals; ++i) {
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer,
> &app[i].cmd, 1, PMIX_STRING))) {
> return ret;
> }
> /* argv */
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_int(buffer,
> &app[i].argc, 1, PMIX_INT))) {
> return ret;
> }
> for (j=0; j < app->argc; j++) {
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer,
> &app[i].argv[j], 1, PMIX_STRING))) {
> return ret;
> }
> }
> /* env */
> nvals = pmix_argv_count(app[i].env);
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_int32(buffer, &nvals,
> 1, PMIX_INT32))) {
> return ret;
> }
> for (j=0; j < nvals; j++) {
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer,
> &app[i].env[j], 1, PMIX_STRING))) {
> return ret;
> }
> }
> /* maxprocs */
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_int(buffer,
> &app[i].maxprocs, 1, PMIX_INT))) {
> return ret;
> }
> /* info array */
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_sizet(buffer,
> &app[i].ninfo, 1, PMIX_SIZE))) {
> return ret;
> }
> if (0 < app[i].ninfo) {
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_info(buffer,
> app[i].info, app[i].ninfo, PMIX_INFO))) {
> return ret;
> }
> }
> }
> return PMIX_SUCCESS;
> }
>
> Siegmar: have you tried the latest release candidate?
>
>
> On Apr 28, 2016, at 11:08 PM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>
> Hi Gilles,
>
> thank you very much for identifying the reason for the problem
> and fixing it.
>
> Have a nice weekend
>
> Siegmar
>
> Am 29.04.2016 um 03:38 schrieb Gilles Gouaillardet:
>
> Siegmar,
>
> in pmix_bfrop_pack_app,
> app->argc
> must be replaced with
> app[i].argc
>
> I will PR to pmix, ompi and ompi-release when I am back at work on Monday
>
> Cheers,
>
> Gilles
>
> On Thursday, April 28, 2016, Gilles Gouaillardet wrote:
>
>Siegmar,
>
>
>can you please also post the source of spawn_slave ?
>
>
>Cheers,
>
>Gilles
>
>
>On 4/28/2016 1:17 AM, Siegmar Gross wrote:
>
>Hi Gilles,
>
>it is not necessary to have a heterogeneous environment to reproduce
>the error as you can see below. All machines are 64 bit.
>
>tyr spawn 119 ompi_info | grep -e "OPAL repo revision" -e "C compiler
>absolute"
>  OPAL repo revision: v2.x-dev-1290-gbd0e4e1
> C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc
>tyr spawn 120 uname -a
>SunOS tyr.informatik.hs-fulda.de 5.10 Generic_150400-11 sun4u sparc SUNW,A70 Solaris
>tyr spawn 121 mpiexec -np 1 --host tyr,tyr,tyr,tyr spawn_multiple_master
>
>Parent process 0 running on tyr.informatik.hs-fulda.de
>
>  I create 3 slave processes.
>
>[tyr.informatik.hs-fulda.de:27286] PMIX ERROR: UNPACK-PAST-END in file
> ../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server_ops.c
>at line 829
>[tyr.informatik.hs-fulda.de:27286] PMIX ERROR: UNPACK-PAST-END in file
> ../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c
>at line 2176
>[tyr:27288] *** An error occurred in MPI_Comm_spawn_multiple
>[tyr:27288] *** reported by process [3434086401,0]
>[tyr:27288] *** on communicator MPI_COMM_WORLD
>[tyr:27288] *** MPI_ERR_SPAWN: could not spawn processes
>[tyr:27288] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>will now abort,
>[tyr:27288] *** and potentially your MPI job)
>tyr spawn 122
>
>
>
>
>
>
>sunpc1 fd1026 105 ompi_info | grep -e "OPAL repo revision" -e "C
>compiler absolute"
>  OPAL repo revision: v2.x-dev-1290-gbd0e4e1
> C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc
>sunpc1 fd1026 106 uname -a
>SunOS sunpc1 5.10 Generic_147441-21 i86pc i386 i

Re: [OMPI users] runtime errors for openmpi-v2.x-dev-1280-gc110ae8

2016-04-29 Thread Ralph Castain
Ah, okay - I can fix that line. Thanks for pointing it out.

Given that the rest of the code uses the app[i] syntax, I’d rather leave that 
alone.


> On Apr 29, 2016, at 7:27 AM, Gilles Gouaillardet wrote:
> 
> the second for loop is incorrect
> 
> it reads
> for (j=0; j < app->argc; j++)
> but should be
> for (j=0; j < app[i].argc; j++)
> 
> as a matter of taste, I'd rather replace all app[i]. with app->
> and
> app++;
> at the end (or in the for) of the outermost loop
> 
> Cheers,
> 
> Gilles
> 
> 
> On Friday, April 29, 2016, Ralph Castain wrote:
> Hmmm…well, I may have to wait and let Gilles fix this. So far as I can see, 
> the code in the current OMPI 2.x tarball (and upstream) is correct:
> 
> int pmix_bfrop_pack_app(pmix_buffer_t *buffer, const void *src,
> int32_t num_vals, pmix_data_type_t type)
> {
> pmix_app_t *app;
> int32_t i, j, nvals;
> int ret;
> 
> app = (pmix_app_t *) src;
> 
> for (i = 0; i < num_vals; ++i) {
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer, 
> &app[i].cmd, 1, PMIX_STRING))) {
> return ret;
> }
> /* argv */
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_int(buffer, &app[i].argc, 
> 1, PMIX_INT))) {
> return ret;
> }
> for (j=0; j < app->argc; j++) {
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer, 
> &app[i].argv[j], 1, PMIX_STRING))) {
> return ret;
> }
> }
> /* env */
> nvals = pmix_argv_count(app[i].env);
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_int32(buffer, &nvals, 1, 
> PMIX_INT32))) {
> return ret;
> }
> for (j=0; j < nvals; j++) {
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer, 
> &app[i].env[j], 1, PMIX_STRING))) {
> return ret;
> }
> }
> /* maxprocs */
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_int(buffer, 
> &app[i].maxprocs, 1, PMIX_INT))) {
> return ret;
> }
> /* info array */
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_sizet(buffer, 
> &app[i].ninfo, 1, PMIX_SIZE))) {
> return ret;
> }
> if (0 < app[i].ninfo) {
> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_info(buffer, 
> app[i].info, app[i].ninfo, PMIX_INFO))) {
> return ret;
> }
> }
> }
> return PMIX_SUCCESS;
> }
> 
> Siegmar: have you tried the latest release candidate?
> 
> 
>> On Apr 28, 2016, at 11:08 PM, Siegmar Gross wrote:
>> 
>> Hi Gilles,
>> 
>> thank you very much for identifying the reason for the problem
>> and fixing it.
>> 
>> Have a nice weekend
>> 
>> Siegmar
>> 
>> Am 29.04.2016 um 03:38 schrieb Gilles Gouaillardet:
>>> Siegmar,
>>> 
>>> in pmix_bfrop_pack_app,
>>> app->argc
>>> must be replaced with
>>> app[i].argc
>>> 
>>> I will PR to pmix, ompi and ompi-release when I am back at work on Monday
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On Thursday, April 28, 2016, Gilles Gouaillardet wrote:
>>> 
>>>Siegmar,
>>> 
>>> 
>>>can you please also post the source of spawn_slave ?
>>> 
>>> 
>>>Cheers,
>>> 
>>>Gilles
>>> 
>>> 
>>>On 4/28/2016 1:17 AM, Siegmar Gross wrote:
Hi Gilles,
 
it is not necessary to have a heterogeneous environment to reproduce
the error as you can see below. All machines are 64 bit.
 
tyr spawn 119 ompi_info | grep -e "OPAL repo revision" -e "C compiler
absolute"
  OPAL repo revision: v2.x-dev-1290-gbd0e4e1
 C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc
tyr spawn 120 uname -a
SunOS tyr.informatik.hs-fulda.de 5.10 Generic_150400-11 sun4u sparc SUNW,A70 Solaris
tyr spawn 121 mpiexec -np 1 --host tyr,tyr,tyr,tyr spawn_multiple_master
 
Parent process 0 running on tyr.informatik.hs-fulda.de
  I create 3 slave processes.
 
[tyr.informatik.hs-fulda.de:27286] PMIX ERROR: UNPACK-PAST-END in file
../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/src/server/pmix_server_ops.c
at line 829
[tyr.informatik.hs-fulda.de:27286] PMIX ERROR: UNPACK-PAST-END in file
../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pmix/pmix112/pmix/s

Re: [OMPI users] runtime errors for openmpi-v2.x-dev-1280-gc110ae8

2016-04-29 Thread Ralph Castain
https://github.com/open-mpi/ompi-release/pull/1117 



> On Apr 29, 2016, at 7:38 AM, Ralph Castain wrote:
> 
> Ah, okay - I can fix that line. Thanks for pointing it out.
> 
> Given that the rest of the code uses the app[i] syntax, I’d rather leave that 
> alone.
> 
> 
>> On Apr 29, 2016, at 7:27 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>> 
>> the second for loop is incorrect
>> 
>> it reads
>> for (j=0; j < app->argc; j++)
>> but should be
>> for (j=0; j < app[i].argc; j++)
>> 
>> as a matter of taste, I'd rather replace all app[i]. with app->
>> and
>> app++;
>> at the end (or in the for) of the outermost loop
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> 
>> On Friday, April 29, 2016, Ralph Castain wrote:
>> Hmmm…well, I may have to wait and let Gilles fix this. So far as I can see, 
>> the code in the current OMPI 2.x tarball (and upstream) is correct:
>> 
>> int pmix_bfrop_pack_app(pmix_buffer_t *buffer, const void *src,
>> int32_t num_vals, pmix_data_type_t type)
>> {
>> pmix_app_t *app;
>> int32_t i, j, nvals;
>> int ret;
>> 
>> app = (pmix_app_t *) src;
>> 
>> for (i = 0; i < num_vals; ++i) {
>> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer, 
>> &app[i].cmd, 1, PMIX_STRING))) {
>> return ret;
>> }
>> /* argv */
>> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_int(buffer, &app[i].argc, 
>> 1, PMIX_INT))) {
>> return ret;
>> }
>> for (j=0; j < app->argc; j++) {
>> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer, 
>> &app[i].argv[j], 1, PMIX_STRING))) {
>> return ret;
>> }
>> }
>> /* env */
>> nvals = pmix_argv_count(app[i].env);
>> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_int32(buffer, &nvals, 1, 
>> PMIX_INT32))) {
>> return ret;
>> }
>> for (j=0; j < nvals; j++) {
>> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_string(buffer, 
>> &app[i].env[j], 1, PMIX_STRING))) {
>> return ret;
>> }
>> }
>> /* maxprocs */
>> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_int(buffer, 
>> &app[i].maxprocs, 1, PMIX_INT))) {
>> return ret;
>> }
>> /* info array */
>> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_sizet(buffer, 
>> &app[i].ninfo, 1, PMIX_SIZE))) {
>> return ret;
>> }
>> if (0 < app[i].ninfo) {
>> if (PMIX_SUCCESS != (ret = pmix_bfrop_pack_info(buffer, 
>> app[i].info, app[i].ninfo, PMIX_INFO))) {
>> return ret;
>> }
>> }
>> }
>> return PMIX_SUCCESS;
>> }
>> 
>> Siegmar: have you tried the latest release candidate?
>> 
>> 
>>> On Apr 28, 2016, at 11:08 PM, Siegmar Gross wrote:
>>> 
>>> Hi Gilles,
>>> 
>>> thank you very much for identifying the reason for the problem
>>> and fixing it.
>>> 
>>> Have a nice weekend
>>> 
>>> Siegmar
>>> 
>>> Am 29.04.2016 um 03:38 schrieb Gilles Gouaillardet:
 Siegmar,
 
 in pmix_bfrop_pack_app,
 app->argc
 must be replaced with
 app[i].argc
 
 I will PR to pmix, ompi and ompi-release when I am back at work on Monday
 
 Cheers,
 
 Gilles
 
 On Thursday, April 28, 2016, Gilles Gouaillardet wrote:
 
Siegmar,
 
 
can you please also post the source of spawn_slave ?
 
 
Cheers,
 
Gilles
 
 
On 4/28/2016 1:17 AM, Siegmar Gross wrote:
>Hi Gilles,
> 
>it is not necessary to have a heterogeneous environment to reproduce
>the error as you can see below. All machines are 64 bit.
> 
>tyr spawn 119 ompi_info | grep -e "OPAL repo revision" -e "C compiler
>absolute"
>  OPAL repo revision: v2.x-dev-1290-gbd0e4e1
> C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc
>tyr spawn 120 uname -a
>SunOS tyr.informatik.hs-fulda.de 5.10 Generic_150400-11 sun4u sparc SUNW,A70 Solaris
>tyr spawn 121 mpiexec -np 1 --host tyr,tyr,tyr,tyr 
> spawn_multiple_master
> 
>Parent process 0 running on tyr.informatik.hs-fulda.de
>  I create 3 slave processes.
> 
>[tyr.informatik.hs-fulda.de:27286] PMIX ERROR: UNPACK-PAST-END in file
> ../../../../../../openmpi-v2.x-dev-1290-gbd0e4e1/opal/mca/pm

Re: [OMPI users] runtime errors for openmpi-v2.x-dev-1280-gc110ae8

2016-04-29 Thread Siegmar Gross

Hi Ralph,


Siegmar: have you tried the latest release candidate?


Yes, it is still broken.


Kind regards and thank you very much for your help

Siegmar


Re: [OMPI users] OpenMPI MPMD Support

2016-04-29 Thread Scott Shaw
I am using an -app file to run a serial application on N compute nodes, and each 
compute node has 24 cores available. If I only want to use one core to execute the 
serial app, I get a "not enough slots available" error when running OMPI.  How do 
you define the slots parameter to inform OMPI that a total of 24 cores are available 
per node when using an app file? I need to keep all parameters in the -app file, 
since any additional options passed on the mpirun command line are ignored.

io/jobs> mpirun -V
mpirun (Open MPI) 1.10.2

io/jobs> mpirun --app cmd.file
--
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  uptime

Either request fewer slots for your application, or make more slots available
for use.
--

  io/jobs> cat cmd.file
--host hosta -np 1 convertslice input1 output1
--host hosta -np 1 convertslice input2 output2
--host hostb -np 1 convertslice input3 output3
--host hostb -np 1 convertslice input4 output4
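
For reference, the per-node slot count can be declared in a hostfile; a minimal
sketch using the same hostnames (the hostfile name is illustrative, and whether
mpirun 1.10.2 honors --hostfile in combination with --app is exactly what I have
not been able to confirm):

  io/jobs> cat myhosts
  hosta slots=24
  hostb slots=24

  io/jobs> mpirun --hostfile myhosts --app cmd.file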

Following is the lscpu output from one of the compute nodes showing 24 cores 
and 24 HTs available.
io/jobs> lscpu
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):48
On-line CPU(s) list:   0-47
Thread(s) per core:2
Core(s) per socket:12
Socket(s): 2
NUMA node(s):  2
Vendor ID: GenuineIntel
CPU family:6
Model: 63
Stepping:  2
CPU MHz:   2500.092
BogoMIPS:  4999.93
Virtualization:VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47

Any guidance would be greatly appreciated.

Thanks,
Scott



Re: [OMPI users] OpenMPI MPMD Support

2016-04-29 Thread Fabricio Cannini

On 29-04-2016 14:59, Scott Shaw wrote:

I am using a –app file to run a serial application on N number of
compute nodes and each compute node has 24 cores available. If I only
want to use one core to execute the serial app I get a “not enough slots
available” error when running OMPI.  How do you define the slots
parameter to inform OMPI that a total of 24 cores are available per node
when using a app file. I need to contain all parameters in the –app file
since any additional options passed on the mpirun command line are ignored.


Hello Scott

You may want to take a look at gnu parallel:
https://www.gnu.org/software/parallel/

[ ]'s


Re: [OMPI users] OpenMPI MPMD Support

2016-04-29 Thread Ralph Castain
This might be a bug that has been fixed - can you try the 1.10.3rc? If it 
doesn’t work, I’ll try to quickly fix it.

> On Apr 29, 2016, at 10:59 AM, Scott Shaw wrote:
> 
> I am using a –app file to run a serial application on N number of compute 
> nodes and each compute node has 24 cores available. If I only want to use one 
> core to execute the serial app I get a “not enough slots available” error 
> when running OMPI.  How do you define the slots parameter to inform OMPI that 
> a total of 24 cores are available per node when using a app file. I need to 
> contain all parameters in the –app file since any additional options passed 
> on the mpirun command line are ignored.
>  
> io/jobs> mpirun -V
> mpirun (Open MPI) 1.10.2
>  
> io/jobs> mpirun --app cmd.file
> --
> There are not enough slots available in the system to satisfy the 2 slots
> that were requested by the application:
>   uptime
>  
> Either request fewer slots for your application, or make more slots available
> for use.
> --
>  
>   io/jobs> cat cmd.file
> --host hosta -np 1 convertslice input1 output1
> --host hosta -np 1 convertslice input2 output2
> --host hostb -np 1 convertslice input3 output3
> --host hostb -np 1 convertslice input4 output4
>  
> Following is the lscpu output from one of the compute nodes showing 24 cores 
> and 24 HTs available.
> io/jobs> lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):48
> On-line CPU(s) list:   0-47
> Thread(s) per core:2
> Core(s) per socket:12
> Socket(s): 2
> NUMA node(s):  2
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 63
> Stepping:  2
> CPU MHz:   2500.092
> BogoMIPS:  4999.93
> Virtualization:VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache:  256K
> L3 cache:  30720K
> NUMA node0 CPU(s): 0-11,24-35
> NUMA node1 CPU(s): 12-23,36-47
>  
> Any guidance would be greatly appreciated. 
>  
> Thanks,
> Scott
>  


Re: [OMPI users] OpenMPI MPMD Support

2016-04-29 Thread Scott Shaw
Thanks for the responses.  I will try 1.10.3rc release and see this addresses 
the issue.



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, April 29, 2016 2:30 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI MPMD Support

This might be a bug that has been fixed - can you try the 1.10.3rc? If it 
doesn’t work, I’ll try to quickly fix it.

On Apr 29, 2016, at 10:59 AM, Scott Shaw <ss...@sgi.com> wrote:

I am using a –app file to run a serial application on N number of compute nodes 
and each compute node has 24 cores available. If I only want to use one core to 
execute the serial app I get a “not enough slots available” error when running 
OMPI.  How do you define the slots parameter to inform OMPI that a total of 24 
cores are available per node when using a app file. I need to contain all 
parameters in the –app file since any additional options passed on the mpirun 
command line are ignored.

io/jobs> mpirun -V
mpirun (Open MPI) 1.10.2

io/jobs> mpirun --app cmd.file
--
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  uptime

Either request fewer slots for your application, or make more slots available
for use.
--

  io/jobs> cat cmd.file
--host hosta -np 1 convertslice input1 output1
--host hosta -np 1 convertslice input2 output2
--host hostb -np 1 convertslice input3 output3
--host hostb -np 1 convertslice input4 output4

Following is the lscpu output from one of the compute nodes showing 24 cores 
and 24 HTs available.
io/jobs> lscpu
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):48
On-line CPU(s) list:   0-47
Thread(s) per core:2
Core(s) per socket:12
Socket(s): 2
NUMA node(s):  2
Vendor ID: GenuineIntel
CPU family:6
Model: 63
Stepping:  2
CPU MHz:   2500.092
BogoMIPS:  4999.93
Virtualization:VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47

Any guidance would be greatly appreciated.

Thanks,
Scott




[OMPI users] MPI Datatypes and RMA

2016-04-29 Thread Palmer, Bruce J
I've been trying to recreate the semantics of the Global Arrays (GA) gather and 
scatter operations using MPI RMA routines, and I've run into some issues with 
MPI Datatypes. I've been building MPI versions of the GA gather and scatter calls, 
implemented with MPI datatypes built via the MPI_Type_create_struct call. I've 
developed a test program that simulates copying data into and out of a 1D 
distributed array of size NSIZE. Each processor contains a segment of approximately 
size NSIZE/nproc and is responsible for assigning every nproc-th value in the 
array, starting at the index equal to its rank. After assigning values and 
synchronizing the distributed data structure, each processor then reads the 
values set by the processor of next higher rank (the process with rank nproc-1 
reads the values set by process 0).

The distributed array is represented by an MPI window created using a 
standard MPI_Win_create call. The values in the array are set and read using 
MPI RMA operations, either MPI_Get/MPI_Put or MPI_Rget/MPI_Rput. Three 
different protocols have been used. The first is to call MPI_Win_lock and 
create a shared lock on the remote processor, then call MPI_Put/MPI_Get and 
then call MPI_Win_unlock to clear the lock. The second protocol is to use MPI 
request-based calls. After the call to MPI_Win_create, MPI_Win_lock_all is 
called to start a passive synchronization epoch on the window. Data is written 
and read to the distributed array using MPI_Rput/MPI_Rget immediately followed 
by a call to MPI_Wait, using the handle returned by the MPI_Rput/MPI_Rget call. 
The third protocol also immediately creates a passive synchronization epoch 
after window creation, but uses calls to MPI_Put/MPI_Get immediately followed 
by a call to MPI_Win_flush_local. These three protocols seem to cover all the 
possibilities that I have seen in other MPI/RMA based implementations of 
ARMCI/GA.
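
For concreteness, here is a minimal sketch of the first two protocols as described
above (illustrative only, not excerpted from the attached program; win is the
window, target is the remote rank, and rem_type is a hypothetical derived target
datatype):

  /* Protocol 1: shared lock, put, unlock */
  MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
  MPI_Put(local_buf, nval, MPI_INT, target, 0, 1, rem_type, win);
  MPI_Win_unlock(target, win);

  /* Protocol 2: passive epoch opened once after window creation,
   * then request-based operations completed with MPI_Wait */
  MPI_Win_lock_all(0, win);
  MPI_Request req;
  MPI_Rput(local_buf, nval, MPI_INT, target, 0, 1, rem_type, win, &req);
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  /* ... MPI_Win_unlock_all(win) when the epoch is closed */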

The issue that I've run into is that these tests seem to work reliably if I 
build the data type using the MPI_Type_create_subarray function but fail for 
larger arrays (NSIZE ~ 1) when I use MPI_Type_create_struct. Because the 
values being set by each processor are evenly spaced, I can use either function 
in this case (this is not generally true in applications). The struct data type 
hangs on 2 processors using lock/unlock, crashes for the request-based protocol 
and does not get the correct values in the Get phase of the data transfer when 
using flush_local. These tests are done on a Linux cluster using an Infiniband 
interconnect and the value of NSIZE is 1. For comparison, the same test 
using MPI_Type_create_subarray seems to function reliably for all three 
protocols for NSIZE=100 using 1,2,8 processors on 1 and 2 SMP nodes.

I've attached the test program for these test cases. Does anyone have a 
suggestion about what might be going on here?

Bruce
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define NSIZE 1000   /* size of array */

/*
 * To run this test program using the lock/unlock protocol, comment out both
 * defines for USE_MPI_REQUESTS and USE_MPI_FLUSH_LOCAL
 *
 * To run this test program using the request-based protocol with Rput, Rget,
 * uncomment the definition USE_MPI_REQUESTS and comment out the definition
 * USE_MPI_FLUSH_LOCAL
 *
 * To run this test program using the flush_local protocol, uncomment the
 * definitions for both USE_MPI_FLUSH_LOCAL and USE_MPI_REQUESTS
 *
 * The program can be converted to use MPI datatypes set up with
 * MPI_Type_create_subarray by commenting out the definition USE_STRUCTS
 */

/*
#define USE_MPI_REQUESTS
#define USE_MPI_FLUSH_LOCAL
*/

#ifdef USE_MPI_FLUSH_LOCAL
#define USE_MPI_REQUESTS
#endif

/*
 * To run this program using the MPI_Type_create_subarray instead of the
 * MPI_Type_create_struct routine, comment out the USE_STRUCTS definition
 */

#define USE_STRUCTS

void do_work(MPI_Comm comm, int offset)
{
  int one = 1;
  int me, nproc, wme;
  int i, j, iproc;
  int dims = NSIZE;
  int lo, hi, mysize;
  int nval, icnt, jcnt;
  int *values;
  int **index;
  int *ival;
  int sok, ok;
  int *local_buf;
  MPI_Win win;

  MPI_Comm_size(comm, &nproc);
  MPI_Comm_rank(comm, &me);
  MPI_Comm_rank(MPI_COMM_WORLD, &wme);

  /* Print out which protocol is being used */
  if (me==0) {
#ifdef USE_MPI_REQUESTS
#ifdef USE_MPI_FLUSH_LOCAL
printf("\nUsing flush local protocol\n");
#else
printf("\nUsing request-based protocol\n");
#endif
#else
printf("\nUsing lock/unlock protocol\n");
#endif
#ifdef USE_STRUCTS
printf("\nBuilding data types using stuct command\n");
#else
printf("\nBuilding data types using subarray command\n");
#endif
  }

  /* this processor will assign every nproc'th value starting at me */
  nval = (dims-1-me)/nproc+1;

  values = (int*)malloc(nval*sizeof(int));
  ival = (int*)malloc(nval*sizeof(int));

  icnt=0;
  for (i=me; i= dims) icnt = me;
/* create s

Re: [OMPI users] MPI Datatypes and RMA

2016-04-29 Thread Gilles Gouaillardet
Bruce,

Which version of Open MPI are you using?
Out of curiosity, did you try your program with another MPI implementation,
such as MPICH or one of its derivatives?

When using derived datatypes (ddt) in one-sided communication, the
ddt description must be sent with the data.
Two protocols are used internally:
- inline, for a "short" description
- within a new message, for a "long" description
Assuming your program is correct, my guess is there is a bug in the way
the "long" ddt description is handled, and I will investigate that.

That being said, it is very likely that MPI_Type_create_struct invoked with a
high count will internally generate a long description, so it will always
be suboptimal compared to MPI_Type_create_subarray or another subroutine
that can exploit the "regular shape" of your ddt.
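
For example (a hedged sketch, not taken from your test program), the evenly spaced
accesses could also be described with MPI_Type_vector, which keeps the datatype
description compact:

  /* nval blocks of one int each, separated by a stride of nproc ints */
  MPI_Datatype strided;
  MPI_Type_vector(nval, 1, nproc, MPI_INT, &strided);
  MPI_Type_commit(&strided);
  /* ... use strided as the target datatype in Put/Get ... */
  MPI_Type_free(&strided);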

Cheers,

Gilles

On Saturday, April 30, 2016, Palmer, Bruce J wrote:

> I’ve been trying to recreate the semantics of the Global Array gather and
> scatter operations using MPI RMA routines and I’ve run into some issues
> with MPI Datatypes. I’ve been focusing on building MPI versions of the GA
> gather and scatter calls, which I’ve been trying to implement using MPI
> data types built with the MPI_Type_create_struct call. I’ve developed a
> test program that simulates copying data into and out of a 1D distributed
> array of size NSIZE. Each processor contains a segment of approximately
> size NSIZE/nproc and is responsible for assigning every nprocth value in
> the array starting with the value indexed by the rank of the array. After
> assigning values and synchronizing the distributed data structure, each
> processor then reads the values set by the processor of next higher rank
> (the process with rank nproc-1 reads the values set by process 0).
>
>
>
> The distributed array is represented by and MPI window and created using a
> standard MPI_Win_create call. The values in the array are set and read
> using MPI RMA operations, either MPI_Get/MPI_Put or MPI_Rget/MPI_Rput.
> Three different protocols have been used. The first is to call MPI_Win_lock
> and create a shared lock on the remote processor, then call MPI_Put/MPI_Get
> and then call MPI_Win_unlock to clear the lock. The second protocol is to
> use MPI request-based calls. After the call to MPI_Win_create,
> MPI_Win_lock_all is called to start a passive synchronization epoch on the
> window. Data is written and read to the distributed array using
> MPI_Rput/MPI_Rget immediately followed by a call to MPI_Wait, using the
> handle returned by the MPI_Rput/MPI_Rget call. The third protocol also
> immediately creates a passive synchronization epoch after window creation,
> but uses calls to MPI_Put/MPI_Get immediately followed by a call to
> MPI_Win_flush_local. These three protocols seem to cover all the
> possibilities that I have seen in other MPI/RMA based implementations of
> ARMCI/GA.
>
>
>
> The issue that I’ve run into is that these tests seem to work reliably if
> I build the data type using the MPI_Type_create_subbarray function but fail
> for larger arrays (NSIZE ~ 1) when I use MPI_Type_create_struct.
> Because the values being set by each processor are evenly spaced, I can use
> either function in this case (this is not generally true in applications).
> The struct data type hangs on 2 processors using lock/unlock, crashes for
> the request-based protocol and does not get the correct values in the Get
> phase of the data transfer when using flush_local. These tests are done on
> a Linux cluster using an Infiniband interconnect and the value of NSIZE is
> 1. For comparison, the same test using MPI_Type_create_subarray seems
> to function reliably for all three protocols for NSIZE=100 using 1,2,8
> processors on 1 and 2 SMP nodes.
>
>
>
> I’ve attached the test program for these test cases. Does anyone have a
> suggestion about what might be going on here?
>
>
>
> Bruce
>