[OMPI users] MPI_Comm_Spawn failure: All nodes already filled

2019-08-06 Thread Mccall, Kurt E. (MSFC-EV41) via users
Hi,

MPI_Comm_spawn() is failing with the error message "All nodes which are 
allocated for this job are already filled".  I compiled Open MPI 4.0.1 with the 
Portland Group C++ compiler, v. 19.5.0, both with and without Torque/Maui 
support.  I thought that not using Torque/Maui support would give me finer 
control over where MPI_Comm_spawn() places the processes, but the failure 
message was the same in either case.  Perhaps Torque is interfering with 
process creation somehow?

For the pared-down test code, I am following the instructions here to make 
mpiexec create exactly one manager process on a remote node, and then forcing 
that manager to spawn one worker process on the same remote node:

https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn




Here is the full error message.   Note the Max Slots: 0 message therein (?):

Data for JOB [39020,1] offset 0 Total slots allocated 22

   ========================   JOB MAP   ========================

 Data for node: n001    Num slots: 2    Max slots: 2    Num procs: 1
        Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: N/A

 ==============================================================

Data for JOB [39020,1] offset 0 Total slots allocated 22

   ========================   JOB MAP   ========================

 Data for node: n001    Num slots: 2    Max slots: 0    Num procs: 1
        Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: socket
0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]

 ==============================================================
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
[n001:08114] *** An error occurred in MPI_Comm_spawn
[n001:08114] *** reported by process [2557214721,0]
[n001:08114] *** on communicator MPI_COMM_SELF
[n001:08114] *** MPI_ERR_SPAWN: could not spawn processes
[n001:08114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
abort,
[n001:08114] ***    and potentially your MPI job)




Here is my mpiexec command:

mpiexec --display-map --v --x DISPLAY -hostfile MyNodeFile --np 1 -map-by 
ppr:1:node SpawnTestManager




Here is my hostfile "MyNodeFile":

n001.cluster.com slots=2 max_slots=2




Here is my SpawnTestManager code:


#include <string>
#include <iostream>
#include <cstdio>

#ifdef SUCCESS
#undef SUCCESS
#endif
#include "/opt/openmpi_pgc_tm/include/mpi.h"

using std::string;
using std::cout;
using std::endl;

int main(int argc, char *argv[])
{
    int rank, world_size;
    char *argv2[2];
    MPI_Comm mpi_comm;
    MPI_Info info;
    char host[MPI_MAX_PROCESSOR_NAME + 1];
    int host_name_len;

    string worker_cmd = "SpawnTestWorker";
    string host_name = "n001.cluster.com";

    argv2[0] = "dummy_arg";
    argv2[1] = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    MPI_Get_processor_name(host, &host_name_len);
    cout << "Host name from MPI_Get_processor_name is " << host << endl;

    char info_str[64];
    sprintf(info_str, "ppr:%d:node", 1);
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host_name.c_str());
    MPI_Info_set(info, "map-by", info_str);

    MPI_Comm_spawn(worker_cmd.c_str(), argv2, 1, info, rank, MPI_COMM_SELF,
                   &mpi_comm, MPI_ERRCODES_IGNORE);
    MPI_Comm_set_errhandler(mpi_comm, MPI::ERRORS_THROW_EXCEPTIONS);

    std::cout << "Manager success!" << std::endl;

    MPI_Finalize();
    return 0;
}
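
As an aside, in case it helps with debugging: I believe that to have the spawn
failure reported back to the caller instead of aborting, the error handler has
to be set on the parent communicator *before* the call, and the per-process
error codes requested explicitly. A minimal change to the code above would be
something like:

    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);
    int spawn_errs[1];
    int rc = MPI_Comm_spawn(worker_cmd.c_str(), argv2, 1, info, 0, MPI_COMM_SELF,
                            &mpi_comm, spawn_errs);
    if (rc != MPI_SUCCESS)
        cout << "MPI_Comm_spawn returned error " << rc << endl;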




Here is my SpawnTestWorker code:


#include "/opt/openmpi_pgc_tm/include/mpi.h"
#include 

int main(int argc, char *argv[])
{
int world_size, rank;
MPI_Comm manager_intercom;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

MPI_Comm_get_parent(&manager_intercom);
MPI_Comm_set_errhandler(manager_intercom, MPI::ERRORS_THROW_EXCEPTIONS);

std::cout << "Worker success!" << std::endl;

MPI_Finalize();
return 0;
}


My config.log can be found here:  
https://gist.github.com/kmccall882/e26bc2ea58c9328162e8959b614a6fce.js

I've attached the other info requested on the help page, except the output 
of "ompi_info -v ompi full --parsable".  My version of ompi_info doesn't 
accept the "ompi full" arguments, and the "-all" arg doesn't produce much 
output.

Thanks for your help,
Kurt









___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] silent failure for large allgather

2019-08-06 Thread Emmanuel Thomé via users
Hi,

In the attached program, the MPI_Allgather() call fails to communicate
all data (the amount it communicates wraps around at 4G...).  I'm running
on an omnipath cluster (2018 hardware), openmpi 3.1.3 or 4.0.1 (tested both).

With the OFI mtl, the failure is silent, with no error message reported.
This is very annoying.

With the PSM2 mtl, we have at least some info printed that 4G is a limit.

I have tested it with various combinations of mca parameters. It seems
that the one config bit that makes the test pass is the selection of the
ob1 pml. However I have to select it explicitly, because otherwise cm is
selected instead (priority 40 vs 20, it seems), and the program fails. I
don't know to what extent the cm pml is the root cause, or whether I'm
witnessing a side-effect of something else.
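
For reference, the run that passes is along these lines (same test as below,
just forcing the ob1 pml explicitly):

node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node -n 2 --mca pml ob1 ./a.out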

openmpi-3.1.3 (debian10 package openmpi-bin-3.1.3-11):

node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node  -n 2 ./a.out
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 
0x100010000 bytes: ...
Message size 4295032832 bigger than supported by PSM2 API. Max = 4294967296
MPI error returned:
MPI_ERR_OTHER: known error not in list
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 
0x100010000 bytes: NOK
[node0.localdomain:14592] 1 more process has sent help message 
help-mtl-psm2.txt / message too big
[node0.localdomain:14592] Set MCA parameter "orte_base_help_aggregate" to 0 
to see all help / error messages

node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node  -n 2 --mca mtl ofi 
./a.out
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 
0x100010000 bytes: ...
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 
0x100010000 bytes: NOK
node 0 failed_offset = 0x10002
node 1 failed_offset = 0x1
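
For reference, the per-rank payload here is 0x10001 * 0x10000 = 0x100010000
bytes = 4295032832, i.e. 65536 bytes past 2^32 = 4294967296. That is exactly
the PSM2 limit printed above, and is consistent with the transferred amount
wrapping around at 4G.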

I attached the corresponding outputs with some mca verbose
parameters on, plus ompi_info, as well as variations of the pml layer
(ob1 works).

openmpi-4.0.1 gives essentially the same results (similar files
attached), but with various doubts on my part as to whether I've run this
check correctly. Here are my doubts:
- whether I should or should not have a ucx build for an omnipath cluster
  (IIUC https://github.com/openucx/ucx/issues/750 is now fixed?),
- which btl I should use (I understand that openib is headed for
  deprecation and it complains unless I do --mca btl openib --mca
  btl_openib_allow_ib true; fine. But then, which non-openib, non-tcp
  btl should I use instead?)
- which layers matter and which matter less... I tinkered with btl,
  pml and mtl.  It's fine if there are multiple choices, but if some
  combinations lead to silent data corruption, that's not really
  cool.

Could the error reporting in this case be somehow improved ?

I'd be glad to provide more feedback if needed.

E.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

long failed_offset = 0;

size_t chunk_size = 1 << 16;
size_t nchunks = (1 << 16) + 1;

int main(int argc, char * argv[])
{
    if (argc >= 2) chunk_size = atol(argv[1]);
    if (argc >= 3) nchunks = atol(argv[2]);

    MPI_Init(&argc, &argv);
    /*
     * This function returns:
     *  0 on success.
     *  a non-zero MPI Error code if MPI_Allgather returned one.
     *  -1 if no MPI Error code was returned, but the result of Allgather
     *  was wrong.
     *  -2 if memory allocation failed.
     *
     * (note that the MPI document guarantees that MPI error codes are
     * positive integers)
     */

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int err;

    char * check_text;
    int rc = asprintf(&check_text, "MPI_Allgather, %d nodes, 0x%zx chunks of 0x%zx bytes, total %d * 0x%zx bytes", size, nchunks, chunk_size, size, chunk_size * nchunks);
    if (rc < 0) abort();

    if (!rank) printf("%s: ...\n", check_text);

    MPI_Datatype mpi_ft;
    MPI_Type_contiguous(chunk_size, MPI_BYTE, &mpi_ft);
    MPI_Type_commit(&mpi_ft);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    void * data = malloc(nchunks * size * chunk_size);
    int alloc_ok = data != NULL;
    if (alloc_ok) memset(data, 0, nchunks * size * chunk_size);
    MPI_Allreduce(MPI_IN_PLACE, &alloc_ok, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
    if (alloc_ok) {
        memset(((char*)data) + nchunks * chunk_size * rank, 0x42, nchunks * chunk_size);
        err = MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                data, nchunks,
                mpi_ft, MPI_COMM_WORLD);
        if (err == 0) {
            void * p = memchr(data, 0, nchunks * size * chunk_size);
            if (p != NULL) {
                /* We found a zero, we shouldn't ! */
                err = -1;
                failed_offset = ((char*)p)-(char*)data;
            }
        }
    } else {
        err = -2;
    }
    if (data) free(data);
    MPI_Type_free(&mpi_ft);
if (!rank) {
  

Re: [OMPI users] MPI_Comm_Spawn failure: All nodes already filled

2019-08-06 Thread Ralph Castain via users
I'm afraid I cannot replicate this problem on OMPI master, so it could be 
something different about OMPI 4.0.1 or your environment. Can you download and 
test one of the nightly tarballs from the "master" branch and see if it works 
for you?

https://www.open-mpi.org/nightly/master/
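
A quick test build along these lines should do (the tarball name below is just
a placeholder, use whatever is currently listed on that page, and add the same
compiler and --with-tm options you used for your 4.0.1 build):

  tar xjf openmpi-master-YYYYMMDD.tar.bz2
  cd openmpi-master-YYYYMMDD
  ./configure --prefix=$HOME/ompi-master
  make -j 8 install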

Ralph


On Aug 6, 2019, at 3:58 AM, Mccall, Kurt E. (MSFC-EV41) via users
<users@lists.open-mpi.org> wrote:
Hi,

MPI_Comm_spawn() is failing with the error message "All nodes which are
allocated for this job are already filled".  I compiled Open MPI 4.0.1 with the
Portland Group C++ compiler, v. 19.5.0, both with and without Torque/Maui
support.  I thought that not using Torque/Maui support would give me finer
control over where MPI_Comm_spawn() places the processes, but the failure
message was the same in either case.  Perhaps Torque is interfering with
process creation somehow?

[...]