I'm encountering an error using qsub that none of us can figure out. MPI C++ programs seem to run fine when executed from the command line, but when I submit them through the queue I get a strange error message:
[compute-3-12.local][[58672,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 2002:8170:6c2f:b:21d:9ff:fefd:7d94 failed: Permission denied (13)

The particular compute node (3-12) doesn't matter: the error can come from any of the nodes, and I'm guessing that 3-12 is the parent node here.

To check whether there was some problem with my own code, I created a simple 'hello world' program (see attached files). Again, the program runs fine from the command line but fails in qsub with the same sort of error message. I have included (i) the code, (ii) the job script for qsub, and (iii) the ".o" file from qsub for the "hello world" program.

These don't look like MPI errors, but rather some conflict with, maybe, secure communication across nodes. Is there something simple I can do to fix this?

Thanks,

Erik Nelson
Howard Hughes Medical Institute
6001 Forest Park Blvd., Room ND10.124
Dallas, Texas 75235-9050
p : 214 645 5981
f : 214 645 5948
#include <stdio.h>
#include <string.h>   // strncat, strlen
#include "/opt/openmpi/include/mpi.h"

#define bufdim 128

int main(int argc, char *argv[])
{
    char buffer[bufdim];
    char id_str[32];

    // mpi :
    MPI::Init(argc, argv);
    MPI::Status status;

    int size = MPI::COMM_WORLD.Get_size();
    int rank = MPI::COMM_WORLD.Get_rank();
    int tag = 0;

    if (rank == 0) {
        printf("%d: we have %d processors\n", rank, size);

        // send a greeting to every other rank
        for (int i = 1; i < size; ++i) {
            sprintf(buffer, "hello %d! ", i);
            MPI::COMM_WORLD.Send(buffer, bufdim, MPI::CHAR, i, tag);
        }

        // collect and print the replies in rank order
        for (int i = 1; i < size; ++i) {
            MPI::COMM_WORLD.Recv(buffer, bufdim, MPI::CHAR, i, tag, status);
            printf("%d: %s\n", rank, buffer);
        }
    } else {
        MPI::COMM_WORLD.Recv(buffer, bufdim, MPI::CHAR, 0, tag, status);
        sprintf(id_str, "processor %d ", rank);

        // strncat's third argument is the space still free in buffer,
        // not the total size of buffer
        strncat(buffer, id_str, bufdim - strlen(buffer) - 1);
        strncat(buffer, "reporting for duty\n", bufdim - strlen(buffer) - 1);
        MPI::COMM_WORLD.Send(buffer, bufdim, MPI::CHAR, 0, tag);
    }

    MPI::Finalize();
    return 0;
}
Attachments:
hello.job
hello.job.o5822590