Hello list,

I think I understand better now what's happening, although I still don't
know why. I have attached two small C programs that demonstrate the problem.
The code in main.c uses MPI_Comm_spawn() to start the code in the second
source, child.c (built as ./other, the name spawned in main.c). I can force
the issue by running the main.c code with

mpirun -mca btl self,sm -np 1 ./main

and get this error:

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[26121,2],0]) is on host: mujo
  Process 2 ([[26121,1],0]) is on host: mujo
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------

Is that because the spawned process is in a different group? The processes
are all still running on the same host, so at least in principle they should
be able to communicate with each other via shared memory.
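
By analogy with the 16-process job below, adding the tcp BTL may serve as a
workaround here too; a sketch (I have not verified that tcp is actually what
the spawned process falls back to):

mpirun -mca btl self,sm,tcp -np 1 ./main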

nick



On Fri, Jan 15, 2010 at 16:08, Eugene Loh <eugene....@sun.com> wrote:

>  Dunno.  Do lower np values succeed?  If so, at what value of np does the
> job no longer start?
>
> Perhaps it's having a hard time creating the shared-memory backing file in
> /tmp.  I think this is a 64-Mbyte file.  If this is the case, try reducing
> the size of the shared area per this FAQ item:
> http://www.open-mpi.org/faq/?category=sm#decrease-sm  Most notably, reduce
> mpool_sm_min_size below 67108864.
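>
> A sketch of that FAQ suggestion on the command line (33554432 is just an
> illustrative value below the 64-Mbyte default, not a recommendation):
>
> mpirun -mca mpool_sm_min_size 33554432 -np 16 -mca btl self,sm job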
>
> Also note trac ticket 2043, which describes problems with the sm BTL
> exposed by GCC 4.4.x compilers.  You need to get a sufficiently recent build
> to solve this.  But, those problems don't occur until you start passing
> messages, and here you're not even starting up.
>
>
> Nicolas Bock wrote:
>
> Sorry, I forgot to mention which versions I am using:
>
> Open MPI 1.4
> Ubuntu 9.10, kernel 2.6.31-16-generic #53-Ubuntu
> gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1
>
> On Fri, Jan 15, 2010 at 15:47, Nicolas Bock <nicolasb...@gmail.com> wrote:
>
>> Hello list,
>>
>> I am running a job on a machine with 4 quad-core AMD Opterons, i.e. 16
>> cores, which I can verify by looking at /proc/cpuinfo. However, when I run
>> a job with
>>
>> mpirun -np 16 -mca btl self,sm job
>>
>> I get this error:
>>
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>>   Process 1 ([[56972,2],0]) is on host: rust
>>   Process 2 ([[56972,1],0]) is on host: rust
>>   BTLs attempted: self sm
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>>
>> By adding the tcp btl I can run the job. I don't understand why Open MPI
>> claims that a pair of processes cannot reach each other; all processor
>> cores should have access to the same memory, after all. Do I need to set
>> some other btl limit?
>>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
/* main.c -- spawns one child process and waits for it to signal completion. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int
main (int argc, char **argv)
{
  int rank;
  int error_codes[1];
  char buffer[1];
  MPI_Comm intercomm;
  MPI_Status status;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0)
  {
    printf("[master] spawning process\n");
    /* MPI_ARGV_NULL: the child needs no command-line arguments (passing
       main's argv here would hand "./main" to the child as an argument). */
    MPI_Comm_spawn("./other", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                   MPI_COMM_SELF, &intercomm, error_codes);

    /* Wait for the child to finish. */
    MPI_Recv(buffer, 1, MPI_CHAR, MPI_ANY_SOURCE, 1, intercomm, &status);
  }

  printf("[master (%i)] waiting at barrier\n", rank);
  MPI_Barrier(MPI_COMM_WORLD);
  printf("[master (%i)] done\n", rank);

  MPI_Finalize();

  return 0;
}
/* child.c -- the spawned program (built as ./other); sleeps briefly, then
   signals the parent over the intercommunicator. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int
main (int argc, char **argv)
{
  int rank;
  char buffer[1];
  MPI_Comm parent;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  /* parent is the intercommunicator back to the process that spawned us. */
  MPI_Comm_get_parent(&parent);

  printf("[slave (%i)] starting up, sleeping...\n", rank);
  sleep(5);
  printf("[slave (%i)] done sleeping, signalling master\n", rank);
  MPI_Send(buffer, 1, MPI_CHAR, 0, 1, parent);

  MPI_Finalize();

  return 0;
}
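
The two sources can be built with the Open MPI wrapper compiler; a minimal
sketch, naming the child binary "other" to match what main.c spawns:

mpicc main.c -o main
mpicc child.c -o other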
