Shared memory doesn't extend between MPI_Comm_spawn'ed parent/child processes in 
Open MPI. Perhaps someday it will, but not yet.
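For reference, the failing pattern looks roughly like the following. This is an illustrative sketch, not the main.c attached later in the thread; the child binary name "./child" is assumed:

```c
/* Illustrative parent sketch (not the attached main.c): spawn one child
 * and try to talk to it over the resulting intercommunicator.  Under
 * "-mca btl self,sm" this fails in Open MPI, because the sm BTL does
 * not connect processes belonging to different MPI_Comm_spawn jobs. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm child;          /* intercommunicator to the spawned job */
    int err[1], msg = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                   MPI_COMM_SELF, &child, err);
    /* Any parent<->child traffic needs a BTL that can reach across
     * jobs (e.g. tcp); sm alone aborts with the error shown below. */
    MPI_Send(&msg, 1, MPI_INT, 0, 0, child);
    MPI_Comm_disconnect(&child);
    MPI_Finalize();
    return 0;
}
```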


On Jan 19, 2010, at 1:14 PM, Nicolas Bock wrote:

> Hello list,
> 
> I think I understand better now what's happening, although I still don't know 
> why. I have attached two small C programs that demonstrate the problem. The 
> program in main.c uses MPI_Comm_spawn() to start the code in the second 
> source, child.c. I can force the issue by running main.c with
> 
> mpirun -mca btl self,sm -np 1 ./main
> 
> and get this error:
> 
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> 
>   Process 1 ([[26121,2],0]) is on host: mujo
>   Process 2 ([[26121,1],0]) is on host: mujo
>   BTLs attempted: self sm
> 
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> 
> Is that because the spawned process is in a different group? They are still 
> all running on the same host, so at least in principle they should be able to 
> communicate with each other via shared memory.
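A minimal child for such a test typically just retrieves the intercommunicator back to the parent job; again an illustrative sketch, not the attached child.c:

```c
/* Illustrative child sketch (not the attached child.c): look up the
 * parent intercommunicator and receive from the parent's rank 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent;
    int msg;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    if (parent != MPI_COMM_NULL) {
        /* This receive is where the BTLs must connect the two jobs;
         * with only self,sm the job aborts before getting this far. */
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
        printf("child received %d\n", msg);
        MPI_Comm_disconnect(&parent);
    }
    MPI_Finalize();
    return 0;
}
```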
> 
> nick
> 
> 
> 
> On Fri, Jan 15, 2010 at 16:08, Eugene Loh <eugene....@sun.com> wrote:
> Dunno.  Do lower np values succeed?  If so, at what value of np does the job 
> no longer start?
> 
> Perhaps it's having a hard time creating the shared-memory backing file in 
> /tmp.  I think this is a 64-Mbyte file.  If this is the case, try reducing 
> the size of the shared area per this FAQ item:  
> http://www.open-mpi.org/faq/?category=sm#decrease-sm  Most notably, reduce 
> mpool_sm_min_size below 67108864.
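Concretely, the FAQ suggestion amounts to something like the following command line; the 16-MB value and the "./job" binary name are only examples:

```shell
# Shrink the shared-memory backing file from its 64 MB (67108864-byte)
# default; 16777216 (16 MB) here is purely illustrative.
mpirun -np 16 -mca btl self,sm -mca mpool_sm_min_size 16777216 ./job
```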
> 
> Also note trac ticket 2043, which describes problems with the sm BTL exposed 
> by GCC 4.4.x compilers.  You need to get a sufficiently recent build to solve 
> this.  But, those problems don't occur until you start passing messages, and 
> here you're not even starting up.
> 
> 
> Nicolas Bock wrote:
>> 
>> Sorry, I forgot to give more details on what versions I am using:
>> 
>> OpenMPI 1.4
>> Ubuntu 9.10, kernel 2.6.31-16-generic #53-Ubuntu
>> gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1
>> 
>> On Fri, Jan 15, 2010 at 15:47, Nicolas Bock <nicolasb...@gmail.com> wrote:
>> Hello list,
>> 
>> I am running a job on a machine with four quad-core AMD Opterons. The 
>> machine has 16 cores, which I can verify by looking at /proc/cpuinfo. 
>> However, when I run a job with
>> 
>> mpirun -np 16 -mca btl self,sm job
>> 
>> I get this error:
>> 
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>> 
>>   Process 1 ([[56972,2],0]) is on host: rust
>>   Process 2 ([[56972,1],0]) is on host: rust
>>   BTLs attempted: self sm
>> 
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> 
>> By adding the tcp BTL I can run the job. I don't understand why Open MPI 
>> claims that a pair of processes cannot reach each other; after all, every 
>> processor core has access to all of the memory. Do I need to set some 
>> other BTL limit?
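Spelled out, the workaround described above looks like this (the "job" binary name comes from the earlier command):

```shell
# Adding tcp gives Open MPI a transport that can connect the processes
# where sm alone could not; "self" must always be included in the list.
mpirun -np 16 -mca btl self,sm,tcp ./job
```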
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> <main.c><child.c>
