I am attempting to split my application into multiple master+workers groups using MPI_Comm_split. My MPI version is reported as:

    mpirun --tag-output ompi_info -v ompi full --parsable
    [1,0]<stdout>:package:Open MPI root@build-x86-64 Distribution
    [1,0]<stdout>:ompi:version:full:1.4.3
    [1,0]<stdout>:ompi:version:svn:r23834
    [1,0]<stdout>:ompi:version:release_date:Oct 05, 2010
    [1,0]<stdout>:orte:version:full:1.4.3
    [1,0]<stdout>:orte:version:svn:r23834
    [1,0]<stdout>:orte:version:release_date:Oct 05, 2010
    [1,0]<stdout>:opal:version:full:1.4.3
    [1,0]<stdout>:opal:version:svn:r23834
    [1,0]<stdout>:opal:version:release_date:Oct 05, 2010
    [1,0]<stdout>:ident:1.4.3

The basic problem I am having is that none of the process instances ever returns from the MPI_Comm_split call. I am pretty new to MPI and it is likely I am not doing things quite correctly. I'd appreciate some guidance.

I am working with an application that has functioned nicely for a while now. It uses only the single MPI_COMM_WORLD communicator. It is standard stuff: a master that hands out tasks to many workers, receives output, and keeps track of workers that are ready to receive another task. The tasks are quite compute-intensive. When running a variation of the process that uses Monte Carlo iterations, jobs can exceed the 30 hours they are limited to.

The MC iterations are independent of each other (each adds random noise to an input), so I would like to run multiple iterations simultaneously, so that four times as many cores finish in a quarter of the time. This would entail a supervisor interacting with multiple master+workers groups. I had thought that I would just have to declare a communicator for each group so that broadcasts and syncs would work within a single group.

    MPI_Comm_size( MPI_COMM_WORLD, &total_proc_count );
    MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
    ...
    cores_per_group = total_proc_count / groups_count;
    my_group = my_rank / cores_per_group;               // e.g., 0, 1, ...
    group_rank = my_rank - my_group * cores_per_group;  // rank within a group
    if ( my_rank == 0 ) continue;                       // Do not create group for supervisor
    MPI_Comm oldcomm = MPI_COMM_WORLD;
    MPI_Comm my_communicator;                           // Actually declared as a class variable
    int sstat = MPI_Comm_split( oldcomm, my_group, group_rank, &my_communicator );

There is never a return from the above _split() call. Do I need to do something else to set this up? I would have expected perhaps a non-zero status return, but not that I would get no return at all.

I would appreciate any comments or guidance.

- Gary
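
A note on the snippet above, offered as a sketch rather than a confirmed diagnosis: MPI_Comm_split is a collective operation, so every rank in the communicator passed as oldcomm has to call it. If rank 0 skips the call, the other ranks can be expected to block inside it. One common way to leave the supervisor out of every group is to have it make the call anyway with MPI_UNDEFINED as the color, in which case it receives MPI_COMM_NULL. The code below only computes the color and key each rank would pass. It reuses the arithmetic from the post; plan_split, struct split_args and SKETCH_UNDEFINED are hypothetical names introduced here, and SKETCH_UNDEFINED merely stands in for MPI_UNDEFINED so the fragment compiles without <mpi.h>.

```c
#include <assert.h>

/* Assumption: placeholder for MPI_UNDEFINED so this compiles without MPI.
   In the real program, pass MPI_UNDEFINED itself. */
#define SKETCH_UNDEFINED (-1)

/* Hypothetical container for the two values handed to MPI_Comm_split. */
struct split_args {
    int color;  /* group index, or SKETCH_UNDEFINED for the supervisor */
    int key;    /* orders the ranks inside the new communicator */
};

/* Hypothetical helper: same arithmetic as in the post, except that the
   supervisor (world rank 0) gets an "undefined" color instead of
   skipping the split call altogether. */
struct split_args plan_split(int my_rank, int total_proc_count, int groups_count)
{
    struct split_args args;
    int cores_per_group;

    assert(groups_count > 0 && total_proc_count >= groups_count);
    cores_per_group = total_proc_count / groups_count;

    if (my_rank == 0) {
        /* Supervisor: still takes part in the collective call,
           but asks not to be placed in any group. */
        args.color = SKETCH_UNDEFINED;
        args.key = 0;
    } else {
        args.color = my_rank / cores_per_group;                 /* my_group */
        args.key = my_rank - args.color * cores_per_group;      /* group_rank */
    }
    return args;
}

/* In the MPI program, every rank of MPI_COMM_WORLD, rank 0 included,
   would then call something like:
       MPI_Comm_split( MPI_COMM_WORLD, args.color, args.key, &my_communicator );
   with MPI_UNDEFINED in place of SKETCH_UNDEFINED. */
```

With 8 processes and 2 groups, for example, rank 0 would get the undefined color, ranks 1 to 3 would share color 0, and ranks 4 to 7 would share color 1. The key only determines ordering, so MPI would still number the ranks of each new communicator from 0.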