Hi

I have a problem with groups and communicators in openmpi-1.9a1r27787
with Java. I want to multiply two matrices with any number of
processes. If I start more than n processes (n is the number of rows
in matrix "a"), I build a new group. If I start at most n processes,
I use all of them.
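
The selection rule can be sketched in plain Java without any MPI calls.
This is only an illustration: the class name "GroupPlan", both method
names, and the choice of the lowest ranks as workers are assumptions,
because the part of my program that fills "group_w_mem" is not shown
here. n is taken to be 4, the number of rows of matrix "a" in the runs
shown in this mail.

```java
import java.util.Arrays;

// Hypothetical helper, not part of my program or of any MPI binding.
final class GroupPlan {

    /** Ranks that go into the worker group (assumed: the first min(np, n) ranks). */
    static int[] workerRanks(int np, int n) {
        int size = Math.min(np, n);
        int[] ranks = new int[size];
        for (int i = 0; i < size; i++) {
            ranks[i] = i;
        }
        return ranks;
    }

    /** Ranks that are left over, i.e. all ranks minus the worker ranks. */
    static int[] otherRanks(int np, int n) {
        int first = Math.min(np, n);
        int[] ranks = new int[np - first];
        for (int i = 0; i < ranks.length; i++) {
            ranks[i] = first + i;
        }
        return ranks;
    }

    public static void main(String[] args) {
        int n = 4; // rows of matrix "a"
        for (int np : new int[] {4, 6}) {
            System.out.println("np=" + np
                + " workers=" + Arrays.toString(workerRanks(np, n))
                + " others=" + Arrays.toString(otherRanks(np, n)));
        }
    }
}
```

Under these assumptions the "other" set has no members for "-np 4" and
two members for "-np 6", which is what I expect from the real groups.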

My program contains the following code.

...
      /* Create group "groupWorker"                                     */
      groupWorker = groupCommWorld.Incl (group_w_mem);
    }
    else
    {
      /* there are at most as many processes as rows in matrix "a",
       * i.e., we can use the "basic group"
       */
      groupWorker = groupCommWorld;
    }
    /* Create group "groupOther" which demonstrates only how to use
     * another group operation and which has nothing to do in this
     * program.
     */
    groupOther = Group.Difference (groupCommWorld, groupWorker);
    if (groupOther == MPI.GROUP_EMPTY)
    {
      System.out.println ("groupOther is empty.");
    }
    else
    {
      System.out.println ("groupOther is not empty.");
    }

    groupCommWorld.finalize ();
    /* Create communicators for both groups. The communicator is only
     * defined for all processes of the group and it is undefined
     * (MPI.COMM_NULL) for all other processes.
     */
    COMM_WORKER = MPI.COMM_WORLD.Creat (groupWorker);
    COMM_OTHER  = MPI.COMM_WORLD.Creat (groupOther);
...


Shouldn't "MPI.COMM_WORLD.Creat" be "MPI.COMM_WORLD.Create"?
"groupOther" should be empty if I use "-np 4". Unfortunately, it isn't.

tyr java 112 ompi_info | grep "Open MPI:"
                Open MPI: 1.9a1r27787
tyr java 113 mpijavac MatMultWithAnyProc2DarrayIn1DarrayMain.java
tyr java 114 mpiexec -np 4 java MatMultWithAnyProc2DarrayIn1DarrayMain
groupOther is not empty.
[tyr:25128] *** An error occurred in MPI_Comm_create
[tyr:25128] *** reported by process [3288334337,0]
[tyr:25128] *** on communicator MPI_COMM_WORLD
[tyr:25128] *** MPI_ERR_GROUP: invalid group
[tyr:25128] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[tyr:25128] ***    and potentially your MPI job)
...


Everything works fine if I use "-np 6". I have removed some lines
so that the output is more readable.

tyr java 115 mpiexec -np 6 java MatMultWithAnyProc2DarrayIn1DarrayMain
groupOther is not empty.

(4,6)-matrix a:

      1.00      2.00      3.00      4.00      5.00      6.00
      7.00      8.00      9.00     10.00     11.00     12.00
     13.00     14.00     15.00     16.00     17.00     18.00
     19.00     20.00     21.00     22.00     23.00     24.00

(6,8)-matrix b:

     48.00     47.00     46.00     45.00     44.00     43.00     42.00     41.00
     40.00     39.00     38.00     37.00     36.00     35.00     34.00     33.00
     32.00     31.00     30.00     29.00     28.00     27.00     26.00     25.00
     24.00     23.00     22.00     21.00     20.00     19.00     18.00     17.00
     16.00     15.00     14.00     13.00     12.00     11.00     10.00      9.00
      8.00      7.00      6.00      5.00      4.00      3.00      2.00      1.00

(4,8)-result-matrix c = a * b:

    448.00    427.00    406.00    385.00    364.00    343.00    322.00    301.00
   1456.00   1399.00   1342.00   1285.00   1228.00   1171.00   1114.00   1057.00
   2464.00   2371.00   2278.00   2185.00   2092.00   1999.00   1906.00   1813.00
   3472.00   3343.00   3214.00   3085.00   2956.00   2827.00   2698.00   2569.00



It seems that I'm not allowed to do

groupWorker = groupCommWorld;
...
groupOther = Group.Difference (groupCommWorld, groupWorker);

or that Group.Difference doesn't return MPI.GROUP_EMPTY.
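
Both suspicions can be illustrated in plain Java, independent of MPI.
The sketch below uses a hypothetical stand-in class "FakeGroup" (not
part of any MPI binding) to show two language-level facts. First, an
assignment copies only the reference, so releasing the object through
one name also invalidates the other name. Second, "==" compares object
identity, so a newly created empty group is not "==" to a predefined
empty constant. Whether the Open MPI Java bindings really behave like
this model (that "finalize ()" releases the underlying group, or that
"Group.Difference" returns a new object) is an assumption and is not
confirmed by anything in this mail.

```java
// Plain-Java model; every name in it is hypothetical.
final class AliasDemo {

    /** Stand-in for a group handle. */
    static final class FakeGroup {
        static final FakeGroup EMPTY = new FakeGroup(0);

        private final int size;
        private boolean freed = false;

        FakeGroup(int size) {
            this.size = size;
        }

        int size() {
            if (freed) {
                throw new IllegalStateException("invalid group");
            }
            return size;
        }

        void free() {
            freed = true;
        }

        boolean isFreed() {
            return freed;
        }

        /**
         * Models a difference that always builds a new object, even for
         * an empty result. Subtracting sizes is only valid here because
         * b is a subset of a in this model.
         */
        static FakeGroup difference(FakeGroup a, FakeGroup b) {
            return new FakeGroup(Math.max(0, a.size() - b.size()));
        }
    }

    public static void main(String[] args) {
        FakeGroup world = new FakeGroup(4);
        FakeGroup worker = world;  // same object, two names
        FakeGroup other = FakeGroup.difference(world, worker);

        System.out.println("identity test: " + (other == FakeGroup.EMPTY));
        System.out.println("size test:     " + (other.size() == 0));

        world.free();              // also invalidates "worker"
        System.out.println("worker freed:  " + worker.isFreed());
    }
}
```

In this model the identity test fails although the group has no
members, and the alias becomes unusable as soon as the original is
released. If the bindings behave similarly, comparing the group's size
with zero and releasing "groupCommWorld" only after the communicators
have been created might avoid both symptoms.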



I have a similar program in C which also doesn't work with Open MPI
(I get the same error for openmpi-1.6.4 and 1.9).

tyr strided_vector 109 ompi_info | grep "Open MPI:"
                Open MPI: 1.6.4a1r27643

tyr strided_vector 108 ompi_info | grep "Open MPI:"
                Open MPI: 1.9a1r27787


tyr strided_vector 108 mpiexec -np 4 data_type_4
Process 0 of 4 running on tyr.informatik.hs-fulda.de
Process 1 of 4 running on tyr.informatik.hs-fulda.de
Process 2 of 4 running on tyr.informatik.hs-fulda.de
Process 3 of 4 running on tyr.informatik.hs-fulda.de

original matrix:

     1     2     3     4     5     6     7     8     9    10
    11    12    13    14    15    16    17    18    19    20
    21    22    23    24    25    26    27    28    29    30
    31    32    33    34    35    36    37    38    39    40
    41    42    43    44    45    46    47    48    49    50
    51    52    53    54    55    56    57    58    59    60

result matrix:
  elements are sqared in columns:
     0   1   2   6   7
  elements are multiplied with 2 in columns:
     3   4   5   8   9


     1     4     9     8    10    12    49    64    18    20
   121   144   169    28    30    32   289   324    38    40
   441   484   529    48    50    52   729   784    58    60
   961  1024  1089    68    70    72  1369  1444    78    80
  1681  1764  1849    88    90    92  2209  2304    98   100
  2601  2704  2809   108   110   112  3249  3364   118   120

Assertion failed: OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (comm->c_remote_group))->obj_magic_id, file ../../openmpi-1.6.4a1r27643/ompi/communicator/comm_init.c, line 412
[tyr:24415] *** Process received signal ***
Assertion failed: OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (comm->c_remote_group))->obj_magic_id, file ../../openmpi-1.6.4a1r27643/ompi/communicator/comm_init.c, line 412
[tyr:24415] Signal: Abort (6)
[tyr:24415] Signal code:  (-1)
...



The program works as expected if I use LAM-MPI.

tyr strided_vector 115 lamboot

LAM 6.5.9/MPI 2 C++ - Indiana University

tyr strided_vector 116 mpirun -np 4 data_type_4
Process 0 of 4 running on tyr.informatik.hs-fulda.de
Process 1 of 4 running on tyr.informatik.hs-fulda.de
Process 2 of 4 running on tyr.informatik.hs-fulda.de
Process 3 of 4 running on tyr.informatik.hs-fulda.de

original matrix:

     1     2     3     4     5     6     7     8     9    10
    11    12    13    14    15    16    17    18    19    20
    21    22    23    24    25    26    27    28    29    30
    31    32    33    34    35    36    37    38    39    40
    41    42    43    44    45    46    47    48    49    50
    51    52    53    54    55    56    57    58    59    60

result matrix:
  elements are sqared in columns:
     0   1   2   6   7
  elements are multiplied with 2 in columns:
     3   4   5   8   9


     1     4     9     8    10    12    49    64    18    20
   121   144   169    28    30    32   289   324    38    40
   441   484   529    48    50    52   729   784    58    60
   961  1024  1089    68    70    72  1369  1444    78    80
  1681  1764  1849    88    90    92  2209  2304    98   100
  2601  2704  2809   108   110   112  3249  3364   118   120

tyr strided_vector 117 lamhalt

LAM 6.5.9/MPI 2 C++ - Indiana University


I would be grateful if somebody could fix the problems in Open MPI.
Thank you very much in advance for any help.


Kind regards

Siegmar
