Hi,

I have a problem with groups and communicators in openmpi-1.9a1r27787 with Java. I want to multiply two matrices with an arbitrary number of processes. If I start more than n processes (n is the number of rows of matrix "a"), I build a new group of worker processes; if I start at most n processes, I use all of them.
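The intended split can be sketched in plain Java without any MPI calls. This is only a minimal sketch under stated assumptions: the class "WorkerSplit" and its methods are hypothetical names, and it assumes that the first n ranks become the workers, which the program excerpt does not show.

```java
import java.util.ArrayList;
import java.util.List;

/* Plain-Java model of the intended split; no MPI calls are made.
 * "WorkerSplit" and its methods are hypothetical names.
 */
final class WorkerSplit {

  /* Ranks that do the work: all ranks if numProcs <= rows, otherwise
   * only the first "rows" ranks (assumption, not taken from the program).
   */
  static List<Integer> workerRanks (int numProcs, int rows) {
    List<Integer> ranks = new ArrayList<> ();
    for (int r = 0; r < Math.min (numProcs, rows); ++r) {
      ranks.add (r);
    }
    return ranks;
  }

  /* Ranks of the world group which are not workers, i.e., the
   * difference "world group" minus "worker group".
   */
  static List<Integer> otherRanks (int numProcs, int rows) {
    List<Integer> ranks = new ArrayList<> ();
    for (int r = Math.min (numProcs, rows); r < numProcs; ++r) {
      ranks.add (r);
    }
    return ranks;
  }

  public static void main (String[] args) {
    int rows = 4;                       /* matrix "a" is a (4,6)-matrix */
    for (int numProcs : new int[] {4, 6}) {
      System.out.println ("-np " + numProcs
                          + ": workers " + workerRanks (numProcs, rows)
                          + ", others " + otherRanks (numProcs, rows));
    }
  }
}
```

In this model the list of other ranks is empty for "-np 4" and contains the ranks 4 and 5 for "-np 6", which is the behaviour that I expect from the group difference.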
My program contains the following code.

...
    /* Create group "groupWorker" */
    groupWorker = groupCommWorld.Incl (group_w_mem);
  } else {
    /* there are at most as many processes as rows in matrix "a",
     * i.e., we can use the "basic group"
     */
    groupWorker = groupCommWorld;
  }
  /* Create group "groupOther" which demonstrates only how to use
   * another group operation and which has nothing to do in this
   * program.
   */
  groupOther = Group.Difference (groupCommWorld, groupWorker);
  if (groupOther == MPI.GROUP_EMPTY) {
    System.out.println ("groupOther is empty.");
  } else {
    System.out.println ("groupOther is not empty.");
  }
  groupCommWorld.finalize ();
  /* Create communicators for both groups. The communicator is only
   * defined for all processes of the group and it is undefined
   * (MPI.COMM_NULL) for all other processes.
   */
  COMM_WORKER = MPI.COMM_WORLD.Creat (groupWorker);
  COMM_OTHER = MPI.COMM_WORLD.Creat (groupOther);
...

Shouldn't "MPI.COMM_WORLD.Creat" be "MPI.COMM_WORLD.Create"?

"groupOther" should be empty if I use "-np 4". Unfortunately it isn't.

tyr java 112 ompi_info | grep "Open MPI:"
Open MPI: 1.9a1r27787
tyr java 113 mpijavac MatMultWithAnyProc2DarrayIn1DarrayMain.java
tyr java 114 mpiexec -np 4 java MatMultWithAnyProc2DarrayIn1DarrayMain
groupOther is not empty.
[tyr:25128] *** An error occurred in MPI_Comm_create
[tyr:25128] *** reported by process [3288334337,0]
[tyr:25128] *** on communicator MPI_COMM_WORLD
[tyr:25128] *** MPI_ERR_GROUP: invalid group
[tyr:25128] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[tyr:25128] *** and potentially your MPI job)
...

Everything works fine if I use "-np 6". I have removed some lines so that the output is more readable.

tyr java 115 mpiexec -np 6 java MatMultWithAnyProc2DarrayIn1DarrayMain
groupOther is not empty.

(4,6)-matrix a:

    1.00    2.00    3.00    4.00    5.00    6.00
    7.00    8.00    9.00   10.00   11.00   12.00
   13.00   14.00   15.00   16.00   17.00   18.00
   19.00   20.00   21.00   22.00   23.00   24.00

(6,8)-matrix b:

   48.00   47.00   46.00   45.00   44.00   43.00   42.00   41.00
   40.00   39.00   38.00   37.00   36.00   35.00   34.00   33.00
   32.00   31.00   30.00   29.00   28.00   27.00   26.00   25.00
   24.00   23.00   22.00   21.00   20.00   19.00   18.00   17.00
   16.00   15.00   14.00   13.00   12.00   11.00   10.00    9.00
    8.00    7.00    6.00    5.00    4.00    3.00    2.00    1.00

(4,8)-result-matrix c = a * b:

  448.00  427.00  406.00  385.00  364.00  343.00  322.00  301.00
 1456.00 1399.00 1342.00 1285.00 1228.00 1171.00 1114.00 1057.00
 2464.00 2371.00 2278.00 2185.00 2092.00 1999.00 1906.00 1813.00
 3472.00 3343.00 3214.00 3085.00 2956.00 2827.00 2698.00 2569.00

It seems that I'm not allowed to do

  groupWorker = groupCommWorld;
  ...
  groupOther = Group.Difference (groupCommWorld, groupWorker);

or that Group.Difference doesn't return MPI.GROUP_EMPTY.

I have a similar program in C which also doesn't work with Open MPI (I get the same error for openmpi-1.6.4 and 1.9).

tyr strided_vector 109 ompi_info | grep "Open MPI:"
Open MPI: 1.6.4a1r27643

tyr strided_vector 108 ompi_info | grep "Open MPI:"
Open MPI: 1.9a1r27787

tyr strided_vector 108 mpiexec -np 4 data_type_4
Process 0 of 4 running on tyr.informatik.hs-fulda.de
Process 1 of 4 running on tyr.informatik.hs-fulda.de
Process 2 of 4 running on tyr.informatik.hs-fulda.de
Process 3 of 4 running on tyr.informatik.hs-fulda.de

original matrix:

    1    2    3    4    5    6    7    8    9   10
   11   12   13   14   15   16   17   18   19   20
   21   22   23   24   25   26   27   28   29   30
   31   32   33   34   35   36   37   38   39   40
   41   42   43   44   45   46   47   48   49   50
   51   52   53   54   55   56   57   58   59   60

result matrix:
  elements are sqared in columns:           0 1 2 6 7
  elements are multiplied with 2 in columns: 3 4 5 8 9

    1    4    9    8   10   12   49   64   18   20
  121  144  169   28   30   32  289  324   38   40
  441  484  529   48   50   52  729  784   58   60
  961 1024 1089   68   70   72 1369 1444   78   80
 1681 1764 1849   88   90   92 2209 2304   98  100
 2601 2704 2809  108  110  112 3249 3364  118  120

Assertion failed: OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (comm->c_remote_group) )->obj_magic_id, file ../../openmpi-1.6.4a1r27643/ompi/communicator/comm_init.c, line 412
[tyr:24415] *** Process received signal ***
Assertion failed: OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (comm->c_remote_group) )->obj_magic_id, file ../../openmpi-1.6.4a1r27643/ompi/communicator/comm_init.c, line 412
[tyr:24415] Signal: Abort (6)
[tyr:24415] Signal code: (-1)
...

The program works as expected if I use LAM-MPI.

tyr strided_vector 115 lamboot

LAM 6.5.9/MPI 2 C++ - Indiana University

tyr strided_vector 116 mpirun -np 4 data_type_4
Process 0 of 4 running on tyr.informatik.hs-fulda.de
Process 1 of 4 running on tyr.informatik.hs-fulda.de
Process 2 of 4 running on tyr.informatik.hs-fulda.de
Process 3 of 4 running on tyr.informatik.hs-fulda.de

original matrix:

    1    2    3    4    5    6    7    8    9   10
   11   12   13   14   15   16   17   18   19   20
   21   22   23   24   25   26   27   28   29   30
   31   32   33   34   35   36   37   38   39   40
   41   42   43   44   45   46   47   48   49   50
   51   52   53   54   55   56   57   58   59   60

result matrix:
  elements are sqared in columns:           0 1 2 6 7
  elements are multiplied with 2 in columns: 3 4 5 8 9

    1    4    9    8   10   12   49   64   18   20
  121  144  169   28   30   32  289  324   38   40
  441  484  529   48   50   52  729  784   58   60
  961 1024 1089   68   70   72 1369 1444   78   80
 1681 1764 1849   88   90   92 2209 2304   98  100
 2601 2704 2809  108  110  112 3249 3364  118  120

tyr strided_vector 117 lamhalt

LAM 6.5.9/MPI 2 C++ - Indiana University

I would be grateful if somebody could fix the problems in Open MPI. Thank you very much in advance for any help.

Kind regards

Siegmar