Whether or not it's related, I also get a hang on MVAPICH2 2.3 compiled with GCC 8.2, but in t_filters_parallel rather than t_mpi. With that combo, though, I also get a segfault, or at least a message about one; on the GCC 4.8 with OpenMPI 3.1.3 combo it's only “Alarm clock”. It also happens at the ~20 minute mark, FWIW.

Testing t_filters_parallel
============================
t_filters_parallel Test Log
============================
srun: job 84117363 queued and waiting for resources
srun: job 84117363 has been allocated resources
[slepner063.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: slepner063: task 0: Segmentation fault
srun: error: slepner063: tasks 1-3: Alarm clock
0.01user 0.01system 20:01.44elapsed 0%CPU (0avgtext+0avgdata 5144maxresident)k
0inputs+0outputs (0major+1524minor)pagefaults 0swaps
make[4]: *** [t_filters_parallel.chkexe_] Error 1
make[4]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
make[3]: *** [build-check-p] Error 1
make[3]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
make[2]: *** [test] Error 2
make[2]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
make[1]: *** [check-am] Error 2
make[1]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
make: *** [check-recursive] Error 1

> On Feb 21, 2019, at 3:03 PM, Gabriel, Edgar <egabr...@central.uh.edu> wrote:
>
> Yes, I was talking about the same thing, although for me it was not t_mpi,
> but t_shapesame that was hanging. It might be an indication of the same issue
> however.
>
>> -----Original Message-----
>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan
>> Novosielski
>> Sent: Thursday, February 21, 2019 1:59 PM
>> To: Open MPI Users <users@lists.open-mpi.org>
>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
>> 3.1.3
>>
>>
>>> On Feb 21, 2019, at 2:52 PM, Gabriel, Edgar <egabr...@central.uh.edu> wrote:
>>>
>>>> -----Original Message-----
>>>>> Does it always occur at 20+ minutes elapsed ?
>>>>
>>>> Aha! Yes, you are right: every time it fails, it’s at the 20 minute
>>>> and a couple of seconds mark. For comparison, every time it runs, it
>>>> runs for 2-3 seconds total. So it seems like what might actually be
>>>> happening here is a hang, and not a failure of the test per se.
>>>>
>>>
>>> I *think* I can confirm that. I compiled 3.1.3 yesterday with gcc 4.8
>>> (although this was OpenSuSE, not Redhat), and it looked to me like one of
>>> tests were hanging, but I didn't have time to investigate it further.
>>
>> Just to be clear, the hanging test I have is t_mpi from HDF5 1.10.4. The
>> OpenMPI 3.1.3 make check passes just fine on all of our builds. But I don’t
>> believe it ever launches any jobs or anything like that.
>>
>> --
>> ____
>> || \\UTGERS,
>> |---------------------------*O*---------------------------
>> ||_// the State | Ryan Novosielski - novos...@rutgers.edu
>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
>> `'
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
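
A side note on the “Alarm clock” text: on Linux that is the standard description of SIGALRM, so a failure at a fixed ~20 minute mark fits a watchdog timer killing ranks that are stuck, rather than the test itself reporting an error. The Python sketch below only illustrates that mechanism. It assumes, without confirmation from this thread, that the test harness arms an alarm-style timeout; `CHILD`, `run_hanging_child`, and `describe` are hypothetical names, and the 1-second timeout stands in for the real one. POSIX only.

```python
import signal
import subprocess
import sys

# Hypothetical child program: arm a watchdog, then "hang" forever.
# No SIGALRM handler is installed, so the default action (terminate) applies.
CHILD = """
import signal, time
signal.alarm({timeout})
while True:          # stand-in for a rank stuck waiting on its peers
    time.sleep(60)
"""


def run_hanging_child(timeout_s: int) -> int:
    """Run a child that hangs; return its status as seen by the parent."""
    proc = subprocess.run([sys.executable, "-c", CHILD.format(timeout=timeout_s)])
    return proc.returncode


def describe(returncode: int) -> str:
    """On POSIX, subprocess reports death by signal N as returncode -N."""
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    print(describe(run_hanging_child(1)))
```

A launcher such as srun or a shell would report the same death-by-SIGALRM as “Alarm clock”, which would also explain why the elapsed time is nearly identical on every failing run.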