> -----Original Message-----
> > Does it always occur at 20+ minutes elapsed ?
>
> Aha! Yes, you are right: every time it fails, it’s at the 20-minute-and-a-
> couple-of-seconds mark. For comparison, every time it runs, it runs for
> 2-3 seconds total. So it seems like what might actually be happening here
> is a hang, and not a failure of the test per se.
>

I *think* I can confirm that. I compiled 3.1.3 yesterday with gcc 4.8
(although this was OpenSuSE, not Redhat), and it looked to me like one of
the tests was hanging, but I didn't have time to investigate it further.
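If anyone gets a chance to poke at a hung run before that, attaching a
debugger to one of the stuck t_mpi ranks should at least tell us whether
they are spinning in the MPI-IO layer or sitting in a collective. A rough
sketch only (the node name is just whatever node squeue reports for the
job; pgrep and gdb need to be available there):

  ssh <node-with-the-hung-job>
  gdb -p $(pgrep -n t_mpi)      # attach to the most recently started rank
  (gdb) thread apply all bt     # backtraces for all threads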
Thanks
Edgar

> > Is there some mechanism that automatically kills a job if it does not
> > write anything to stdout for some time ?
> >
> > A quick way to rule that out is to
> >
> > srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 sleep 1800
> >
> > and see if that completes or gets killed with the same error message.
>
> I was not aware of anything like that, but I’ll look into it now (running
> your suggestion). I guess we don’t run across this sort of thing very
> often — most stuff at least prints output when it starts.
>
> > You can also use mpirun instead of srun, and even run mpirun outside of
> > Slurm (if your cluster policy allows it, you can for example use mpirun
> > and run on the frontend node).
>
> I’m on the team that manages the cluster, so we can try various things.
> Every piece of software we ever run, though, runs via srun — we don’t
> provide mpirun as a matter of course, except in some corner cases.
>
> > On 2/21/2019 3:01 AM, Ryan Novosielski wrote:
> >> Does it make any sense that it seems to work fine when OpenMPI and
> >> HDF5 are built with GCC 7.4 and GCC 8.2, but /not/ when they are built
> >> with the RHEL-supplied GCC 4.8.5? That appears to be the scenario. For
> >> the GCC 4.8.5 build, I did try an XFS filesystem and it didn’t help.
> >> GPFS works fine for either of the 7.4 and 8.2 builds.
> >>
> >> Just as a reminder, since it was reasonably far back in the thread,
> >> what I’m doing is running the “make check” tests in HDF5 1.10.4, in
> >> part because users use it, but also because it seems to have a good
> >> test suite and I can therefore verify the compiler and MPI stack
> >> installs. I get very little information, apart from it not working and
> >> getting that “Alarm clock” message.
> >>
> >> I originally suspected I’d somehow built some component of this with a
> >> host-specific optimization that wasn’t working on some compute nodes.
> >> But I controlled for that and it didn’t seem to make any difference.
> >>
> >>> On Feb 18, 2019, at 1:34 PM, Ryan Novosielski <novos...@rutgers.edu> wrote:
> >>>
> >>> It didn’t work any better with XFS, as it happens. Must be something
> >>> else. I’m going to test some more and see if I can narrow it down
> >>> any, as it seems to me that it did work with a different compiler.
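(For reference, "running the tests manually" on my side just means
launching the individual binaries from the testpar directory of the build
tree, roughly as below; the build path is a placeholder for wherever your
tree lives:

  cd <hdf5-build>/testpar
  mpirun -np 6 ./t_mpi

Nothing more elaborate than that, which also makes a hang easy to spot and
attach to.)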
> >>>> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar <egabr...@central.uh.edu> wrote:
> >>>>
> >>>> While I was working on something else, I let the tests run with Open
> >>>> MPI master (which is, for parallel I/O, equivalent to the upcoming
> >>>> v4.0.1 release), and here is what I found for the HDF5 1.10.4 tests
> >>>> on my local desktop:
> >>>>
> >>>> In the testpar directory, there is in fact one test that fails for
> >>>> both ompio and romio321 in exactly the same manner. I used 6
> >>>> processes as you did (although I used mpirun directly instead of
> >>>> srun...). Of the 13 tests in the testpar directory, 12 pass correctly
> >>>> (t_bigio, t_cache, t_cache_image, testphdf5, t_filters_parallel,
> >>>> t_init_term, t_mpi, t_pflush2, t_pread, t_prestart, t_pshutdown,
> >>>> t_shapesame).
> >>>>
> >>>> The one test that officially fails (t_pflush1) actually reports that
> >>>> it passed, but then throws a message indicating that MPI_Abort has
> >>>> been called, for both ompio and romio. I will try to investigate this
> >>>> test to see what is going on.
> >>>>
> >>>> That being said, your report shows an issue in t_mpi, which passes
> >>>> without problems for me. This was, however, not GPFS; this was an XFS
> >>>> local file system. Running the tests on GPFS is on my todo list as
> >>>> well.
> >>>>
> >>>> Thanks
> >>>> Edgar
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
> >>>>> Gabriel, Edgar
> >>>>> Sent: Sunday, February 17, 2019 10:34 AM
> >>>>> To: Open MPI Users <users@lists.open-mpi.org>
> >>>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
> >>>>>
> >>>>> I will also run our testsuite and the HDF5 testsuite on GPFS; I
> >>>>> recently got access to a GPFS file system and will report back on
> >>>>> that, but it will take a few days.
> >>>>>
> >>>>> Thanks
> >>>>> Edgar
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
> >>>>>> Ryan Novosielski
> >>>>>> Sent: Sunday, February 17, 2019 2:37 AM
> >>>>>> To: users@lists.open-mpi.org
> >>>>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
> >>>>>>
> >>>>>> This is on GPFS. I'll try it on XFS to see if it makes any
> >>>>>> difference.
> >>>>>>
> >>>>>> On 2/16/19 11:57 PM, Gilles Gouaillardet wrote:
> >>>>>>> Ryan,
> >>>>>>>
> >>>>>>> What filesystem are you running on ?
> >>>>>>>
> >>>>>>> Open MPI defaults to the ompio component, except on Lustre
> >>>>>>> filesystems, where ROMIO is used. If the issue is related to
> >>>>>>> ROMIO, that can explain why you did not see any difference; in
> >>>>>>> that case, you might want to try another filesystem (a local
> >>>>>>> filesystem or NFS, for example).
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Gilles
> >>>>>>>
> >>>>>>> On Sun, Feb 17, 2019 at 3:08 AM Ryan Novosielski
> >>>>>>> <novos...@rutgers.edu> wrote:
> >>>>>>>> I verified that it makes it through to a bash prompt, but I’m a
> >>>>>>>> little less confident that something “make test” does doesn’t
> >>>>>>>> clear it. Any recommendation for a way to verify?
> >>>>>>>>
> >>>>>>>> In any case, no change, unfortunately.
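(On "a way to verify": for the OMPI_MCA_io=^ompio setting Gilles suggests
further down, one quick check is to launch a single task that just prints
its environment, reusing the same srun options already in use here; this is
only a sketch, nothing HDF5-specific:

  export OMPI_MCA_io=^ompio
  srun --mpi=pmi2 -p main -t 1:00:00 -n1 -N1 env | grep OMPI_MCA_io

If nothing is printed, srun is not forwarding the variable, and something
like srun --export=ALL, or setting it inside the job script, would be
needed.)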
> >>>>>>>>> On Feb 16, 2019, at 08:13, Gabriel, Edgar <egabr...@central.uh.edu> wrote:
> >>>>>>>>>
> >>>>>>>>> What file system are you running on?
> >>>>>>>>>
> >>>>>>>>> I will look into this, but it might be later next week. I just
> >>>>>>>>> wanted to emphasize that we are regularly running the parallel
> >>>>>>>>> hdf5 tests with ompio, and I am not aware of any outstanding
> >>>>>>>>> items that do not work (and are supposed to work). That being
> >>>>>>>>> said, I run the tests manually, and not the 'make test'
> >>>>>>>>> commands. I will have to check which tests are being run by
> >>>>>>>>> that.
> >>>>>>>>>
> >>>>>>>>> Edgar
> >>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: users [mailto:users-boun...@lists.open-mpi.org] On
> >>>>>>>>>> Behalf Of Gilles Gouaillardet
> >>>>>>>>>> Sent: Saturday, February 16, 2019 1:49 AM
> >>>>>>>>>> To: Open MPI Users <users@lists.open-mpi.org>
> >>>>>>>>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
> >>>>>>>>>>
> >>>>>>>>>> Ryan,
> >>>>>>>>>>
> >>>>>>>>>> Can you
> >>>>>>>>>>
> >>>>>>>>>> export OMPI_MCA_io=^ompio
> >>>>>>>>>>
> >>>>>>>>>> and try again after you have made sure this environment
> >>>>>>>>>> variable is passed by srun to the MPI tasks?
> >>>>>>>>>>
> >>>>>>>>>> We have identified and fixed several issues specific to the
> >>>>>>>>>> (default) ompio component, so that could be a valid workaround
> >>>>>>>>>> until the next release.
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>>
> >>>>>>>>>> Gilles
> >>>>>>>>>>
> >>>>>>>>>> Ryan Novosielski <novos...@rutgers.edu> wrote:
> >>>>>>>>>>> Hi there,
> >>>>>>>>>>>
> >>>>>>>>>>> Honestly, I don’t know which piece of this puzzle to look at
> >>>>>>>>>>> or how to get more information for troubleshooting. I
> >>>>>>>>>>> successfully built HDF5 1.10.4 with the RHEL system GCC 4.8.5
> >>>>>>>>>>> and OpenMPI 3.1.3. Running the “make check” in HDF5 fails at
> >>>>>>>>>>> the point below; I am using a value of
> >>>>>>>>>>> RUNPARALLEL='srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1'
> >>>>>>>>>>> and have a Slurm setup that is otherwise properly configured.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for any help you can provide.
> >>>>>>>>>>> make[4]: Entering directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
> >>>>>>>>>>> ============================
> >>>>>>>>>>> Testing t_mpi
> >>>>>>>>>>> ============================
> >>>>>>>>>>> t_mpi Test Log
> >>>>>>>>>>> ============================
> >>>>>>>>>>> srun: job 84126610 queued and waiting for resources
> >>>>>>>>>>> srun: job 84126610 has been allocated resources
> >>>>>>>>>>> srun: error: slepner023: tasks 0-5: Alarm clock
> >>>>>>>>>>> 0.01user 0.00system 20:03.95elapsed 0%CPU (0avgtext+0avgdata 5152maxresident)k
> >>>>>>>>>>> 0inputs+0outputs (0major+1529minor)pagefaults 0swaps
> >>>>>>>>>>> make[4]: *** [t_mpi.chkexe_] Error 1
> >>>>>>>>>>> make[4]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
> >>>>>>>>>>> make[3]: *** [build-check-p] Error 1
> >>>>>>>>>>> make[3]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
> >>>>>>>>>>> make[2]: *** [test] Error 2
> >>>>>>>>>>> make[2]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
> >>>>>>>>>>> make[1]: *** [check-am] Error 2
> >>>>>>>>>>> make[1]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
> >>>>>>>>>>> make: *** [check-recursive] Error 1
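(One more thing that might help narrow it down: the failing binary can be
run on its own, outside of "make check", with the same RUNPARALLEL line and
the build directory from the log above, roughly:

  cd /scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar
  srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 ./t_mpi

and, for comparison, the same run with OMPI_MCA_io=^ompio exported as
suggested above. If the standalone run also goes quiet and then dies with
"Alarm clock" around the 20-minute mark, my guess would be a timeout built
into the test itself rather than anything in the make harness, but that is
only a guess at this point.)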
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users