> -----Original Message-----
> > Does it always occur at 20+ minutes elapsed?
> 
> Aha! Yes, you are right: every time it fails, it’s at the 20-minute (and a
> couple of seconds) mark. For comparison, every time it passes, it runs for
> 2-3 seconds total. So it seems like what might actually be happening here
> is a hang, and not a failure of the test per se.
> 

I *think* I can confirm that. I compiled 3.1.3 yesterday with gcc 4.8 (although 
this was openSUSE, not Red Hat), and it looked to me like one of the tests was 
hanging, but I didn't have time to investigate it further.
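
The 20-minute mark is suspicious on its own: 20 minutes is 1200 seconds, and
"Alarm clock" is what the shell prints when a process is killed by SIGALRM,
which points at a watchdog timer inside the test program itself rather than at
Slurm. A quick way to check that theory against the 1.10.4 sources (the
HDF5_ALARM_SECONDS override below is an assumption on my part, so verify it
against whatever the grep turns up):

  # Sketch: look for a SIGALRM watchdog in the HDF5 test harness, which would
  # explain the consistent ~20-minute (1200 s) "Alarm clock" kills.
  # Run this in the HDF5 1.10.4 source tree:
  grep -rn "alarm" test/h5test.h test/h5test.c testpar/ | head

  # If an override such as HDF5_ALARM_SECONDS shows up in that output (again,
  # an assumption), raising it in the build tree turns the kill into a plain
  # hang, which is easier to attach a debugger to on the compute node:
  HDF5_ALARM_SECONDS=7200 make check -C testpar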

Thanks
Edgar

> > Is there some mechanism that automatically kills a job if it does not write
> > anything to stdout for some time?
> >
> > A quick way to rule that out is to
> >
> > srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 sleep 1800
> >
> > and see if that completes or gets killed with the same error message.
> 
> I was not aware of anything like that, but I’ll look into it now (running your
> suggestion). I guess we don’t run across this sort of thing very often — most
> stuff at least prints output when it starts.
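
In addition to Gilles' sleep test, two quick checks for a site-side watchdog; a
sketch only, since which (if any) of these limits your site actually sets is an
assumption:

  # Look for any Slurm-side limit that could fire at a fixed wall time.
  scontrol show config | grep -iE 'timeout|killwait'

  # After a failed run, the accounting record shows how the job ended
  # (cancelled, timed out, or killed by a signal); job id taken from your log.
  sacct -j 84126610 --format=JobID,JobName,State,ExitCode,Elapsed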
> 
> > You can also use mpirun instead of srun, and even run mpirun
> > outside of Slurm
> >
> > (if your cluster policy allows it, you can for example use mpirun and
> > run on the frontend node)
> 
> I’m on the team that manages the cluster, so we can try various things. Every
> piece of software we ever run, though, runs via srun — we don’t provide
> mpirun as a matter of course, except in some corner cases.
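
Even so, running the one failing test by hand is a useful data point, because
it takes srun (and any prolog/epilog machinery) out of the picture entirely. A
minimal sketch, assuming the mpirun from the same Open MPI 3.1.3 install is on
your PATH and node policy allows it:

  # Run the failing test directly from the build tree, bypassing srun and make.
  cd /scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar
  mpirun -np 6 ./t_mpi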
> 
> > On 2/21/2019 3:01 AM, Ryan Novosielski wrote:
> >> Does it make any sense that it seems to work fine when OpenMPI and
> >> HDF5 are built with GCC 7.4 and GCC 8.2, but /not/ when they are built with
> >> RHEL-supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5
> >> build, I did try an XFS filesystem and it didn’t help. GPFS works fine for
> >> either of the 7.4 and 8.2 builds.
> >>
> >> Just as a reminder, since it was reasonably far back in the thread, what
> >> I’m doing is running the “make check” tests in HDF5 1.10.4, in part because
> >> users use it, but also because it seems to have a good test suite and I can
> >> therefore verify the compiler and MPI stack installs. I get very little
> >> information, apart from it not working and getting that “Alarm clock”
> >> message.
> >>
> >> I originally suspected I’d somehow built some component of this with a
> >> host-specific optimization that wasn’t working on some compute nodes. But I
> >> controlled for that and it didn’t seem to make any difference.
> >>
> >>> On Feb 18, 2019, at 1:34 PM, Ryan Novosielski <novos...@rutgers.edu> wrote:
> >>>
> >>> It didn’t work any better with XFS, as it happens. Must be something
> >>> else. I’m going to test some more and see if I can narrow it down any, as
> >>> it seems to me that it did work with a different compiler.
> >>>
> >>>
> >>>> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar <egabr...@central.uh.edu> wrote:
> >>>>
> >>>> While I was working on something else, I let the tests run with Open
> >>>> MPI master (which, for parallel I/O, is equivalent to the upcoming
> >>>> v4.0.1 release), and here is what I found for the HDF5 1.10.4 tests on
> >>>> my local desktop:
> >>>>
> >>>> In the testpar directory, there is in fact one test that fails for both
> >>>> ompio and romio321 in exactly the same manner. I used 6 processes as you
> >>>> did (although I used mpirun directly instead of srun...). Of the 13
> >>>> tests in the testpar directory, 12 pass correctly (t_bigio, t_cache,
> >>>> t_cache_image, testphdf5, t_filters_parallel, t_init_term, t_mpi,
> >>>> t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame).
> >>>>
> >>>> The one test that officially fails (t_pflush1) actually reports that it
> >>>> passed, but then throws a message indicating that MPI_Abort has been
> >>>> called, for both ompio and romio. I will try to investigate this test to
> >>>> see what is going on.
> >>>>
> >>>> That being said, your report shows an issue in t_mpi, which passes
> >>>> without problems for me. This was, however, not on GPFS but on a local
> >>>> XFS file system. Running the tests on GPFS is on my todo list as well.
> >>>>
> >>>> Thanks
> >>>> Edgar
> >>>>
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gabriel, Edgar
> >>>>> Sent: Sunday, February 17, 2019 10:34 AM
> >>>>> To: Open MPI Users <users@lists.open-mpi.org>
> >>>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
> >>>>>
> >>>>> I will also run our test suite and the HDF5 test suite on GPFS; I
> >>>>> recently got access to a GPFS file system and will report back on
> >>>>> that, but it will take a few days.
> >>>>>
> >>>>> Thanks
> >>>>> Edgar
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan Novosielski
> >>>>>> Sent: Sunday, February 17, 2019 2:37 AM
> >>>>>> To: users@lists.open-mpi.org
> >>>>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
> >>>>>>
> >>>>>> This is on GPFS. I'll try it on XFS to see if it makes any difference.
> >>>>>>
> >>>>>> On 2/16/19 11:57 PM, Gilles Gouaillardet wrote:
> >>>>>>> Ryan,
> >>>>>>>
> >>>>>>> What filesystem are you running on?
> >>>>>>>
> >>>>>>> Open MPI defaults to the ompio component, except on Lustre
> >>>>>>> filesystems, where ROMIO is used. If the issue is related to
> >>>>>>> ROMIO, that could explain why you did not see any difference;
> >>>>>>> in that case, you might want to try another filesystem (a local
> >>>>>>> filesystem or NFS, for example).
> >>>>>>>
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Gilles
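
To compare the two MPI-IO components head to head on a single test when
launching with mpirun from the testpar build directory, something like the
following sketch works; the ROMIO component name varies by Open MPI release,
so treat the names below as examples and check what your build actually
provides:

  # List the MPI-IO components available in this Open MPI build.
  ompi_info | grep "MCA io"

  # Force each component in turn on the same failing test.
  mpirun -np 6 --mca io ompio    ./t_mpi
  mpirun -np 6 --mca io romio321 ./t_mpi   # or romio314, per the list above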
> >>>>>>>
> >>>>>>> On Sun, Feb 17, 2019 at 3:08 AM Ryan Novosielski
> >>>>>>> <novos...@rutgers.edu> wrote:
> >>>>>>>> I verified that it makes it through to a bash prompt, but I’m a
> >>>>>>>> little less confident that something 'make test' does doesn’t clear
> >>>>>>>> it along the way. Any recommendation for a way to verify?
> >>>>>>>>
> >>>>>>>> In any case, no change, unfortunately.
> >>>>>>>>
> >>>>>>>> Sent from my iPhone
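
One lightweight way to verify that the OMPI_MCA_io=^ompio setting (suggested
further down in the quoted thread) survives srun, using the same srun options
as elsewhere in this thread (a sketch; adjust partition and limits to taste):

  # Confirm the MCA override actually reaches the remote MPI tasks.
  export OMPI_MCA_io=^ompio
  srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 env | grep OMPI_MCA_io

  # Expect one "OMPI_MCA_io=^ompio" line per task; if it is missing, add
  # --export=ALL, or check for a task prolog that scrubs the environment.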
> >>>>>>>>
> >>>>>>>>> On Feb 16, 2019, at 08:13, Gabriel, Edgar <egabr...@central.uh.edu> wrote:
> >>>>>>>>>
> >>>>>>>>> What file system are you running on?
> >>>>>>>>>
> >>>>>>>>> I will look into this, but it might be later next week. I just
> >>>>>>>>> wanted to emphasize that we are regularly running the parallel
> >>>>>>>>> HDF5 tests with ompio, and I am not aware of any outstanding
> >>>>>>>>> items that do not work (and are supposed to work). That being
> >>>>>>>>> said, I run the tests manually, rather than via the 'make test'
> >>>>>>>>> commands. I will have to check which tests are being run by that.
> >>>>>>>>>
> >>>>>>>>> Edgar
> >>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles Gouaillardet
> >>>>>>>>>> Sent: Saturday, February 16, 2019 1:49 AM
> >>>>>>>>>> To: Open MPI Users <users@lists.open-mpi.org>
> >>>>>>>>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
> >>>>>>>>>>
> >>>>>>>>>> Ryan,
> >>>>>>>>>>
> >>>>>>>>>> Can you
> >>>>>>>>>>
> >>>>>>>>>> export OMPI_MCA_io=^ompio
> >>>>>>>>>>
> >>>>>>>>>> and try again after you have made sure this environment variable
> >>>>>>>>>> is passed by srun to the MPI tasks?
> >>>>>>>>>>
> >>>>>>>>>> We have identified and fixed several issues specific to the
> >>>>>>>>>> (default) ompio component, so that could be a valid
> >>>>>>>>>> workaround until the next release.
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>>
> >>>>>>>>>> Gilles
> >>>>>>>>>>
> >>>>>>>>>> Ryan Novosielski <novos...@rutgers.edu> wrote:
> >>>>>>>>>>> Hi there,
> >>>>>>>>>>>
> >>>>>>>>>>> Honestly don’t know which piece of this puzzle to look at or how
> >>>>>>>>>>> to get more information for troubleshooting. I successfully built
> >>>>>>>>>>> HDF5 1.10.4 with RHEL system GCC 4.8.5 and OpenMPI 3.1.3. Running
> >>>>>>>>>>> the “make check” in HDF5 is failing at the point below; I am
> >>>>>>>>>>> using a value of RUNPARALLEL='srun --mpi=pmi2 -p main -t 1:00:00
> >>>>>>>>>>> -n6 -N1' and have a Slurm installation that is otherwise properly
> >>>>>>>>>>> configured.
> >>>>>>>>>>> Thanks for any help you can provide.
> >>>>>>>>>>>
> >>>>>>>>>>> make[4]: Entering directory
> >>>>>>>>>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
> >>>>>>>>>>> ============================
> >>>>>>>>>>> Testing  t_mpi
> >>>>>>>>>>> ============================
> >>>>>>>>>>>  t_mpi  Test Log
> >>>>>>>>>>> ============================
> >>>>>>>>>>> srun: job 84126610 queued and waiting for resources
> >>>>>>>>>>> srun: job 84126610 has been allocated resources
> >>>>>>>>>>> srun: error: slepner023: tasks 0-5: Alarm clock
> >>>>>>>>>>> 0.01user 0.00system 20:03.95elapsed 0%CPU (0avgtext+0avgdata
> >>>>>>>>>>> 5152maxresident)k 0inputs+0outputs (0major+1529minor)pagefaults 0swaps
> >>>>>>>>>>> make[4]: *** [t_mpi.chkexe_] Error 1
> >>>>>>>>>>> make[4]: Leaving directory
> >>>>>>>>>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
> >>>>>>>>>>> make[3]: *** [build-check-p] Error 1
> >>>>>>>>>>> make[3]: Leaving directory
> >>>>>>>>>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
> >>>>>>>>>>> make[2]: *** [test] Error 2
> >>>>>>>>>>> make[2]: Leaving directory
> >>>>>>>>>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
> >>>>>>>>>>> make[1]: *** [check-am] Error 2
> >>>>>>>>>>> make[1]: Leaving directory
> >>>>>>>>>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
> >>>>>>>>>>> make: *** [check-recursive] Error 1
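
For anyone who wants to reproduce this setup, here is a sketch of how a
RUNPARALLEL value like the one above is typically wired into an HDF5 1.10.4
build, under the assumption that configure honors RUNPARALLEL as described in
HDF5's parallel install notes; adjust the compiler wrapper and srun options to
your site:

  cd hdf5-1.10.4
  CC=mpicc RUNPARALLEL='srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1' \
      ./configure --enable-parallel
  make -j 8 && make check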
> >>>>>>>>>>>

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
