On Mon, Jul 9, 2018 at 7:38 AM, Mark Adams <mfad...@lbl.gov> wrote:

> I agree with Matt's comment and let me add (somewhat redundantly)
>
>> This isn't how you'd write MPI, is it? No, you'd figure out how to
>> decompose your data properly to exploit locality and then implement an
>> algorithm that minimizes communication and synchronization. Do that with
>> OpenMP.
>
> I have never seen a DOE app that does this correctly: get your data model
> figured out first, then implement.
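To be concrete, what I mean by doing it right is SPMD with threads: one parallel region, each thread owns its block of the data, and sharing happens only at explicit synchronization points. A rough, untested sketch (problem size and names made up for illustration):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const long n = 1000000;               /* made-up problem size */
  double *u = malloc(n * sizeof(*u));   /* shared array, but blocks are owned */

  #pragma omp parallel
  {
    int  nt    = omp_get_num_threads(); /* plays the role of MPI_Comm_size */
    int  me    = omp_get_thread_num();  /* plays the role of MPI_Comm_rank */
    long chunk = (n + nt - 1) / nt;
    long lo    = me * chunk;
    long hi    = (lo + chunk < n) ? lo + chunk : n;

    /* each thread touches only its own block [lo, hi); nothing is shared
       except at explicit synchronization points, much like halo-exchange
       points in MPI */
    for (long i = lo; i < hi; ++i)
      u[i] = (double)i;

    #pragma omp barrier
    /* ... subsequent phases reuse the same decomposition ... */
  }

  printf("u[n-1] = %f\n", u[n - 1]);
  free(u);
  return 0;
}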
Chris Kerr's weather code (GFDL Hiram) has a single OpenMP parallel region.
He was at the last NERSC workshop I attended. You should talk to him.

> In fact, in my mind, the only advantage of OMP is that it is incremental.
> You can get something running quickly and then incrementally optimize as
> resources allow and performance demands.

This is why OpenMP has a bad reputation. We have customers that use OpenMP
holistically and get great results.

> That is nice in theory, but in practice apps just say "we did it all on
> our own without that pesky distributed memory computing" and do science.
> Fine. We all have resource limits and have to make decisions.
>
> Jeff, with your approach, you do all the hard work of distributing your
> data intelligently, which must be done regardless of PM, but you are
> probably left with a code that has more shared-memory algorithms in it
> than if you had started with the MPI side.

Good OpenMP shares very little state between threads, although I suppose it
doesn't lead to halo buffers and explicit exchanges through them. Is that a
bad thing?

> I thought *you* were one of the savants that preach shared memory code is
> just impossible to make correct for non-trivial codes, and thus hard to
> maintain.

Shared-memory programming is hard, but so is raising children. Both are
feasible with sufficient effort. I argue that:

- data sharing by default, which is implied by threading, is bad semantics,
  but largely academic if you write OpenMP properly;
- incremental OpenMP leads to excessive fork-join overhead and
  death-by-Amdahl;
- MPI's failure to support threads properly leads to a lot of stupid design
  choices in MPI+OpenMP applications;
- good OpenMP looks like MPI and requires a lot more work up front than
  incremental OpenMP.

> Case in point. I recently tried to use hypre's OMP support and we had
> numerical problems. After a week of digging I found a hypre test case (ie,
> no PETSc) that seemed to work with -O1 and failed with -O2 (the solver
> just did not converge, and valgrind seemed clean). (This was using the
> 'ij' test problem.) I then ran a PETSc test problem, with this -O1 hypre
> build, and it failed. I gave up at that point. Ulrike is in the loop and
> she agreed it looked like a compiler problem.

Sadly, I'm not aware of any bug-free compilers.

> If Intel can get this hypre test to work, they can tell me what they did
> and I can try it again in PETSc. BTW, I looked at the hypre code and they
> do not seem to do much if any fusing, etc.

Yeah, it's hard to get fusing right across subroutines. Fusing only matters
when the amount of compute is limited, though. Personally, I prefer tasks
over old-school OpenMP fusing, but the implementation support isn't ideal
right now.

> And, this is all anecdotal and I do not want to imply that OMP or hypre or
> Intel are bad in any way (in fact I like both hypre and Intel).
>
>>> Note that for BLAS 1 operations, likely the correct thing to do is
>>> turn on MKL BLAS threading (being careful to make sure the number of
>>> threads MKL uses matches that used by other parts of the code). This way
>>> we don't need to OpenMP-optimize many parts of PETSc's vector operations
>>> (norm, dot, scale, axpy). In fact, this is the first thing Mark should
>>> do: how much does it speed up the vector operations?
>>
>> BLAS1 operations are all memory-bound unless running out of cache (in
>> which case one shouldn't use threads) and compilers do a great job with
>> them. Just put the pragmas on and let the compiler do its job.
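For the archives, "just put the pragmas on" for something like axpy amounts to the following (hypothetical sketch, not the actual PETSc or MKL code):

/* y += alpha*x: memory-bound, so one combined parallel-for (+ simd)
   pragma is about all the OpenMP it needs; the compiler does the
   vectorization.  my_axpy is a made-up name. */
void my_axpy(long n, double alpha, const double *restrict x,
             double *restrict y)
{
  #pragma omp parallel for simd schedule(static)
  for (long i = 0; i < n; ++i)
    y[i] += alpha * x[i];
}

Compile with -fopenmp (or -qopenmp with the Intel compilers) and set
OMP_NUM_THREADS to match whatever thread count the rest of the code uses.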
>>> The problem is how many ECP applications actually use OpenMP just as a
>>> #pragma optimization tool, or do they use other features of OpenMP. For
>>> example, I remember Brian wanted to/did use OpenMP threads directly in
>>> BoxLib and didn't just stick to the #pragma model. If they did this,
>>> then we would need custom PETSc to match their model.
>>
>> If this implies that BoxLib will use omp-parallel and then use explicit
>> threading in a manner similar to MPI (omp_get_num_threads=MPI_Comm_size
>> and omp_get_thread_num=MPI_Comm_rank), then this is the Right Way to
>> write OpenMP.
>
> Note, Chombo (Phil Colella) split from BoxLib (John Bell) about 15 years
> ago (and added more C++), and BoxLib has been refactored into AMReX. Brian
> works with Chombo. Some staff are fungible and go between both projects. I
> don't think Brian is fungible.

If I say "OpenMP and C++ are great", will I be able to hear the swearing
all the way from Buffalo? :-)

Jeff

--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/