Hi Jakub, Thanks for your comments! Please see my reply below.
> > In order to optimize OpenMP workloads, it is quite important to have a > dedicated performance analysis tool familiar with the OpenMP runtime > specifics. The typical OpenMP performance issues are: > > - Not all the performance-critical code is parallel, serial execution > > significantly affects scaling (Amdahl's law) > > - Work balance is not good, not all the cores doing useful work > > - Overhead on synchronization, scheduling, threads creation > > So, first thing is that this brings in quite large code that clearly was > written by > somebody else, so the most important question is what is the upstream repo > for it, what is the license, for the steering committee the question is if we > can allow it in the GCC codebase or whether instead if the library is > configured in certain way we just shouldn't require users to have that library > installed instead. > If it is included, we need a process of updates from the upstream repo being > synced into GCC tree. [Vitaly Slobodskoy] Well, ITT API (https://github.com/intel/ittapi) is licensed under joint GPLv2 and 3-Clause BSD licenses, can be included within the GPL projects, for example, LLVM has already integrated it (https://github.com/llvm/llvm-project/tree/master/openmp/runtime/src/thirdparty/ittnotify). This is just an API in fact listing all the possible tracing routines definitions within the ittnotify.h header file with very small amount of logic responsible for loading ITT trace collector and binding the callbacks. ITT API is evolved primarily via adding new tracing APIs (however, this happens very rarely), existing APIs are not removed. As result, all the code instrumented by older ITT API version can be perfectly supported by the performance tools compiled with the latest ITT API. So, it might not be expected to update ITT API within gcc (frequently). > > This proposal adds new "--disable-itt-instrumentation" configure option > which completely disables (removes) all the tracing. The tracing is ON by > default. > > OpenMP Imbalance time calculation is not included in this patch. > > Second thing, making this on by default is a very bad idea. > Most people will not need it and it will just slow things down. [Vitaly Slobodskoy] Well, opposite if tracing is off by default, almost nobody would be able to benefit from these capabilities, because in order to analyze performance of OpenMP workloads, it would be required to recompile the compiler to get instrumented runtime. I'd also like to underline that trace points are added within non-performance critical code, and the overhead of the instrumentation is negligible if there is no data collector - just single comparison of pointer with NULL for every trace point. As result, this instrumentation itself has negligible impact on the performance of OpenMP workloads. > Also, OpenMP 5.0 adds OMPT support which is exactly meant for tracing, > wouldn't it be better to add OMPT support and then add an ITT plugin as one > of the perhaps multiple users of OMPT? [Vitaly Slobodskoy] Right, this is an alternative to OMPT solution. There are few reasons why I'm proposing ITT-based solution: 1) There is still no OMPT support within gcc. Are there any plans to add its support? 2) OMPT implies higher performance overhead, especially for imbalance time calculation. As OMPT provides a set of dedicated callbacks, in order to calculate imbalance time for parallel loop, it would be required to handle ompt_callback_sync_region_wait callback for every thread. In case of ITT an imbalance time can be calculated within the runtime and callback master thread just once with the actual imbalance time. 3) There are existing OpenMP runtimes (LLVM OpenMP runtime, Intel Compiler OpenMP runtime) instrumented with ITT and tools (e.g. Intel VTune) relying on ITT instrumentation within OpenMP runtime. > > +#ifdef ENABLE_ITT_INSTRUMENTATION > > +static __itt_domain* s_gomp_parallel_domain = NULL; #endif > > Code in libgomp proper needs to follow the GCC Coding Conventions. > So, e.g. space before * rather than after it. [Vitaly Slobodskoy] Sure, thanks for this and other catches! > > > +#ifdef ENABLE_ITT_INSTRUMENTATION > > + if (__itt_frame_submit_v3_ptr) > > + { > > { indented 2 spaces more than the if, and the body another 2. > > > + if (!s_gomp_parallel_domain) > > + { > > + s_gomp_parallel_domain = __itt_domain_create("$omp$parallel"); > > + __itt_thread_set_name("OMP Master Thread"); > > Spaces before ( in calls. > > > + __itt_frame_submit_v3(s_gomp_parallel_domain, NULL, > > + parallel_region_begin_ts, parallel_region_end_ts); > > The arguments should be aligned (i.e. parallel_region_begin_ts below > s_gomp_parallel_domain). Also, all the itt specific variable names should > have itt somewhere in it. > > > +#ifdef ENABLE_ITT_INSTRUMENTATION > > + char thread_name[30]; > > +#endif > > This isn't acceptable. Can't you instead just gomp_alloca the space for the > string in the conditional block? [Vitaly Slobodskoy] Yes, thanks for the suggestion! > > #if defined HAVE_TLS || defined USE_EMUTLS > > thr = &gomp_tls_data; > > @@ -95,6 +103,14 @@ gomp_thread_start (void *xdata) > > > > thr->ts.team->ordered_release[thr->ts.team_id] = &thr->release; > > > > +#ifdef ENABLE_ITT_INSTRUMENTATION > > + if (__itt_frame_submit_v3_ptr) > > + { > > + snprintf(thread_name, 30, "OMP Worker Thread #%d", data- > >ts.team_id); > > + __itt_thread_set_name(thread_name); > > + } > > Also, this doesn't really deal with nested parallelism or host teams etc. [Vitaly Slobodskoy] Right, however this is done by design. It is not clear how to handle nested parallelism on performance analysis tool. It is quite important to distinguish serial execution with parallel execution, while distinguishing parallel execution with nested parallel execution has less importance. Usually it is needed to determine if CPU cores are fully utilized within the parallel region, so the imbalance time is supposed to be calculated properly counting nested parallelism. > > Jakub