RE: [PATCH 1/2] Enable OpenMP efficient performance profiling via ITT tracing

Vitaly Slobodskoy Wed, 28 Oct 2020 06:56:25 -0700

Hi Jakub,

Thanks for your comments! Please see my reply below.


> > In order to optimize OpenMP workloads, it is quite important to have a
> dedicated performance analysis tool familiar with the OpenMP runtime
> specifics. The typical OpenMP performance issues are:
> > - Not all the performance-critical code is parallel, serial execution
> > significantly affects scaling (Amdahl's law)
> > - Work balance is not good, not all the cores doing useful work
> > - Overhead on synchronization, scheduling, threads creation
> 
> So, first thing is that this brings in quite large code that clearly was 
> written by
> somebody else, so the most important question is what is the upstream repo
> for it, what is the license, for the steering committee the question is if we
> can allow it in the GCC codebase or whether instead if the library is
> configured in certain way we just shouldn't require users to have that library
> installed instead.
> If it is included, we need a process of updates from the upstream repo being
> synced into GCC tree.

 [Vitaly Slobodskoy] Well, ITT API (https://github.com/intel/ittapi) is 
licensed under joint GPLv2 and 3-Clause BSD licenses, can be included within 
the GPL projects, for example, LLVM has already integrated it 
(https://github.com/llvm/llvm-project/tree/master/openmp/runtime/src/thirdparty/ittnotify).
 This is just an API in fact listing all the possible tracing routines 
definitions within the ittnotify.h header file with very small amount of logic 
responsible for loading ITT trace collector and binding the callbacks. ITT API 
is evolved primarily via adding new tracing APIs (however, this happens very 
rarely), existing APIs are not removed. As result, all the code instrumented by 
older ITT API version can be perfectly supported by the performance tools 
compiled with the latest ITT API.
So, it might not be expected to update ITT API within gcc (frequently).

> > This proposal adds new "--disable-itt-instrumentation" configure option
> which completely disables (removes) all the tracing. The tracing is ON by
> default.
> > OpenMP Imbalance time calculation is not included in this patch.
> 
> Second thing, making this on by default is a very bad idea.
> Most people will not need it and it will just slow things down.

 [Vitaly Slobodskoy] Well, opposite if tracing is off by default, almost nobody 
would be able to benefit from these capabilities, because in order to analyze 
performance of OpenMP workloads, it would be required to recompile the compiler 
to get instrumented runtime. I'd also like to underline that trace points are 
added within non-performance critical code, and the overhead of the 
instrumentation is negligible if there is no data collector - just single 
comparison of pointer with NULL for every trace point. As result, this 
instrumentation itself has negligible impact on the performance of OpenMP 
workloads.

> Also, OpenMP 5.0 adds OMPT support which is exactly meant for tracing,
> wouldn't it be better to add OMPT support and then add an ITT plugin as one
> of the perhaps multiple users of OMPT?

[Vitaly Slobodskoy] Right, this is an alternative to OMPT solution. There are 
few reasons why I'm proposing ITT-based solution:
1) There is still no OMPT support within gcc. Are there any plans to add its 
support?
2) OMPT implies higher performance overhead, especially for imbalance time 
calculation. As OMPT provides a set of dedicated callbacks, in order to 
calculate imbalance time for parallel loop, it would be required to handle 
ompt_callback_sync_region_wait callback for every thread. In case of ITT an 
imbalance time can be calculated within the runtime and callback master thread 
just once with the actual imbalance time.
3) There are existing OpenMP runtimes (LLVM OpenMP runtime, Intel Compiler 
OpenMP runtime) instrumented with ITT and tools (e.g. Intel VTune) relying on 
ITT instrumentation within OpenMP runtime.
 
> > +#ifdef ENABLE_ITT_INSTRUMENTATION
> > +static __itt_domain* s_gomp_parallel_domain = NULL; #endif
> 
> Code in libgomp proper needs to follow the GCC Coding Conventions.
> So, e.g. space before * rather than after it.

[Vitaly Slobodskoy] Sure, thanks for this and other catches!

> 
> > +#ifdef ENABLE_ITT_INSTRUMENTATION
> > +  if (__itt_frame_submit_v3_ptr)
> > +  {
> 
> { indented 2 spaces more than the if, and the body another 2.
> 
> > +    if (!s_gomp_parallel_domain)
> > +    {
> > +      s_gomp_parallel_domain = __itt_domain_create("$omp$parallel");
> > +      __itt_thread_set_name("OMP Master Thread");
> 
> Spaces before ( in calls.
> 
> > +    __itt_frame_submit_v3(s_gomp_parallel_domain, NULL,
> > +      parallel_region_begin_ts, parallel_region_end_ts);
> 
> The arguments should be aligned (i.e. parallel_region_begin_ts below
> s_gomp_parallel_domain).  Also, all the itt specific variable names should
> have itt somewhere in it.
> 
> > +#ifdef ENABLE_ITT_INSTRUMENTATION
> > +  char thread_name[30];
> > +#endif
> 
> This isn't acceptable.  Can't you instead just gomp_alloca the space for the
> string in the conditional block?

[Vitaly Slobodskoy] Yes, thanks for the suggestion!


> >  #if defined HAVE_TLS || defined USE_EMUTLS
> >    thr = &gomp_tls_data;
> > @@ -95,6 +103,14 @@ gomp_thread_start (void *xdata)
> >
> >    thr->ts.team->ordered_release[thr->ts.team_id] = &thr->release;
> >
> > +#ifdef ENABLE_ITT_INSTRUMENTATION
> > +  if (__itt_frame_submit_v3_ptr)
> > +  {
> > +   snprintf(thread_name, 30, "OMP Worker Thread #%d", data-
> >ts.team_id);
> > +   __itt_thread_set_name(thread_name);
> > +  }
> 
> Also, this doesn't really deal with nested parallelism or host teams etc.

[Vitaly Slobodskoy] Right, however this is done by design. It is not clear how 
to handle nested parallelism on performance analysis tool. It is quite 
important to distinguish serial execution with parallel execution, while 
distinguishing parallel execution with nested parallel execution has less 
importance. Usually it is needed to determine if CPU cores are fully utilized 
within the parallel region, so the imbalance time is supposed to be calculated 
properly counting nested parallelism.

> 
>       Jakub

RE: [PATCH 1/2] Enable OpenMP efficient performance profiling via ITT tracing

Reply via email to