> From: Morten Brørup
> Sent: Tuesday, 28 February 2023 14.16
> 
> > From: Konstantin Ananyev [mailto:konstantin.anan...@huawei.com]
> > Sent: Tuesday, 28 February 2023 13.05
> >
> > > > > >> Add support for programming PMU counters and reading their values
> in
> > > > > >> runtime bypassing kernel completely.
> > > > > >>
> > > > > >> This is especially useful in cases where CPU cores are isolated i.e
> > > > > >> run dedicated tasks. In such cases one cannot use standard perf
> > > > > >> utility without sacrificing latency and performance.
> > > > > >>
> > > > > >> Signed-off-by: Tomasz Duszynski <tduszyn...@marvell.com>
> > > > > >> Acked-by: Morten Brørup <m...@smartsharesystems.com>
> > > > > >
> > >
> > > [...]
> > >
> > > > > >> +int
> > > > > >> +__rte_pmu_enable_group(void)
> > > > > >> +{
> > > > > >> +  struct rte_pmu_event_group *group = 
> > > > > >> &RTE_PER_LCORE(_event_group);
> > > > > >> +  int ret;
> > > > > >> +
> > > > > >> +  if (rte_pmu.num_group_events == 0)
> > > > > >> +          return -ENODEV;
> > > > > >> +
> > > > > >> +  ret = open_events(group);
> > > > > >> +  if (ret)
> > > > > >> +          goto out;
> > > > > >> +
> > > > > >> +  ret = mmap_events(group);
> > > > > >> +  if (ret)
> > > > > >> +          goto out;
> > > > > >> +
> > > > > >> +  if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET,
> > PERF_IOC_FLAG_GROUP) == -
> > > > 1) {
> > > > > >> +          ret = -errno;
> > > > > >> +          goto out;
> > > > > >> +  }
> > > > > >> +
> > > > > >> +  if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE,
> > PERF_IOC_FLAG_GROUP) ==
> > > > -1) {
> > > > > >> +          ret = -errno;
> > > > > >> +          goto out;
> > > > > >> +  }
> > > > > >> +
> > > > > >> +  rte_spinlock_lock(&rte_pmu.lock);
> > > > > >> +  TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
> > > > > >
> > > > > >Hmm.. so we insert pointer to TLS variable into the global list?
> > > > > >Wonder what would happen if that thread get terminated?
> > > > >
> > > > > Nothing special. Any pointers to that thread-local in that thread are
> > > > invalided.
> > > > >
> > > > > >Can memory from its TLS block get re-used (by other thread or for
> other
> > > > purposes)?
> > > > > >
> > > > >
> > > > > Why would any other thread reuse that?
> > > > > Eventually main thread will need that data to do the cleanup.
> > > >
> > > > I understand that main thread would need to access that data.
> > > > I am not sure that it would be able to.
> > > > Imagine thread calls rte_pmu_read(...) and then terminates, while
> program
> > > > continues to run.
> > >
> > > Is the example you describe here (i.e. a thread terminating in the middle
> of
> > doing something) really a scenario DPDK is supposed to
> > > support?
> >
> > I am not talking about some abnormal termination.
> 
> Then I misunderstood your example; I thought you meant the tread was
> terminated while inside the rte_pmu_read() function.
> 
> > We do have ability to spawn control threads, user can spawn his own thread,
> > all these
> > threads can have limited life-time.
> > Not to mention about  rte_thread_register()/rte_thread_unregister().
> >
> 
> I agree that normal thread termination should be supported.
> 
> > > > As I understand address of its RTE_PER_LCORE(_event_group) will still
> > remain
> > > > in rte_pmu.event_group_list,
> > > > even if it is probably not valid any more.
> > >
> > > There should be a "destructor/done/finish" function available to remove
> this
> > from the list.
> > >
> > > [...]
> > >
> > > > > >Even if we'd decide to keep rte_pmu_read() as static inline (still
> not
> > > > > >sure it is a good idea),
> > > > >
> > > > > We want to save as much cpu cycles as we possibly can and inlining
> does
> > > > helps
> > > > > in that matter.
> > > >
> > > > Ok, so asking same question from different thread: how many cycles it
> will
> > > > save?
> > > > What is the difference in terms of performance when you have this
> function
> > > > inlined vs not inlined?
> > >
> > > We expect to use this in our in-house profiler library. For this reason, I
> > have a very strong preference for absolute maximum
> > > performance.
> > >
> > > Reading PMU events is for performance profiling, so I expect other
> potential
> > users of the PMU library to share my opinion on this.
> >
> > Well, from my perspective 14 cycles are not that much...
> 
> For reference, the i40e testpmd per-core performance report shows that it uses
> 36 cycles per packet.
> 
> This is a total of 1152 cycles per burst of 32 packets. 14 cycles overhead per
> burst / 1152 cycles per burst = 1.2 % overhead.
> 
> But that is not all: If the application's pipeline has three stages, where the
> PMU counters are read for each stage, the per-invocation overhead of 14 cycles
> adds up, and the overhead per burst is now 3 * 14 / 1152 = 3.6 %.

I was too fast on the keyboard here... If the application does more work than 
testpmd, it certainly also uses more than 1152 cycles to do that work. So 
please ignore the 3.6 % as a wild exaggeration from an invalid example, and 
just stick with the 1.2 % overhead - which I still consider significant, and 
thus worth avoiding.

> 
> Generalizing...
> 
> In my example here, the same function with 14 wasted cycles is called three
> times. It might as well be three individual libraries each wasting 14 cycles
> in its individual fast path processing function, due to a similarly relaxed
> attitude regarding wasting 14 cycles.
> 
> My point is:
> 
> Real applications do much more work than testpmd, so all this "insignificant"
> extra overhead in the libraries adds up!
> 
> Generally, I would like the DPDK Project to remain loyal to its original
> philosophy, where performance is considered a Key Performance Indicator, and
> overhead in the fast path is kept at an absolute minimum.
> 
> > Though yes, it would be good to hear more opinions here.

Reply via email to