> > > >> Add support for programming PMU counters and reading their values
> > > >> at runtime, bypassing the kernel completely.
> > > >>
> > > >> This is especially useful in cases where CPU cores are isolated,
> > > >> i.e. run dedicated tasks. In such cases one cannot use the standard
> > > >> perf utility without sacrificing latency and performance.
> > > >>
> > > >> Signed-off-by: Tomasz Duszynski <tduszyn...@marvell.com>
> > > >> Acked-by: Morten Brørup <m...@smartsharesystems.com>
> > >
> > > [...]
> > >
> > > >> +int
> > > >> +__rte_pmu_enable_group(void)
> > > >> +{
> > > >> +        struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> > > >> +        int ret;
> > > >> +
> > > >> +        if (rte_pmu.num_group_events == 0)
> > > >> +                return -ENODEV;
> > > >> +
> > > >> +        ret = open_events(group);
> > > >> +        if (ret)
> > > >> +                goto out;
> > > >> +
> > > >> +        ret = mmap_events(group);
> > > >> +        if (ret)
> > > >> +                goto out;
> > > >> +
> > > >> +        if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
> > > >> +                ret = -errno;
> > > >> +                goto out;
> > > >> +        }
> > > >> +
> > > >> +        if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> > > >> +                ret = -errno;
> > > >> +                goto out;
> > > >> +        }
> > > >> +
> > > >> +        rte_spinlock_lock(&rte_pmu.lock);
> > > >> +        TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
> > > >
> > > > Hmm.. so we insert a pointer to a TLS variable into the global list?
> > > > Wonder what would happen if that thread gets terminated?
> > >
> > > Nothing special. Any pointers to that thread-local in that thread are
> > > invalidated.
> > >
> > > > Can memory from its TLS block get re-used (by another thread or for
> > > > other purposes)?
> > >
> > > Why would any other thread reuse that?
> > > Eventually the main thread will need that data to do the cleanup.
> >
> > I understand that the main thread would need to access that data.
> > I am not sure that it would be able to.
> > Imagine a thread calls rte_pmu_read(...) and then terminates, while the
> > program continues to run.
>
> Is the example you describe here (i.e. a thread terminating in the
> middle of doing something) really a scenario DPDK is supposed to
> support?
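To make the missing piece concrete before answering: the kind of
per-thread teardown this would need looks roughly like the sketch below.
This is illustrative only; no such function exists in the patch, and the
mmap_pages[] field is my guess at the group layout implied by
mmap_events().

void
__rte_pmu_disable_group(void)
{
        struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
        unsigned int i;

        /*
         * Unlink the TLS node from the global list first, so no other
         * thread can observe it once this thread's TLS block is gone.
         */
        rte_spinlock_lock(&rte_pmu.lock);
        TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
        rte_spinlock_unlock(&rte_pmu.lock);

        /* Undo mmap_events() and open_events(). */
        for (i = 0; i < rte_pmu.num_group_events; i++) {
                /* mmap_pages[] and the mapping size are guesses */
                munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
                close(group->fds[i]);
        }
}

Whether something like that gets called from rte_thread_unregister(), a
pthread TLS destructor, or explicitly by the application is a separate
question; without it, the list node of a terminated thread dangles.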
I am not talking about some abnormal termination. We do have the ability
to spawn control threads, the user can spawn his own threads, and all of
these threads can have a limited lifetime. Not to mention
rte_thread_register()/rte_thread_unregister().

> > As I understand it, the address of its RTE_PER_LCORE(_event_group)
> > will still remain in rte_pmu.event_group_list, even though it is
> > probably not valid any more.
>
> There should be a "destructor/done/finish" function available to remove
> this from the list.
>
> [...]
>
> > > > Even if we'd decide to keep rte_pmu_read() as static inline (still
> > > > not sure it is a good idea),
> > >
> > > We want to save as many CPU cycles as we possibly can, and inlining
> > > does help in that matter.
> >
> > Ok, so asking the same question from a different thread: how many
> > cycles will it save?
> > What is the difference in terms of performance when you have this
> > function inlined vs not inlined?
>
> We expect to use this in our in-house profiler library. For this
> reason, I have a very strong preference for absolute maximum
> performance.
>
> Reading PMU events is for performance profiling, so I expect other
> potential users of the PMU library to share my opinion on this.

Well, from my perspective 14 cycles is not that much...
Though yes, it would be good to hear more opinions here.
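For context, the fast path we keep going back and forth about is
essentially the user-space read sequence documented in perf_event_open(2).
A sketch of the x86 flavour (names are mine; this is not the patch code,
which presumably hides the rdpmc part behind a per-arch helper):

#include <linux/perf_event.h>
#include <stdint.h>

#define barrier() asm volatile("" ::: "memory")

/* pc points at the perf_event_mmap_page mmap()ed for the event. */
static inline uint64_t
pmu_read_sketch(const volatile struct perf_event_mmap_page *pc)
{
        uint64_t count, raw;
        uint32_t seq, index, width;
        int64_t pmc;

        do {
                /* seqlock: retry if the kernel updated the page meanwhile */
                seq = pc->lock;
                barrier();
                index = pc->index;
                count = pc->offset;
                if (pc->cap_user_rdpmc && index != 0) {
                        width = pc->pmc_width;
                        raw = __builtin_ia32_rdpmc(index - 1);
                        /* sign-extend from the hardware counter width */
                        pmc = (int64_t)(raw << (64 - width)) >> (64 - width);
                        count += (uint64_t)pmc;
                }
                barrier();
        } while (pc->lock != seq);

        return count;
}

The whole thing is a seqlock retry loop around a single rdpmc, i.e. a
handful of instructions. That is why a call/ret plus the extra register
pressure looks big in relative terms, even though the absolute saving is
only those ~14 cycles.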