Add support for programming PMU counters and reading their values
in
runtime bypassing kernel completely.
This is especially useful in cases where CPU cores are isolated i.e
run dedicated tasks. In such cases one cannot use standard perf
utility without sacrificing latency and performance.
Signed-off-by: Tomasz Duszynski <tduszyn...@marvell.com>
Acked-by: Morten Brørup <m...@smartsharesystems.com>
[...]
+int
+__rte_pmu_enable_group(void)
+{
+ struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+ int ret;
+
+ if (rte_pmu.num_group_events == 0)
+ return -ENODEV;
+
+ ret = open_events(group);
+ if (ret)
+ goto out;
+
+ ret = mmap_events(group);
+ if (ret)
+ goto out;
+
+ if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET,
PERF_IOC_FLAG_GROUP) == -
1) {
+ ret = -errno;
+ goto out;
+ }
+
+ if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE,
PERF_IOC_FLAG_GROUP) ==
-1) {
+ ret = -errno;
+ goto out;
+ }
+
+ rte_spinlock_lock(&rte_pmu.lock);
+ TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
Hmm.. so we insert pointer to TLS variable into the global list?
Wonder what would happen if that thread get terminated?
Nothing special. Any pointers to that thread-local in that thread are
invalided.
Can memory from its TLS block get re-used (by other thread or for
other
purposes)?
Why would any other thread reuse that?
Eventually main thread will need that data to do the cleanup.
I understand that main thread would need to access that data.
I am not sure that it would be able to.
Imagine thread calls rte_pmu_read(...) and then terminates, while
program
continues to run.
Is the example you describe here (i.e. a thread terminating in the middle
of
doing something) really a scenario DPDK is supposed to
support?
I am not talking about some abnormal termination.
Then I misunderstood your example; I thought you meant the tread was
terminated while inside the rte_pmu_read() function.
We do have ability to spawn control threads, user can spawn his own thread,
all these
threads can have limited life-time.
Not to mention about rte_thread_register()/rte_thread_unregister().
I agree that normal thread termination should be supported.
As I understand address of its RTE_PER_LCORE(_event_group) will still
remain
in rte_pmu.event_group_list,
even if it is probably not valid any more.
There should be a "destructor/done/finish" function available to remove
this
from the list.
[...]
Even if we'd decide to keep rte_pmu_read() as static inline (still
not
sure it is a good idea),
We want to save as much cpu cycles as we possibly can and inlining
does
helps
in that matter.
Ok, so asking same question from different thread: how many cycles it
will
save?
What is the difference in terms of performance when you have this
function
inlined vs not inlined?
We expect to use this in our in-house profiler library. For this reason, I
have a very strong preference for absolute maximum
performance.
Reading PMU events is for performance profiling, so I expect other
potential
users of the PMU library to share my opinion on this.
Well, from my perspective 14 cycles are not that much...
For reference, the i40e testpmd per-core performance report shows that it uses
36 cycles per packet.
This is a total of 1152 cycles per burst of 32 packets. 14 cycles overhead per
burst / 1152 cycles per burst = 1.2 % overhead.
But that is not all: If the application's pipeline has three stages, where the
PMU counters are read for each stage, the per-invocation overhead of 14 cycles
adds up, and the overhead per burst is now 3 * 14 / 1152 = 3.6 %.
I was too fast on the keyboard here... If the application does more work than
testpmd, it certainly also uses more than 1152 cycles to do that work. So
please ignore the 3.6 % as a wild exaggeration from an invalid example, and
just stick with the 1.2 % overhead - which I still consider significant, and
thus worth avoiding.
Wonder can we do both - hide struct rte_pmu_event_group from public API
and have inline function to read pmu stats?
if we can add a separate function that will allow user to get
struct perf_event_mmap_page * for given event index (or event name),
then later user can use
__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
directly.
Generalizing...
In my example here, the same function with 14 wasted cycles is called three
times. It might as well be three individual libraries each wasting 14 cycles
in its individual fast path processing function, due to a similarly relaxed
attitude regarding wasting 14 cycles.
My point is:
Real applications do much more work than testpmd, so all this "insignificant"
extra overhead in the libraries adds up!
Generally, I would like the DPDK Project to remain loyal to its original
philosophy, where performance is considered a Key Performance Indicator, and
overhead in the fast path is kept at an absolute minimum.
Though yes, it would be good to hear more opinions here.