> From: Tomasz Duszynski [mailto:tduszyn...@marvell.com]
> Sent: Wednesday, 11 January 2023 17.21
>
> >From: Morten Brørup <m...@smartsharesystems.com>
> >Sent: Wednesday, January 11, 2023 10:06 AM
> >
> >> From: Tomasz Duszynski [mailto:tduszyn...@marvell.com]
> >> Sent: Wednesday, 11 January 2023 00.47
> >>
> >> Add support for programming PMU counters and reading their values
> >> at runtime, bypassing the kernel completely.
> >>
> >> This is especially useful in cases where CPU cores are isolated
> >> (nohz_full), i.e. run dedicated tasks. In such cases one cannot use
> >> the standard perf utility without sacrificing latency and
> >> performance.
> >>
> >> Signed-off-by: Tomasz Duszynski <tduszyn...@marvell.com>
> >> ---
> >
> >[...]
> >
> >> +static int
> >> +do_perf_event_open(uint64_t config[3], unsigned int lcore_id, int group_fd)
> >> +{
> >> +	struct perf_event_attr attr = {
> >> +		.size = sizeof(struct perf_event_attr),
> >> +		.type = PERF_TYPE_RAW,
> >> +		.exclude_kernel = 1,
> >> +		.exclude_hv = 1,
> >> +		.disabled = 1,
> >> +	};
> >> +
> >> +	pmu_arch_fixup_config(config);
> >> +
> >> +	attr.config = config[0];
> >> +	attr.config1 = config[1];
> >> +	attr.config2 = config[2];
> >> +
> >> +	return syscall(SYS_perf_event_open, &attr, 0, rte_lcore_to_cpu_id(lcore_id), group_fd, 0);
> >> +}
> >
> >If SYS_perf_event_open() must be called from the worker thread itself,
> >then lcore_id must not be passed as a parameter to
> >do_perf_event_open(). Otherwise, I would expect to be able to call
> >do_perf_event_open() from the main thread and pass any lcore_id of a
> >worker thread.
> >This comment applies to all functions that must be called from the
> >worker thread itself. It also applies to the functions that call such
> >functions.
> >
>
> Lcore_id is being passed around so that we don't need to call
> rte_lcore_id() each and every time.

Please take a look at the rte_lcore_id() implementation. :-)

Regardless, my argument still stands: if a function cannot be called
with the lcore_id parameter set to any valid lcore id, it should not be
a parameter to the function.
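For reference, rte_lcore_id() boils down to a read of a per-thread
(TLS) variable, roughly like the sketch below (simplified; the real EAL
code spells this via RTE_PER_LCORE()):

/* Simplified sketch of rte_lcore_id(): a single thread-local load,
 * typically one TLS-relative mov instruction, so caching the result
 * in a parameter buys next to nothing. */
static __thread unsigned int per_lcore__lcore_id;

static inline unsigned int
rte_lcore_id(void)
{
	return per_lcore__lcore_id;
}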
> >
> >[...]
> >
> >> +/**
> >> + * A structure describing a group of events.
> >> + */
> >> +struct rte_pmu_event_group {
> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> >> +	bool enabled; /**< true if group was enabled on particular lcore */
> >> +};
> >> +
> >> +/**
> >> + * A structure describing an event.
> >> + */
> >> +struct rte_pmu_event {
> >> +	char *name; /** name of an event */
> >> +	unsigned int index; /** event index into fds/mmap_pages */
> >> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> >> +};
> >
> >Move the "enabled" field up, making it the first field in this
> >structure. This might reduce the number of instructions required to
> >check (!group->enabled) in rte_pmu_read().
> >
>
> This will be called once, and no, this will not produce more
> instructions. Why should it?

It seems I was not describing my intention clearly. rte_pmu_read() is a
hot function, where the comparison "if (!group->enabled)" itself will
be executed many times.

> In both cases the compiler will need to load data at some offset, and
> archs do have instructions for that.

Yes, the instructions are:

  address = BASE + sizeof(struct rte_pmu_event_group) * lcore_id +
            offsetof(struct rte_pmu_event_group, enabled);

I meant you could avoid the extra instructions stemming from the
addition: "+ offsetof()". But you are right... Both BASE and
offsetof(struct rte_pmu_event_group, enabled) are known in advance, and
can be merged at compile time to avoid the addition.

> >Also, each instance of the structure is used individually per lcore,
> >so the structure should be cache line aligned to avoid unnecessarily
> >crossing cache lines.
> >
> >I.e.:
> >
> >struct rte_pmu_event_group {
> >	bool enabled; /**< true if group was enabled on particular lcore */
> >	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> >} __rte_cache_aligned;
>
> Yes, this can be aligned. While at it, I'd be more inclined to move
> mmap_pages up instead of enabled.

Yes, moving up mmap_pages is better.
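To make it concrete, the layout we seem to agree on would then be
something like this (a sketch of the reordering only, not the final
patch):

struct rte_pmu_event_group {
	/* Hottest field first: dereferenced on every rte_pmu_read(). */
	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
	bool enabled; /**< true if group was enabled on particular lcore */
} __rte_cache_aligned;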
>
> >
> >> +
> >> +/**
> >> + * A PMU state container.
> >> + */
> >> +struct rte_pmu {
> >> +	char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
> >> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> >> +	unsigned int num_group_events; /**< number of events in a group */
> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> >> +};
> >> +
> >> +/** Pointer to the PMU state container */
> >> +extern struct rte_pmu rte_pmu;
> >
> >Just "The PMU state container". It is not a pointer anymore. :-)
> >
>
> Good catch.
>
> >[...]
> >
> >> +/**
> >> + * @internal
> >> + *
> >> + * Read PMU counter.
> >> + *
> >> + * @param pc
> >> + *   Pointer to the mmapped user page.
> >> + * @return
> >> + *   Counter value read from hardware.
> >> + */
> >> +__rte_internal
> >> +static __rte_always_inline uint64_t
> >> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> >> +{
> >> +	uint64_t width, offset;
> >> +	uint32_t seq, index;
> >> +	int64_t pmc;
> >> +
> >> +	for (;;) {
> >> +		seq = pc->lock;
> >> +		rte_compiler_barrier();
> >> +		index = pc->index;
> >> +		offset = pc->offset;
> >> +		width = pc->pmc_width;
> >> +
> >
> >Please add a comment here about the special meaning of index == 0.
>
> Okay.
>
> >
> >> +		if (likely(pc->cap_user_rdpmc && index)) {
> >> +			pmc = rte_pmu_pmc_read(index - 1);
> >> +			pmc <<= 64 - width;
> >> +			pmc >>= 64 - width;
> >> +			offset += pmc;
> >> +		}
> >> +
> >> +		rte_compiler_barrier();
> >> +
> >> +		if (likely(pc->lock == seq))
> >> +			return offset;
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >
> >[...]
> >
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Read hardware counter configured to count occurrences of an event.
> >> + *
> >> + * @param index
> >> + *   Index of an event to be read.
> >> + * @return
> >> + *   Event value read from register. In case of errors or lack of support
> >> + *   0 is returned. In other words, stream of zeros in a trace file
> >> + *   indicates problem with reading particular PMU event register.
> >> + */
> >> +__rte_experimental
> >> +static __rte_always_inline uint64_t
> >> +rte_pmu_read(unsigned int index)
> >> +{
> >> +	struct rte_pmu_event_group *group;
> >> +	int ret, lcore_id = rte_lcore_id();
> >> +
> >> +	group = &rte_pmu.group[lcore_id];
> >> +	if (unlikely(!group->enabled)) {
> >> +		ret = rte_pmu_enable_group(lcore_id);
> >> +		if (ret)
> >> +			return 0;
> >> +
> >> +		group->enabled = true;
> >
> >Group->enabled should be set inside rte_pmu_enable_group(), not here.
> >
>
> This is easier to follow imo and not against coding guidelines so I
> prefer to leave it as is.

OK. It makes the rte_pmu_read() source code slightly shorter, but
probably has zero effect on the generated code. No strong preference -
feel free to follow your personal preference on this.

> >
> >> +	}
> >> +
> >> +	if (unlikely(index >= rte_pmu.num_group_events))
> >> +		return 0;
> >> +
> >> +	return rte_pmu_read_userpage(group->mmap_pages[index]);
> >> +}
> >
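PS: Regarding the comment about the special meaning of index == 0,
something along these lines in rte_pmu_read_userpage() would do (a
sketch; the exact wording is of course up to you):

		/*
		 * In the perf user page, pc->index is the hardware
		 * counter number plus one; index == 0 means the event
		 * is not currently scheduled on a counter, so rdpmc
		 * cannot be used and only the kernel-maintained offset
		 * is returned.
		 */
		if (likely(pc->cap_user_rdpmc && index)) {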