Resending to include Thomas, also with some changes. Sorry for the spam.

Based on Thomas and Peterz feedback, I can think of two design variants
which target:
- Support monitoring and allocating using the same resctrl group. The
  user can use a resctrl group to allocate resources and also monitor
  them (with respect to tasks or cpus).
- Also allow monitoring outside of resctrl, so that the user can monitor
  subgroups which use the same closid. This mode can be used when the
  user wants to monitor more than just the resctrl groups.

The first design version uses and modifies perf_cgroup; the second
version builds a new interface, resmon. The first version is close to the
patches already sent, with some additions/changes. This mail includes
details of the design as per Thomas/Peterz feedback.

1> First Design option: without modifying the resctrl and using perf
--------------------------------------------------------------------
--------------------------------------------------------------------

In this design everything in the resctrl interface works like before (the
info directory and the resource group files like tasks and schemata all
remain the same).

Monitor cqm using perf
----------------------

perf can monitor individual tasks using the -t option just like before.

# perf stat -e llc_occupancy -t PID1,PID2

The user can monitor the cpu occupancy using the -C option in perf:

# perf stat -e llc_occupancy -C 5

Below shows how the user can monitor cgroup occupancy:

# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# mkdir /sys/fs/cgroup/perf_event/g2
# echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks

# perf stat -e intel_cqm/llc_occupancy/ -a -G g2

To monitor a resctrl group, the user can place the tasks of the resctrl
group into a cgroup. To monitor the tasks in p1 in example 2 below, add
the tasks in resctrl group p1 to cgroup g1:

# echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks

Introducing a new option for resctrl may complicate monitoring, because
supporting both cgroup 'task groups' and resctrl 'task groups' leads to
situations where, if the groups intersect, there is no way to know which
L3 allocations contribute to which group.
ex:
p1 has tasks t1, t2, t3
g1 has tasks t2, t3, t4

The only way to get occupancy for both g1 and p1 would be to allocate an
RMID for each task, which can just as well be done with the -t option.

Monitoring cqm cgroups Implementation
-------------------------------------

When monitoring two different cgroups in the same hierarchy (say g11 has
an ancestor g1, and both are being monitored as shown below), we need the
g11 counts to be considered for g1 as well.

# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# mkdir /sys/fs/cgroup/perf_event/g1/g11

When measuring llc_occupancy for g1, we cannot write two different RMIDs
during context switch to measure the occupancy for both g1 and g11
(because g1 needs to count g11 as well). Hence the driver maintains this
information and, during ctx switch, writes the RMID of the lowest member
in the ancestry which is being monitored.

The cqm_info is added to the perf_cgroup structure to maintain this
information. The structure is allocated and destroyed at css_alloc and
css_free. All the events tied to a cgroup can use the same information
while reading the counts.

    struct perf_cgroup {
    #ifdef CONFIG_INTEL_RDT_M
            void *cqm_info;
    #endif
            ...
    };

    struct cqm_info {
            bool mon_enabled;
            int level;
            u32 *rmid;
            struct cgrp_cqm_info *mfa;
            struct list_head tskmon_rlist;
    };

Due to the hierarchical nature of cgroups, every cgroup just monitors for
the 'nearest monitored ancestor' (mfa) at all times. Since the root
cgroup is always monitored, all descendants at boot time monitor for root
and hence every mfa points to root, except for root->mfa which is NULL.

1. RMID setup: When cgroup x starts monitoring:
   for each descendant y, if y's mfa->level < x->level, then
   y->mfa = x. (Where level of root node = 0...)
2. sched_in: During sched_in for x
   if (x->mon_enabled) choose x->rmid
   else choose x->mfa->rmid.
3. read: for each descendant y of cgroup x (x included)
   if (y->mon_enabled) count += rmid_read(y->rmid).
4. evt_destroy: for each descendant y of x, if (y->mfa == x) then
   y->mfa = x->mfa. Meaning if any descendant was monitoring for x,
   set that descendant to monitor for the cgroup which x was
   monitoring for.

To monitor a task in a cgroup x along with monitoring cgroup x itself,
cqm_info maintains a list of the tasks that are being monitored in the
cgroup.

When a task which belongs to a cgroup x is being monitored, it always
uses its own task->rmid during sched_in, even if cgroup x is monitored.
To account for the counts of such tasks, the cgroup keeps this list and
parses it during read. tskmon_rlist is used to maintain the list. The
list is modified when a task is attached to the cgroup or removed from
the group.

Example 1 (Some examples modeled from resctrl ui documentation)
---------

A single socket system which has real-time tasks running on cores 4-7 and
a non real-time workload assigned to cores 0-3. The real-time tasks share
text and data, so a per task association is not required, and due to
interaction with the kernel it's desired that the kernel on these cores
shares L3 with the tasks.

# cd /sys/fs/resctrl
# mkdir p0
# echo "L3:0=3ff" > p0/schemata

Cores 0-1 are assigned to the new group, making sure that the kernel and
the tasks running there get 50% of the cache.

# echo 03 > p0/cpus

Monitor the cpus 0-1:

# perf stat -e llc_occupancy -C 0-1

Example 2
---------

A real time task running on cpus 2-3 (socket 0) is allocated a dedicated
25% of the cache.

# cd /sys/fs/resctrl

# mkdir p1
# echo "L3:0=0f00;1=ffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2-3 5678

To monitor the same group of tasks, create a cgroup g1:

# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# perf stat -e llc_occupancy -a -G g1

Example 3
---------

Sometimes the user may just want to profile the cache occupancy first
before assigning any CLOSids.
Also this provides an override option where the user can monitor some
tasks which have, say, CLOS 0 and which he is about to place in a CLOSid
based on the amount of cache occupancy. This could apply to the same real
time tasks above, where the user is calibrating the % of cache that's
needed.

# perf stat -e llc_occupancy -t PIDx,PIDy

RMID allocation
---------------

RMIDs are allocated per package to achieve better scaling of RMIDs.
RMIDs are plenty (2-4 per logical processor) and are also per package,
meaning a two socket system would have twice the number of RMIDs.
If we still run out of RMIDs, an error is reported saying that monitoring
wasn't possible because no RMID was available.

Kernel Scheduling
-----------------

During ctx switch cqm chooses the RMID in the following priority:

1. if the cpu has an RMID, choose that
2. if the task has an RMID directly tied to it, choose that (task is
   monitored)
3. choose the RMID of the task's cgroup (by default tasks belong to the
   root cgroup with RMID 0)

Read
----

When the user calls cqm to retrieve the monitored count, we read the
counter_msr and return the count. For a cgroup hierarchy, the count is
measured as explained in the cgroup implementation section, by traversing
the cgroup hierarchy.

2> Second Design option: Build a new usermode tool resmon
---------------------------------------------------------
---------------------------------------------------------

In this design everything in the resctrl interface works like before (the
info directory and the resource group files like tasks and schemata all
remain the same).

This version supports monitoring resctrl groups directly, but we need a
user interface for the user to read the counters. We could create one
file to enable monitoring and one file in the resctrl directory which
reflects the counts, but that may not be efficient since the user often
reads the counts frequently.
Build a new user mode interface resmon
--------------------------------------

Since modifying the existing perf to suit the different h/w architecture
does not seem to follow the CAT interface model, it may well be better to
have a different and dedicated interface for RDT monitoring (just like
we had a new fs for CAT).

resmon supports monitoring a resctrl group or a task. The two modes may
provide enough granularity needed for monitoring:
- can monitor cpu data.
- can monitor per resctrl group data.
- can choose a custom set or a subset of tasks within a resctrl group and
  monitor them.

# resmon [<options>]
-r <resctrl group>
-t <PID>
-s <mon_mask>
-I <time in ms>

"resctrl group": the resctrl directory.

"mon_mask": a bit mask of logical packages which indicates which
packages the user is interested in monitoring.

"time in ms": the time for which the monitoring takes place
(this can potentially be changed to start and stop/read options).

Example 1 (Some examples modeled from resctrl ui documentation)
---------

A single socket system which has real-time tasks running on cores 4-7 and
a non real-time workload assigned to cores 0-3. The real-time tasks share
text and data, so a per task association is not required, and due to
interaction with the kernel it's desired that the kernel on these cores
shares L3 with the tasks.

# cd /sys/fs/resctrl
# mkdir p0
# echo "L3:0=3ff" > p0/schemata

Cores 0-1 are assigned to the new group, making sure that the kernel and
the tasks running there get 50% of the cache.

# echo 03 > p0/cpus

Monitor the cpus 0-1 for 10s:

# resmon -r p0 -s 1 -I 10000

Example 2
---------

A real time task running on cpus 2-3 (socket 0) is allocated a dedicated
25% of the cache.

# cd /sys/fs/resctrl

# mkdir p1
# echo "L3:0=0f00;1=ffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2-3 5678

Monitor the task for 5s on socket zero:

# resmon -r p1 -s 1 -I 5000

Example 3
---------

Sometimes the user may just want to profile the cache occupancy first
before assigning any CLOSids.
Also this provides an override option where the user can monitor some
tasks which have, say, CLOS 0 and which he is about to place in a CLOSid
based on the amount of cache occupancy. This could apply to the same real
time tasks above, where the user is calibrating the % of cache that's
needed.

# resmon -t PIDx,PIDy -s 1 -I 10000

This returns the sum of the counts of PIDx and PIDy.

RMID Allocation
---------------

This would remain the same as in design version 1, where we support per
package RMIDs and report an error when we run out of RMIDs, since the
number of RMIDs is limited by the h/w.
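
To make the per package RMID allocation a bit more concrete, here is a
rough userspace sketch in C. It is only an illustration and is not taken
from the patches: the pool sizes, struct pkg_rmid_pool and the helpers
rmid_pools_init/rmid_alloc/rmid_free are made-up names, and using
-ENOSPC as the "out of RMIDs" error is an assumption. RMID 0 is kept
reserved in every package because the root cgroup uses RMID 0 by default.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative sizes only; the real limits are reported by the h/w. */
#define NR_PKGS       2
#define RMIDS_PER_PKG 4

/* Hypothetical per package pool of RMIDs (not a kernel structure). */
struct pkg_rmid_pool {
	bool in_use[RMIDS_PER_PKG];
};

static struct pkg_rmid_pool pools[NR_PKGS];

/* Mark every RMID free, except RMID 0 which stays reserved for root. */
void rmid_pools_init(void)
{
	for (int p = 0; p < NR_PKGS; p++) {
		for (int r = 0; r < RMIDS_PER_PKG; r++)
			pools[p].in_use[r] = false;
		pools[p].in_use[0] = true;
	}
}

/*
 * Allocate the lowest free RMID on package 'pkg'.
 * Returns the RMID (>= 1), or -ENOSPC (assumed error code) when the
 * package has no free RMID left.
 */
int rmid_alloc(int pkg)
{
	assert(pkg >= 0 && pkg < NR_PKGS);

	for (int r = 1; r < RMIDS_PER_PKG; r++) {
		if (!pools[pkg].in_use[r]) {
			pools[pkg].in_use[r] = true;
			return r;
		}
	}
	return -ENOSPC;
}

/* Return an RMID to its package pool. RMID 0 can never be freed. */
void rmid_free(int pkg, int rmid)
{
	assert(pkg >= 0 && pkg < NR_PKGS);
	assert(rmid > 0 && rmid < RMIDS_PER_PKG);

	pools[pkg].in_use[rmid] = false;
}
```

Since each package has its own pool, running out of RMIDs on one package
does not prevent monitoring on another, which is the scaling benefit of
per package RMIDs mentioned above.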