On Thu, Jan 19, 2017 at 6:32 PM, Vikas Shivappa
<vikas.shiva...@linux.intel.com> wrote:
>
> Resending including Thomas, also with some changes. Sorry for the spam.
>
> Based on Thomas and Peterz's feedback, I can think of two design
> variants which target:
>
> -Support monitoring and allocating using the same resctrl group:
> the user can use a resctrl group to allocate resources and also monitor
> them (with respect to tasks or cpus).
>
> -Also allow monitoring outside of resctrl so that the user can
> monitor subgroups which use the same closid. This mode can be used
> when the user wants to monitor more than just the resctrl groups.
>
> The first design version uses and modifies perf_cgroup; the second version
> builds a new interface, resmon. The first version is close to the patches
> already sent, with some additions/changes. This includes details of the
> design as per the Thomas/Peterz feedback.
>
> 1> First Design option: without modifying resctrl, using perf
> --------------------------------------------------------------------
> --------------------------------------------------------------------
>
> In this design everything in the resctrl interface works like
> before (the info and resource group files like tasks and schemata all
> remain the same).
>
>
> Monitor cqm using perf
> ----------------------
>
> perf can monitor individual tasks using the -t
> option just like before:
>
> # perf stat -e llc_occupancy -t PID1,PID2
>
> The user can monitor cpu occupancy using the -C option in perf:
>
> # perf stat -e llc_occupancy -C 5
>
> Below shows how the user can monitor cgroup occupancy:
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g2
> # echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks
>
> # perf stat -e intel_cqm/llc_occupancy/ -a -G g2

Presented this way, this does not quite address the use case I described
earlier. We want to be able to monitor the cgroup allocations from the
first thread creation. What you have above has a large gap. Many apps do
allocations as their very first steps, so if you do:

$ my_test_prg &
[1456]
$ echo 1456 >/sys/fs/cgroup/perf_event/g2/tasks
$ perf stat -e intel_cqm/llc_occupancy/ -a -G g2
You have a race. But if you allow:

$ perf stat -e intel_cqm/llc_occupancy/ -a -G g2
  (i.e., on an empty cgroup)
$ echo $$ >/sys/fs/cgroup/perf_event/g2/tasks
  (put the shell in the cgroup, so my_test_prg runs immediately in the cgroup)
$ my_test_prg &

Then there is a way to avoid the gap.

> To monitor a resctrl group, the user can group the same tasks from the
> resctrl group into a cgroup.
>
> To monitor the tasks in p1 in example 2 below, add the tasks in resctrl
> group p1 to cgroup g1:
>
> # echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
>
> Introducing a new option for resctrl may complicate monitoring because
> supporting cgroup 'task groups' and resctrl 'task groups' leads to
> situations where, if the groups intersect, there is no way to know which
> L3 allocations contribute to which group.
>
> ex:
> p1 has tasks t1, t2, t3
> g1 has tasks t2, t3, t4
>
> The only way to get occupancy for both g1 and p1 would be to allocate an
> RMID for each task, which can as well be done with the -t option.
>
> Monitoring cqm cgroups Implementation
> -------------------------------------
>
> When monitoring two different cgroups in the same hierarchy (say g11
> has an ancestor g1 and both are being monitored as shown below), we
> need the g11 counts to be considered for g1 as well.
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g1/g11
>
> When measuring llc_occupancy for g1 we cannot write two different RMIDs
> during context switch (because we need to count for g11 as well) to
> measure the occupancy of both g1 and g11.
> Hence the driver maintains this information and, during context switch,
> writes the RMID of the lowest monitored member in the ancestry.
>
> The cqm_info is added to the perf_cgroup structure to maintain this
> information. The structure is allocated and destroyed at css_alloc and
> css_free. All the events tied to a cgroup can use the same
> information while reading the counts.
>
> struct perf_cgroup {
> #ifdef CONFIG_INTEL_RDT_M
>         void *cqm_info;
> #endif
> ...
> };
>
> struct cqm_info {
>         bool mon_enabled;
>         int level;
>         u32 *rmid;
>         struct cgrp_cqm_info *mfa;
>         struct list_head tskmon_rlist;
> };
>
> Due to the hierarchical nature of cgroups, every cgroup just
> monitors for the 'nearest monitored ancestor' (mfa) at all times.
> Since the root cgroup is always monitored, all descendants
> at boot time monitor for root, and hence every mfa points to root
> except for root->mfa, which is NULL. The rules are as follows (a rough
> code sketch is given at the end of this section):
>
> 1. RMID setup: when cgroup x starts monitoring:
>    for each descendant y, if y->mfa->level < x->level, then
>    y->mfa = x (where the level of the root node is 0).
> 2. sched_in: during sched_in for x,
>    if (x->mon_enabled) choose x->rmid,
>    else choose x->mfa->rmid.
> 3. read: for each descendant y of cgroup x,
>    if (y->monitored) count += rmid_read(y->rmid).
> 4. evt_destroy: for each descendant y of x, if (y->mfa == x),
>    then y->mfa = x->mfa. Meaning, if any descendant was monitoring for
>    x, set that descendant to monitor for the cgroup which x was
>    monitoring for.
>
> To monitor a task in a cgroup x along with monitoring cgroup x itself,
> cqm_info maintains a list of tasks that are being monitored in the
> cgroup.
>
> When a task which belongs to a cgroup x is being monitored, it
> always uses its own task->rmid even if cgroup x is monitored during
> sched_in. To account for the counts of such tasks, the cgroup keeps this
> list and parses it during read.
> tskmon_rlist is used to maintain the list. The list is modified when a
> task is attached to the cgroup or removed from it.
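>
> Roughly, rules 1 and 2 above could look like the minimal sketch below
> (pseudo-C; the helpers and the per-package rmid indexing are illustrative
> only, not the actual patch code):
>
>         /* Rule 1 sketch: cgroup x starts monitoring. */
>         static void cqm_start_mon(struct cqm_info *x)
>         {
>                 struct cqm_info *y;
>
>                 x->mon_enabled = true;
>                 /* for_each_descendant() stands in for the real cgroup walk */
>                 for_each_descendant(y, x) {
>                         if (y->mfa->level < x->level)
>                                 y->mfa = x;
>                 }
>         }
>
>         /* Rule 2 sketch: pick the RMID written to the MSR at sched_in. */
>         static u32 cqm_sched_in_rmid(struct cqm_info *x, int pkg)
>         {
>                 if (x->mon_enabled)
>                         return x->rmid[pkg];      /* x is monitored itself */
>                 return x->mfa->rmid[pkg];         /* nearest monitored ancestor */
>         }
>
> Since the root cgroup is always monitored, the second branch never
> dereferences a NULL mfa.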
>
> Example 1 (some examples modeled from the resctrl UI documentation)
> ---------
>
> A single socket system has real-time tasks running on cores 4-7 and a
> non real-time workload assigned to cores 0-3. The real-time tasks share
> text and data, so a per-task association is not required, and due to
> interaction with the kernel it's desired that the kernel on these cores
> shares L3 with the tasks.
>
> # cd /sys/fs/resctrl
> # mkdir p0
> # echo "L3:0=3ff" > p0/schemata
>
> Cores 0-1 are assigned to the new group to make sure that the
> kernel and the tasks running there get 50% of the cache:
>
> # echo 03 > p0/cpus
>
> Monitor cpus 0-1:
>
> # perf stat -e llc_occupancy -C 0-1
>
> Example 2
> ---------
>
> A real-time task running on cpus 2-3 (socket 0) is allocated a dedicated
> 25% of the cache.
>
> # cd /sys/fs/resctrl
>
> # mkdir p1
> # echo "L3:0=0f00;1=ffff" > p1/schemata
> # echo 5678 > p1/tasks
> # taskset -cp 2-3 5678
>
> To monitor the same group of tasks, create a cgroup g1:
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # perf stat -e llc_occupancy -a -G g1
>
> Example 3
> ---------
>
> Sometimes the user may just want to profile the cache occupancy first
> before assigning any CLOSids. This also provides an override option:
> the user can monitor some tasks which currently have, say, CLOS 0
> before placing them in a CLOSid based on the amount of cache occupancy.
> This could apply to the same real-time tasks above, where the user is
> calibrating the % of cache that's needed.
>
> # perf stat -e llc_occupancy -t PIDx,PIDy
>
> RMID allocation
> ---------------
>
> RMIDs are allocated per package to achieve better scaling of RMIDs.
> RMIDs are plentiful (2-4 per logical processor) and are also per package,
> meaning a two socket system has twice the number of RMIDs.
> If we still run out of RMIDs, an error is returned indicating that
> monitoring was not possible because no RMID was available.
>
> Kernel Scheduling
> -----------------
>
> During context switch cqm chooses the RMID in the following priority:
>
> 1. if the cpu has an RMID, choose that
> 2. if the task has an RMID directly tied to it, choose that (the task is
>    monitored)
> 3. choose the RMID of the task's cgroup (by default tasks belong to the
>    root cgroup with RMID 0)
>
> Read
> ----
>
> When the user calls cqm to retrieve the monitored count, we read the
> counter MSR and return the count. For a cgroup hierarchy, the count is
> measured as explained in the cgroup implementation section by traversing
> the cgroup hierarchy.
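>
> The hierarchical read described above could then be sketched as follows
> (again pseudo-C with illustrative names; tsk_rmid_entry is a hypothetical
> element of tskmon_rlist, not taken from the patches):
>
>         /* hypothetical per-task entry kept on cqm_info->tskmon_rlist */
>         struct tsk_rmid_entry {
>                 u32                     *rmid;  /* the task's per-package RMIDs */
>                 struct list_head        list;
>         };
>
>         /* Rule 3 sketch: occupancy of cgroup x on package pkg. */
>         static u64 cqm_read_cgrp(struct cqm_info *x, int pkg)
>         {
>                 struct tsk_rmid_entry *e;
>                 struct cqm_info *y;
>                 u64 count = rmid_read(x->rmid[pkg]);
>
>                 /* every monitored descendant contributes to x's count */
>                 for_each_descendant(y, x) {
>                         if (y->mon_enabled)
>                                 count += rmid_read(y->rmid[pkg]);
>                 }
>
>                 /* tasks monitored individually inside x use their own RMID,
>                  * so their occupancy is added back from the task list
>                  */
>                 list_for_each_entry(e, &x->tskmon_rlist, list)
>                         count += rmid_read(e->rmid[pkg]);
>
>                 return count;
>         }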
>
> 2> Second Design option: Build a new usermode tool resmon
> ---------------------------------------------------------
> ---------------------------------------------------------
>
> In this design everything in the resctrl interface works like
> before (the info and resource group files like tasks and schemata all
> remain the same).
>
> This version supports monitoring resctrl groups directly.
> But we need a user interface for the user to read the counters. We could
> create one file to set up monitoring and one file in the resctrl
> directory which reflects the counts, but that may not be efficient since
> users often read the counts frequently.
>
> Build a new user mode interface resmon
> --------------------------------------
>
> Since modifying the existing perf to suit the different h/w architecture
> seems not to follow the CAT interface model, it may well be better to
> have a different, dedicated interface for RDT monitoring (just like we
> had a new fs for CAT).
>
> resmon supports monitoring a resctrl group or a task. The two modes may
> provide enough granularity for monitoring:
> -can monitor cpu data.
> -can monitor per resctrl group data.
> -can choose a custom subset of tasks within a resctrl group and monitor them.
>
> # resmon [<options>]
> -r <resctrl group>
> -t <PID>
> -s <mon_mask>
> -I <time in ms>
>
> "resctrl group": the resctrl directory.
>
> "mon_mask": a bit mask of logical packages which indicates which
> packages the user is interested in monitoring.
>
> "time in ms": the time for which the monitoring takes place
> (this could potentially be changed to start and stop/read options).
>
> Example 1 (some examples modeled from the resctrl UI documentation)
> ---------
>
> A single socket system has real-time tasks running on cores 4-7 and a
> non real-time workload assigned to cores 0-3. The real-time tasks share
> text and data, so a per-task association is not required, and due to
> interaction with the kernel it's desired that the kernel on these cores
> shares L3 with the tasks.
>
> # cd /sys/fs/resctrl
> # mkdir p0
> # echo "L3:0=3ff" > p0/schemata
>
> Cores 0-1 are assigned to the new group to make sure that the
> kernel and the tasks running there get 50% of the cache:
>
> # echo 03 > p0/cpus
>
> Monitor cpus 0-1 for 10s:
>
> # resmon -r p0 -s 1 -I 10000
>
> Example 2
> ---------
>
> A real-time task running on cpus 2-3 (socket 0) is allocated a dedicated
> 25% of the cache.
>
> # cd /sys/fs/resctrl
>
> # mkdir p1
> # echo "L3:0=0f00;1=ffff" > p1/schemata
> # echo 5678 > p1/tasks
> # taskset -cp 2-3 5678
>
> Monitor the task for 5s on socket zero:
>
> # resmon -r p1 -s 1 -I 5000
>
> Example 3
> ---------
>
> Sometimes the user may just want to profile the cache occupancy first
> before assigning any CLOSids. This also provides an override option:
> the user can monitor some tasks which currently have, say, CLOS 0
> before placing them in a CLOSid based on the amount of cache occupancy.
> This could apply to the same real-time tasks above, where the user is
> calibrating the % of cache that's needed.
>
> # resmon -t PIDx,PIDy -s 1 -I 10000
>
> returns the sum of the counts of PIDx and PIDy.
>
> RMID Allocation
> ---------------
>
> This would remain the same as in design version 1, where we support per
> package RMIDs and return an error when we run out of RMIDs due to the
> h/w limited number of RMIDs.
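>
> For illustration only, the per-package pool could be as simple as the
> sketch below (field and function names are made up here, not taken from
> the patches):
>
>         /* one RMID pool per package; allocation fails when the package
>          * runs out of RMIDs and the error is reported to the user
>          */
>         struct pkg_rmid_pool {
>                 unsigned long   *free_map;  /* bitmap, one bit per free RMID;
>                                              * bit 0 never set: RMID 0 is the
>                                              * default/root RMID */
>                 unsigned int    max_rmid;   /* highest RMID on this package */
>                 raw_spinlock_t  lock;
>         };
>
>         static int alloc_rmid(struct pkg_rmid_pool *pool)
>         {
>                 int rmid;
>
>                 raw_spin_lock(&pool->lock);
>                 rmid = find_first_bit(pool->free_map, pool->max_rmid + 1);
>                 if (rmid > pool->max_rmid) {
>                         raw_spin_unlock(&pool->lock);
>                         return -ENOSPC;     /* out of RMIDs on this package */
>                 }
>                 clear_bit(rmid, pool->free_map);
>                 raw_spin_unlock(&pool->lock);
>                 return rmid;
>         }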