On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner <t...@linutronix.de> wrote:
>
> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
> >
> > If resctrl groups could lift the restriction of one resctl per CLOSID,
> > then the user can create many resctrl in the way perf cgroups are
> > created now. The advantage is that there wont be cgroup hierarchy!
> > making things much simpler. Also no need to optimize perf event
> > context switch to make llc_occupancy work.
>
> So if I understand you correctly, then you want a mechanism to have groups
> of entities (tasks, cpus) and associate them to a particular resource
> control group.
>
> So they share the CLOSID of the control group and each entity group can
> have its own RMID.
>
> Now you want to be able to move the entity groups around between control
> groups without losing the RMID associated to the entity group.
>
> So the whole picture would look like this:
>
> rdt -> CTRLGRP -> CLOSID
>
> mon -> MONGRP -> RMID
>
> And you want to move MONGRP from one CTRLGRP to another.

Almost, but not quite. My idea is to have MONGRP and CTRLGRP be the
same thing. Details below.

>
> Can you please write up in a abstract way what the design requirements are
> that you need. So far we are talking about implementation details and
> unspecfied wishlists, but what we really need is an abstract requirement.

My pleasure:

Design Proposal for Monitoring of RDT Allocation Groups.
-----------------------------------------------------------------------------

Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
cache bitmask (CBM) per resource. Non-unique CBMs are possible, although
useless. Requiring a unique CLOSID per CTRLGRP means there cannot be
more CTRLGRPs than physical CLOSIDs, and CLOSIDs are much scarcer than
RMIDs.

If we lift the condition of unique CLOSID, then the user can create
multiple CTRLGRPs with the same schemata. Internally, those CTRLGRPs
would share the CLOSID, and RDT_Allocation must maintain the schemata to
CLOSID relationship (similarly to what the previous CAT driver used to
do).

Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as now:
adding an element removes it from its previous CTRLGRP.

This change would allow further partitioning the allocation groups into
(allocation, monitoring) groups as follows:

With allocation only:

            CTRLGRP0      CTRLGRP_ALLOC_ONLY
schemata:   L3:0=0xff0    L3:0=0x00f
tasks:      PID0          P0_0,P0_1,P1_0,P1_1
cpus:       0x3           0xC

If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
independently, with the new model we could create:

            CTRLGRP0      CTRLGRP1      CTRLGRP2      CTRLGRP3
schemata:   L3:0=0xff0    L3:0=0x00f    L3:0=0x00f    L3:0=0x00f
tasks:      PID0          <none>        P0_0,P0_1     P1_0,P1_1
cpus:       0x3           0xC           0x0           0x0

Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP3 would share the CLOSID for
(L3,0).

Now we can ask perf to monitor any of the CTRLGRPs independently, once
we solve how to tell perf which (CTRLGRP, resource_id) pair to monitor.
The perf_event will reserve and assign the RMID to the monitored CTRLGRP.
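
To make the schemata to CLOSID bookkeeping described above concrete,
here is a rough user-space model in Python. It is only a sketch of the
idea, not kernel code: the class ClosidAllocator and its methods are
names made up for this illustration, and a schemata is treated as an
opaque string.

```python
class ClosidAllocator:
    """Toy model: CTRLGRPs with identical schemata share one CLOSID."""

    def __init__(self, num_closids):
        self.free = list(range(num_closids))  # unused hardware CLOSIDs
        self.by_schemata = {}                 # schemata -> [closid, refcount]
        self.groups = {}                      # CTRLGRP name -> schemata

    def _get(self, schemata):
        # Reuse the CLOSID already bound to this schemata, else take a free one.
        entry = self.by_schemata.get(schemata)
        if entry is None:
            if not self.free:
                raise RuntimeError("out of CLOSIDs")
            entry = self.by_schemata[schemata] = [self.free.pop(0), 0]
        entry[1] += 1
        return entry[0]

    def _put(self, schemata):
        # Drop one reference; release the CLOSID when no group uses it.
        entry = self.by_schemata[schemata]
        entry[1] -= 1
        if entry[1] == 0:
            self.free.append(entry[0])
            del self.by_schemata[schemata]

    def create(self, name, schemata):
        closid = self._get(schemata)
        self.groups[name] = schemata
        return closid

    def remove(self, name):
        self._put(self.groups.pop(name))

    def closid_of(self, name):
        return self.by_schemata[self.groups[name]][0]


# The four-group example from the table above, on hardware with 2 CLOSIDs.
alloc = ClosidAllocator(num_closids=2)
alloc.create("CTRLGRP0", "L3:0=0xff0")
for name in ("CTRLGRP1", "CTRLGRP2", "CTRLGRP3"):
    alloc.create(name, "L3:0=0x00f")
```

In this model four CTRLGRPs fit in two CLOSIDs because three of them
have the same schemata; creating a group with a third distinct schemata
fails until some CLOSID loses its last user.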
The RDT subsystem will context switch the whole PQR_ASSOC MSR (CLOSID
and RMID), so perf won't have to.

If a CTRLGRP's schemata changes, the RDT subsystem will find a new
CLOSID for the new schemata (potentially reusing an existing one) or
fail (just like the old CAT used to). The RMID does not change during
schemata updates.

If a CTRLGRP dies, the monitoring perf_event continues to exist as a
useless wraith, just as happens with cgroup events now.

Since CTRLGRPs have no hierarchy, there is no need to handle that in
the new RDT Monitoring PMU, greatly simplifying it over the previously
proposed versions.

A breaking change in user observed behavior with respect to the
existing CQM PMU is that there wouldn't be task events. A task must be
part of a CTRLGRP and events are created per (CTRLGRP, resource_id)
pair. If a user wants to monitor a task across multiple resources (e.g.
l3_occupancy across two packages), she must create one event per
resource_id and add the two counts.

I see this breaking change as an improvement, since hiding the cache
topology from user space introduced lots of ugliness and complexity to
the CQM PMU without improving accuracy over user space adding the
events.

Implementation ideas:

First idea is to expose one monitoring file per resource in a CTRLGRP,
so the list of a CTRLGRP's files would be:

  schemata, tasks, cpus, monitor_l3_0, monitor_l3_1, ...

The monitor_<resource_id> file descriptor is passed to perf_event_open
in the way cgroup file descriptors are passed now. All events on the
same (CTRLGRP, resource_id) pair share an RMID.

The RMID allocation part can either be handled by RDT Allocation or by
the RDT Monitoring PMU. Either way, the existence of the PMU's
perf_events allocates/releases the RMID.

Also, since this new design removes hierarchy and task events, it
allows for a simple solution of the RMID rotation problem.
The removal of task events eliminates the cgroup vs task event conflict
existing in the upstream version; it also removes the need to ensure
that all active packages have RMIDs at the same time, a requirement
that added complexity to my version of CQM/CMT. Lastly, the removal of
hierarchy removes the reliance on cgroups, the complex tree-based read,
and all the hooks and cgroup files that intruded on the cgroup
subsystem.

Thoughts?

Thanks,
David