On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> On Tue, 17 Jan 2017, Shivappa Vikas wrote:
>> On Tue, 17 Jan 2017, Thomas Gleixner wrote:
>> > On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>> > > - Issue(1): Inaccurate data for per package data, systemwide. Just prints
>> > > zeros or arbitrary numbers.
>> > >
>> > > Fix: Patches fix this by just throwing an error if the mode is not
>> > > supported.
>> > > The modes supported is task monitoring and cgroup monitoring.
>> > > Also the per package
>> > > data for say socket x is returned with the -C <cpu on socketx> -G cgrpy
>> > > option.
>> > > The systemwide data can be looked up by monitoring root cgroup.
>> >
>> > Fine. That just lacks any comment in the implementation. Otherwise I would
>> > not have asked the question about cpu monitoring. Though I fundamentaly
>> > hate the idea of requiring cgroups for this to work.
>> >
>> > If I just want to look at CPU X why on earth do I have to set up all that
>> > cgroup muck? Just because your main focus is cgroups?
>>
>> The upstream per cpu data is broken because its not overriding the other task
>> event RMIDs on that cpu with the cpu event RMID.
>>
>> Can be fixed by adding a percpu struct to hold the RMID thats affinitized
>> to the cpu, however then we miss all the task llc_occupancy in that - still
>> evaluating it.
>
> The point here is that CQM is closely connected to the cache allocation
> technology. After a lengthy discussion we ended up having
>
>  - per cpu CLOSID
>  - per task CLOSID
>
> where all tasks which do not have a CLOSID assigned use the CLOSID which is
> assigned to the CPU they are running on.
>
> So if I configure a system by simply partitioning the cache per cpu, which
> is the proper way to do it for HPC and RT usecases where workloads are
> partitioned on CPUs as well, then I really want to have an equaly simple
> way to monitor the occupancy for that reservation.
>
> And looking at that from the CAT point of view, which is the proper way to
> do it, makes it obvious that CQM should be modeled to match CAT.
>
> So lets assume the following:
>
>    CPU 0-3   default   CLOSID 0
>    CPU 4               CLOSID 1
>    CPU 5               CLOSID 2
>    CPU 6               CLOSID 3
>    CPU 7               CLOSID 3
>
>    T1                  CLOSID 4
>    T2                  CLOSID 5
>    T3                  CLOSID 6
>    T4                  CLOSID 6
>
>    All other tasks use the per cpu defaults, i.e. the CLOSID of the CPU
>    they run on.
>
> then the obvious basic monitoring requirement is to have a RMID for each
> CLOSID.
>
> So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not
> care at all about the occupancy of T1 simply because that is running on a
> seperate reservation. Trying to make that an aggregated value in the first
> place is completely wrong. If you want an aggregate, which is pretty much
> useless, then user space tools can generate it easily.
>
> The whole approach you and David have taken is to whack some desired cgroup
> functionality and whatever into CQM without rethinking the overall
> design. And that's fundamentaly broken because it does not take cache (and
> memory bandwidth) allocation into account.
>
> I seriously doubt, that the existing CQM/MBM code can be refactored in any
> useful way. As Peter Zijlstra said before: Remove the existing cruft
> completely and start with completely new design from scratch.
>
> And this new design should start from the allocation angle and then add the
> whole other muck on top so far its possible. Allocation related monitoring
> must be the primary focus, everything else is just tinkering.
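To check that I am following the model you describe: at sched_in time the
resolution would look roughly like the sketch below, with one RMID per
CLOSID. All of the names here (cpu_closid, closid_to_rmid, __rdt_sched_in,
MAX_CLOSID) are made up for illustration, this is not the current code:

    #include <linux/percpu.h>
    #include <linux/sched.h>
    #include <asm/msr.h>

    #define MAX_CLOSID 16                   /* illustrative, not the real limit */

    static DEFINE_PER_CPU(u32, cpu_closid); /* CLOSID assigned to the CPU itself */
    static u32 closid_to_rmid[MAX_CLOSID];  /* 1:1 mapping: one RMID per CLOSID */

    static void __rdt_sched_in(struct task_struct *next)
    {
            u32 closid, rmid;

            /* A task without its own CLOSID inherits the CLOSID of the CPU
             * it is running on. */
            closid = next->closid ? next->closid : this_cpu_read(cpu_closid);
            rmid = closid_to_rmid[closid];

            /* IA32_PQR_ASSOC takes the RMID in the low half and the CLOSID
             * in the high half, so allocation and monitoring switch together. */
            wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
    }

With that, monitoring CPU 4 in your example is just reading the RMID mapped
to CLOSID 1, with no aggregation over the tasks that happen to run there.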
If in this email you meant "Resource group" where you wrote "CLOSID", then
please disregard my previous email. It seems like a good idea to me to have a
1:1 mapping between RMIDs and "Resource groups". The distinction matters
because changing the schemata in the resource group would likely trigger a
change of CLOSID, which is useful. (A rough sketch of what I mean is in the
P.S. below.)

Thanks,
David
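P.S. Purely illustrative sketch of the distinction I am drawing; the struct
and function names are made up:

    /* The RMID stays with the resource group for its lifetime; the CLOSID
     * may not, e.g. if rewriting the schemata causes a different CLOSID to
     * be allocated for the group. */
    struct rdt_group {
            u32 closid;     /* backing allocation, may change with the schemata */
            u32 rmid;       /* monitoring id, fixed for the life of the group */
    };

    static void rdtgroup_set_schemata(struct rdt_group *rg, u32 new_closid)
    {
            rg->closid = new_closid;        /* rg->rmid deliberately untouched */
    }

That way the occupancy data follows the group across a schemata change
instead of being tied to whichever CLOSID happens to back it.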