On 2/20/25 10:27, Lucas De Marchi wrote:
> On Thu, Feb 20, 2025 at 08:28:01AM -0800, Dave Hansen wrote:
>> On 2/20/25 07:36, Lucas De Marchi wrote:
>>> On some boots the read of MSR_PP1_ENERGY_STATUS msr returns 0, causing
>>> perf_msr_probe() to make the power/events/energy-gpu event non-visible.
>>> When that happens, the msr always read 0 until the graphics module (i915
>>> for Meteor Lake, xe for Lunar Lake) is loaded. Then it starts returning
>>> something different and re-loading the rapl module "fixes" it.
>>
>> What's the root cause here? Did the kernel do something funky? Or is
>> this a hardware bug?
>
> From what I can see, the kernel is reading the value and deciding that "if
> it's 0, it doesn't really have that", which is not really true. For
> these platforms sometimes it keeps returning 0 until the gpu is
> later powered on, which only happens when xe / i915 probes.
>
> But what I don't really understand is why the behavior changes from one
> boot to another. I'm assuming it depends on some funky firmware
> behavior.

Could we root cause this a _bit_ better, please?

Right now, it seems like you noted some weird behavior on one out of the
22 "model_skl" CPUs. You then tested on at least 4 of those CPUs and
found similar behavior. So, you copied, verbatim, the
intel_rapl_skl_msrs and model_skl structures. Then, flipped the
perf_msr->no_check bit for one of the 5 MSRs. There's no note on why the
one bit got flipped or that it's a presumed CPU issue.
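
For reference, here is roughly how I read the effect of that bit, written as a
userspace model and not the kernel code itself. Everything named here
(struct msr_entry, probe_msrs(), fake_read(), gpu_powered, FAKE_PP1) is made up
for illustration, and the "PP1 reads 0 until the GPU driver probes" behavior is
just the symptom described in this thread:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_PP1 0x641u /* illustrative: address of MSR_PP1_ENERGY_STATUS */

/* Simplified stand-in for struct perf_msr; the field set is an assumption. */
struct msr_entry {
	uint32_t msr;      /* MSR address */
	bool     no_check; /* true: accept the entry without reading the MSR */
	uint64_t mask;     /* bits of the MSR that carry the counter */
	bool     visible;  /* result: would the event be exposed? */
};

/* Hypothetical reader: returns 0 on success and stores the value in *val. */
typedef int (*msr_read_fn)(uint32_t msr, uint64_t *val);

/*
 * Model of the probe: entries with no_check set are accepted unread.
 * All others are hidden if the read fails or the masked value is zero.
 * Returns a bitmask of the accepted entries.
 */
unsigned long probe_msrs(struct msr_entry *e, size_t cnt, msr_read_fn rd)
{
	unsigned long avail = 0;

	if (cnt >= sizeof(unsigned long) * 8)
		return 0;

	for (size_t i = 0; i < cnt; i++) {
		if (!e[i].no_check) {
			uint64_t val;

			e[i].visible = false;
			if (rd(e[i].msr, &val))
				continue;
			if (!(val & e[i].mask))
				continue;
		}
		e[i].visible = true;
		avail |= 1UL << i;
	}
	return avail;
}

/* Fake platform state: PP1 reads 0 until the GPU driver has probed. */
static bool gpu_powered;

int fake_read(uint32_t msr, uint64_t *val)
{
	*val = (msr == FAKE_PP1 && !gpu_powered) ? 0 : 1234;
	return 0;
}
```

In this model, a PP1 entry with no_check == false is dropped while gpu_powered
is false, and the same entry with no_check == true is accepted without the MSR
ever being read.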

To continue the trajectory that this patch sets us on, each CPU model
that comes out needs to be tested. When a new CPU shows up, which one is
it? "model_skl" with the (presumed) CPU bug fixed or "model_rpl"
without? How would someone even know how to test it? It's certainly not
documented in the code.
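
The only manual check I can think of is to read the MSR from userspace before
the graphics driver loads and see whether it comes back as zero. A minimal
sketch, assuming the msr driver is loaded, the program runs as root, and the
counter sits in the low 32 bits (that mask is my assumption about what
RAPL_MSR_MASK covers). read_msr() and energy_reads_zero() are hypothetical
helper names:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PP1_ENERGY_STATUS 0x641u     /* address of MSR_PP1_ENERGY_STATUS */
#define ENERGY_MASK 0xFFFFFFFFull    /* assumed: counter in the low 32 bits */

/*
 * Read one MSR through /dev/cpu/<cpu>/msr. The file offset selects the
 * MSR address. Returns 0 on success, -1 on any failure.
 */
int read_msr(int cpu, uint32_t msr, uint64_t *val)
{
	char path[64];
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	if (lseek(fd, (off_t)msr, SEEK_SET) == (off_t)-1) {
		close(fd);
		return -1;
	}
	n = read(fd, val, sizeof(*val));
	close(fd);

	return n == (ssize_t)sizeof(*val) ? 0 : -1;
}

/* Returns 1 if the masked counter is zero, the case the probe rejects. */
int energy_reads_zero(uint64_t raw)
{
	return (raw & ENERGY_MASK) == 0;
}
```

Calling read_msr(0, PP1_ENERGY_STATUS, &v) once before and once after i915/xe
probes, and passing v to energy_reads_zero(), would show whether a given
machine has the behavior. That is still a per-machine test, which is exactly
the problem.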

I don't think that's a sustainable trajectory.

We need to figure out whether the kernel is buggy or the hardware is buggy.

If the hardware is buggy, we need to go ask the hardware guys to publish
an erratum about the bug so there are *bounds* on where the issue shows
up. Basically make the hardware guys document the nasty behavior instead
of having us test every CPU.

Or, if we simply can't trust MSR_PP1_ENERGY_STATUS, let's just do the
attached patch. What's the downside on a non-buggy CPU of doing this?

diff --git a/arch/x86/events/rapl.c b/arch/x86/events/rapl.c
index d3bb3865c1b1f..5bf7c68696f33 100644
--- a/arch/x86/events/rapl.c
+++ b/arch/x86/events/rapl.c
@@ -580,7 +580,7 @@ static struct perf_msr intel_rapl_msrs[] = {
 	[PERF_RAPL_PP0] = { MSR_PP0_ENERGY_STATUS, &rapl_events_cores_group, test_msr, false, RAPL_MSR_MASK },
 	[PERF_RAPL_PKG] = { MSR_PKG_ENERGY_STATUS, &rapl_events_pkg_group, test_msr, false, RAPL_MSR_MASK },
 	[PERF_RAPL_RAM] = { MSR_DRAM_ENERGY_STATUS, &rapl_events_ram_group, test_msr, false, RAPL_MSR_MASK },
-	[PERF_RAPL_PP1] = { MSR_PP1_ENERGY_STATUS, &rapl_events_gpu_group, test_msr, false, RAPL_MSR_MASK },
+	[PERF_RAPL_PP1] = { MSR_PP1_ENERGY_STATUS, &rapl_events_gpu_group, test_msr, true, RAPL_MSR_MASK },
 	[PERF_RAPL_PSYS] = { MSR_PLATFORM_ENERGY_STATUS, &rapl_events_psys_group, test_msr, false, RAPL_MSR_MASK },
 };