On 2021-10-26 17:26, Alex Deucher wrote:
> On Tue, Oct 19, 2021 at 9:54 AM Luben Tuikov <luben.tui...@amd.com> wrote:
>> It again fails with the same message!
>> But this time it is different!
>> Here's why:
>>
>> openat(AT_FDCWD, "/sys/class/drm/card0/device/pp_dpm_fclk", O_RDONLY) = 3
>> read(3, "0: 571Mhz \n1: 1274Mhz *\n2: 1221M"..., 8191) = 36
>> read(3, "", 8191)                       = 0
>> close(3)                                = 0
>> write(2, "python3: /home/ltuikov/proj/amd/"..., 220python3: 
>> /home/ltuikov/proj/amd/rocm_smi_lib/src/rocm_smi.cc:913: rsmi_status_t 
>> get_frequencies(amd::smi::DevInfoTypes, uint32_t, rsmi_frequencies_t*, 
>> uint32_t*): Assertion `f->frequency[i-1] <= f->frequency[i]' failed.
>> ) = 220
>> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
>> 0x7f531f9bc000
>> rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
>> rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
>> getpid()                                = 37861
>> gettid()                                = 37861
>> tgkill(37861, 37861, SIGABRT)           = 0
>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>> --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=37861, si_uid=1000} 
>> ---
>> +++ killed by SIGABRT (core dumped) +++
>> Aborted (core dumped)
>> $cat /sys/class/drm/card0/device/pp_dpm_fclk
>> 0: 571Mhz
>> 1: 1274Mhz *
>> 2: 1221Mhz
>> $_
>>
>> Why is the mid frequency larger than the last?
>> Why does get_frequencies() insists that they be ordered when they're not? 
>> (Does the tool need fixing or the kernel?)
>>
>> The current patchset doesn't report 0, and doesn't report any current if 0 
>> would've been reported as current. But anything else is reported as it 
>> would've been reported before the patch. And I tested it with vanilla 
>> amd-staging-drm-next--same thing.
>>
> Seems to crash both ways.  I'd rather either:
> 1. Remove the * when the clock is outside of the min and max ranges
> or
> 2. Clamp the clock to the max or min if it's above or below.
>
> And then fix the tools accordingly.  Those seem like the choices of
> least surprise considering the interface is supposed to show the
> current and available DPM levels.

So, if we get a case such as the one above, we remove the whole line at 1:, not 
just the asterisk, right? (for option 1 above).
The rocm-smi lib would fail if they're out of order, so only removing the * 
char would still confuse the tool.

What does it mean when the current frequency (line 1:) is above the "max" (line 
2:) as shown in my output above?

Do we perhaps want to sort them and report current still, and swap lines 1 and 
2?

Or should we simply remove the ordering requirement in rocm-smi lib?

Regards,
Luben

>
> Alex
>
>
>> Regards,
>> Luben
>>
>>
>> On 2021-10-19 09:25, Russell, Kent wrote:
>>
>> [AMD Official Use Only]
>>
>>
>>
>> It was the rocm-smi -c flag. Maybe some work was done to make it more 
>> robust, that would be nice. But the -c flag is supposed to show the current 
>> frequency for each clock type. -g would do the same, but just for SCLK.
>>
>>
>>
>> Kent
>>
>>
>>
>> From: Tuikov, Luben <luben.tui...@amd.com>
>> Sent: Tuesday, October 19, 2021 12:27 AM
>> To: Russell, Kent <kent.russ...@amd.com>; Deucher, Alexander 
>> <alexander.deuc...@amd.com>; Quan, Evan <evan.q...@amd.com>; Lazar, Lijo 
>> <lijo.la...@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Kasiviswanathan, Harish <harish.kasiviswanat...@amd.com>
>> Subject: Re: [PATCH 0/5] 0 MHz is not a valid current frequency
>>
>>
>>
>> Kent,
>>
>> What is the command which fails?
>> I can try to duplicate it here.
>>
>> So far, things I've tried, I cannot make rocm-smi fail. Command arguments?
>>
>> Regards,
>> Luben
>>
>> On 2021-10-18 21:06, Russell, Kent wrote:
>>
>> [AMD Official Use Only]
>>
>>
>>
>> The * is required for the rocm-smi’s functionality for showing what the 
>> current clocks are. We had a bug before where the * was removed, then the 
>> SMI died fantastically. Work could be done to try to handle that type of 
>> situation, but the SMI has a “show current clocks” and uses the * to 
>> determine which one is active
>>
>>
>>
>> Kent
>>
>>
>>
>> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Russell, 
>> Kent
>> Sent: Monday, October 18, 2021 9:05 PM
>> To: Tuikov, Luben <luben.tui...@amd.com>; Deucher, Alexander 
>> <alexander.deuc...@amd.com>; Quan, Evan <evan.q...@amd.com>; Lazar, Lijo 
>> <lijo.la...@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Kasiviswanathan, Harish <harish.kasiviswanat...@amd.com>
>> Subject: RE: [PATCH 0/5] 0 MHz is not a valid current frequency
>>
>>
>>
>> [AMD Official Use Only]
>>
>>
>>
>> +Harish, rocm-smi falls under his purview now.
>>
>>
>>
>> Kent
>>
>>
>>
>> From: Tuikov, Luben <luben.tui...@amd.com>
>> Sent: Monday, October 18, 2021 4:30 PM
>> To: Deucher, Alexander <alexander.deuc...@amd.com>; Quan, Evan 
>> <evan.q...@amd.com>; Lazar, Lijo <lijo.la...@amd.com>; 
>> amd-gfx@lists.freedesktop.org; Russell, Kent <kent.russ...@amd.com>
>> Subject: Re: [PATCH 0/5] 0 MHz is not a valid current frequency
>>
>>
>>
>> I think Kent is already seen these patches as he did comment on 1/5 patch.
>>
>> The v3 version of the patch, posted last week, removes the asterisk to 
>> report the lowest frequency as the current frequency, when the current 
>> frequency is 0, i.e. when the block is in low power state. Does the tool 
>> rely on the asterisk? If this information is necessary could it not use 
>> amdgpu_pm_info?
>>
>> Regards,
>> Luben
>>
>> On 2021-10-18 16:19, Deucher, Alexander wrote:
>>
>> [Public]
>>
>>
>>
>> We the current behavior (0 for clock) already crashes the tool, so I don't 
>> think we can really make things worse.
>>
>>
>>
>> Alex
>>
>>
>>
>> ________________________________
>>
>> From: Quan, Evan <evan.q...@amd.com>
>> Sent: Thursday, October 14, 2021 10:25 PM
>> To: Lazar, Lijo <lijo.la...@amd.com>; Tuikov, Luben <luben.tui...@amd.com>; 
>> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Russell, Kent 
>> <kent.russ...@amd.com>
>> Cc: Deucher, Alexander <alexander.deuc...@amd.com>
>> Subject: RE: [PATCH 0/5] 0 MHz is not a valid current frequency
>>
>>
>>
>> [AMD Official Use Only]
>>
>>
>>
>> +Kent who maintains the Rocm tool
>>
>>
>>
>> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Lazar, 
>> Lijo
>> Sent: Thursday, October 14, 2021 1:07 AM
>> To: Tuikov, Luben <luben.tui...@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <alexander.deuc...@amd.com>
>> Subject: Re: [PATCH 0/5] 0 MHz is not a valid current frequency
>>
>>
>>
>> [AMD Official Use Only]
>>
>>
>>
>> [AMD Official Use Only]
>>
>>
>>
>>> Or maybe just a list without default hint, i.e. no asterisk?
>>
>>
>> I think this is also fine meaning we are having trouble in determining the 
>> current frequency or DPM level. Evan/Alex? Don't know if this will crash the 
>> tools.
>>
>>
>>
>> Thanks,
>> Lijo
>>
>> ________________________________
>>
>> From: Tuikov, Luben <luben.tui...@amd.com>
>> Sent: Wednesday, October 13, 2021 9:52:09 PM
>> To: Lazar, Lijo <lijo.la...@amd.com>; amd-gfx@lists.freedesktop.org 
>> <amd-gfx@lists.freedesktop.org>
>> Cc: Deucher, Alexander <alexander.deuc...@amd.com>
>> Subject: Re: [PATCH 0/5] 0 MHz is not a valid current frequency
>>
>>
>>
>> On 2021-10-13 00:14, Lazar, Lijo wrote:
>>> On 10/13/2021 8:40 AM, Luben Tuikov wrote:
>>>> Some ASIC support low-power functionality for the whole ASIC or just
>>>> an IP block. When in such low-power mode, some sysfs interfaces would
>>>> report a frequency of 0, e.g.,
>>>>
>>>> $cat /sys/class/drm/card0/device/pp_dpm_sclk
>>>> 0: 500Mhz
>>>> 1: 0Mhz *
>>>> 2: 2200Mhz
>>>> $_
>>>>
>>>> An operating frequency of 0 MHz doesn't make sense, and this interface
>>>> is designed to report only operating clock frequencies, i.e. non-zero,
>>>> and possibly the current one.
>>>>
>>>> When in this low-power state, round to the smallest
>>>> operating frequency, for this interface, as follows,
>>>>
>>> Would rather avoid this -
>>>
>>> 1) It is manipulating FW reported value. If at all there is an uncaught
>>> issue in FW reporting of frequency values, that is masked here.
>>> 2) Otherwise, if 0MHz is described as GFX power gated case, this
>>> provides a convenient interface to check if GFX is power gated.
>>>
>>> If seeing a '0' is not pleasing, consider changing to something like
>>>        "NA" - not available (frequency cannot be fetched at the moment).
>> There's a ROCm tool which literally asserts if the values are not ordered in 
>> increasing order. Now since 0 < 550, but 0 is listed as the second entry, 
>> the tool simply asserts and crashes.
>>
>> It is not clear what you'd rather see here:
>>
>> $cat /sys/class/drm/card0/device/pp_dpm_sclk
>> 0: 550Mhz
>> 1: N/A *
>> 2: 2200MHz
>> $_
>>
>> Is this what you want to see? (That'll crash other tools which expect %uMhz.)
>>
>> Or maybe just a list without default hint, i.e. no asterisk?
>>
>> $cat /sys/class/drm/card0/device/pp_dpm_sclk
>> 0: 550Mhz
>> 1: 2200MHz
>> $_
>>
>> What should the output be?
>>
>> We want to avoid showing 0, but still show numbers.
>>
>> Regards,
>> Luben
>>
>>> Thanks,
>>> Lijo
>>>
>>>> $cat /sys/class/drm/card0/device/pp_dpm_sclk
>>>> 0: 500Mhz *
>>>> 1: 2200Mhz
>>>> $_
>>>>
>>>> Luben Tuikov (5):
>>>>    drm/amd/pm: Slight function rename
>>>>    drm/amd/pm: Rename cur_value to curr_value
>>>>    drm/amd/pm: Rename freq_values --> freq_value
>>>>    dpm/amd/pm: Sienna: 0 MHz is not a current clock frequency
>>>>    dpm/amd/pm: Navi10: 0 MHz is not a current clock frequency
>>>>
>>>>   .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 60 +++++++++------
>>>>   .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 73 ++++++++++++-------
>>>>   2 files changed, 86 insertions(+), 47 deletions(-)
>>>>
>>
>>
>>
>>
>>

Reply via email to