On 2021-10-26 17:26, Alex Deucher wrote: > On Tue, Oct 19, 2021 at 9:54 AM Luben Tuikov <luben.tui...@amd.com> wrote: >> It again fails with the same message! >> But this time it is different! >> Here's why: >> >> openat(AT_FDCWD, "/sys/class/drm/card0/device/pp_dpm_fclk", O_RDONLY) = 3 >> read(3, "0: 571Mhz \n1: 1274Mhz *\n2: 1221M"..., 8191) = 36 >> read(3, "", 8191) = 0 >> close(3) = 0 >> write(2, "python3: /home/ltuikov/proj/amd/"..., 220python3: >> /home/ltuikov/proj/amd/rocm_smi_lib/src/rocm_smi.cc:913: rsmi_status_t >> get_frequencies(amd::smi::DevInfoTypes, uint32_t, rsmi_frequencies_t*, >> uint32_t*): Assertion `f->frequency[i-1] <= f->frequency[i]' failed. >> ) = 220 >> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = >> 0x7f531f9bc000 >> rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0 >> rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0 >> getpid() = 37861 >> gettid() = 37861 >> tgkill(37861, 37861, SIGABRT) = 0 >> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 >> --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=37861, si_uid=1000} >> --- >> +++ killed by SIGABRT (core dumped) +++ >> Aborted (core dumped) >> $cat /sys/class/drm/card0/device/pp_dpm_fclk >> 0: 571Mhz >> 1: 1274Mhz * >> 2: 1221Mhz >> $_ >> >> Why is the mid frequency larger than the last? >> Why does get_frequencies() insists that they be ordered when they're not? >> (Does the tool need fixing or the kernel?) >> >> The current patchset doesn't report 0, and doesn't report any current if 0 >> would've been reported as current. But anything else is reported as it >> would've been reported before the patch. And I tested it with vanilla >> amd-staging-drm-next--same thing. >> > Seems to crash both ways. I'd rather either: > 1. Remove the * when the clock is outside of the min and max ranges > or > 2. Clamp the clock to the max or min if it's above or below. > > And then fix the tools accordingly. Those seem like the choices of > least surprise considering the interface is supposed to show the > current and available DPM levels.
So, if we get a case such as the one above, we remove the whole line at 1:, not just the asterisk, right? (for option 1 above). The rocm-smi lib would fail if they're out of order, so only removing the * char would still confuse the tool. What does it mean when the current frequency (line 1:) is above the "max" (line 2:) as shown in my output above? Do we perhaps want to sort them and report current still, and swap lines 1 and 2? Or should we simply remove the ordering requirement in rocm-smi lib? Regards, Luben > > Alex > > >> Regards, >> Luben >> >> >> On 2021-10-19 09:25, Russell, Kent wrote: >> >> [AMD Official Use Only] >> >> >> >> It was the rocm-smi -c flag. Maybe some work was done to make it more >> robust, that would be nice. But the -c flag is supposed to show the current >> frequency for each clock type. -g would do the same, but just for SCLK. >> >> >> >> Kent >> >> >> >> From: Tuikov, Luben <luben.tui...@amd.com> >> Sent: Tuesday, October 19, 2021 12:27 AM >> To: Russell, Kent <kent.russ...@amd.com>; Deucher, Alexander >> <alexander.deuc...@amd.com>; Quan, Evan <evan.q...@amd.com>; Lazar, Lijo >> <lijo.la...@amd.com>; amd-gfx@lists.freedesktop.org >> Cc: Kasiviswanathan, Harish <harish.kasiviswanat...@amd.com> >> Subject: Re: [PATCH 0/5] 0 MHz is not a valid current frequency >> >> >> >> Kent, >> >> What is the command which fails? >> I can try to duplicate it here. >> >> So far, things I've tried, I cannot make rocm-smi fail. Command arguments? >> >> Regards, >> Luben >> >> On 2021-10-18 21:06, Russell, Kent wrote: >> >> [AMD Official Use Only] >> >> >> >> The * is required for the rocm-smi’s functionality for showing what the >> current clocks are. We had a bug before where the * was removed, then the >> SMI died fantastically. Work could be done to try to handle that type of >> situation, but the SMI has a “show current clocks” and uses the * to >> determine which one is active >> >> >> >> Kent >> >> >> >> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Russell, >> Kent >> Sent: Monday, October 18, 2021 9:05 PM >> To: Tuikov, Luben <luben.tui...@amd.com>; Deucher, Alexander >> <alexander.deuc...@amd.com>; Quan, Evan <evan.q...@amd.com>; Lazar, Lijo >> <lijo.la...@amd.com>; amd-gfx@lists.freedesktop.org >> Cc: Kasiviswanathan, Harish <harish.kasiviswanat...@amd.com> >> Subject: RE: [PATCH 0/5] 0 MHz is not a valid current frequency >> >> >> >> [AMD Official Use Only] >> >> >> >> +Harish, rocm-smi falls under his purview now. >> >> >> >> Kent >> >> >> >> From: Tuikov, Luben <luben.tui...@amd.com> >> Sent: Monday, October 18, 2021 4:30 PM >> To: Deucher, Alexander <alexander.deuc...@amd.com>; Quan, Evan >> <evan.q...@amd.com>; Lazar, Lijo <lijo.la...@amd.com>; >> amd-gfx@lists.freedesktop.org; Russell, Kent <kent.russ...@amd.com> >> Subject: Re: [PATCH 0/5] 0 MHz is not a valid current frequency >> >> >> >> I think Kent is already seen these patches as he did comment on 1/5 patch. >> >> The v3 version of the patch, posted last week, removes the asterisk to >> report the lowest frequency as the current frequency, when the current >> frequency is 0, i.e. when the block is in low power state. Does the tool >> rely on the asterisk? If this information is necessary could it not use >> amdgpu_pm_info? >> >> Regards, >> Luben >> >> On 2021-10-18 16:19, Deucher, Alexander wrote: >> >> [Public] >> >> >> >> We the current behavior (0 for clock) already crashes the tool, so I don't >> think we can really make things worse. >> >> >> >> Alex >> >> >> >> ________________________________ >> >> From: Quan, Evan <evan.q...@amd.com> >> Sent: Thursday, October 14, 2021 10:25 PM >> To: Lazar, Lijo <lijo.la...@amd.com>; Tuikov, Luben <luben.tui...@amd.com>; >> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Russell, Kent >> <kent.russ...@amd.com> >> Cc: Deucher, Alexander <alexander.deuc...@amd.com> >> Subject: RE: [PATCH 0/5] 0 MHz is not a valid current frequency >> >> >> >> [AMD Official Use Only] >> >> >> >> +Kent who maintains the Rocm tool >> >> >> >> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Lazar, >> Lijo >> Sent: Thursday, October 14, 2021 1:07 AM >> To: Tuikov, Luben <luben.tui...@amd.com>; amd-gfx@lists.freedesktop.org >> Cc: Deucher, Alexander <alexander.deuc...@amd.com> >> Subject: Re: [PATCH 0/5] 0 MHz is not a valid current frequency >> >> >> >> [AMD Official Use Only] >> >> >> >> [AMD Official Use Only] >> >> >> >>> Or maybe just a list without default hint, i.e. no asterisk? >> >> >> I think this is also fine meaning we are having trouble in determining the >> current frequency or DPM level. Evan/Alex? Don't know if this will crash the >> tools. >> >> >> >> Thanks, >> Lijo >> >> ________________________________ >> >> From: Tuikov, Luben <luben.tui...@amd.com> >> Sent: Wednesday, October 13, 2021 9:52:09 PM >> To: Lazar, Lijo <lijo.la...@amd.com>; amd-gfx@lists.freedesktop.org >> <amd-gfx@lists.freedesktop.org> >> Cc: Deucher, Alexander <alexander.deuc...@amd.com> >> Subject: Re: [PATCH 0/5] 0 MHz is not a valid current frequency >> >> >> >> On 2021-10-13 00:14, Lazar, Lijo wrote: >>> On 10/13/2021 8:40 AM, Luben Tuikov wrote: >>>> Some ASIC support low-power functionality for the whole ASIC or just >>>> an IP block. When in such low-power mode, some sysfs interfaces would >>>> report a frequency of 0, e.g., >>>> >>>> $cat /sys/class/drm/card0/device/pp_dpm_sclk >>>> 0: 500Mhz >>>> 1: 0Mhz * >>>> 2: 2200Mhz >>>> $_ >>>> >>>> An operating frequency of 0 MHz doesn't make sense, and this interface >>>> is designed to report only operating clock frequencies, i.e. non-zero, >>>> and possibly the current one. >>>> >>>> When in this low-power state, round to the smallest >>>> operating frequency, for this interface, as follows, >>>> >>> Would rather avoid this - >>> >>> 1) It is manipulating FW reported value. If at all there is an uncaught >>> issue in FW reporting of frequency values, that is masked here. >>> 2) Otherwise, if 0MHz is described as GFX power gated case, this >>> provides a convenient interface to check if GFX is power gated. >>> >>> If seeing a '0' is not pleasing, consider changing to something like >>> "NA" - not available (frequency cannot be fetched at the moment). >> There's a ROCm tool which literally asserts if the values are not ordered in >> increasing order. Now since 0 < 550, but 0 is listed as the second entry, >> the tool simply asserts and crashes. >> >> It is not clear what you'd rather see here: >> >> $cat /sys/class/drm/card0/device/pp_dpm_sclk >> 0: 550Mhz >> 1: N/A * >> 2: 2200MHz >> $_ >> >> Is this what you want to see? (That'll crash other tools which expect %uMhz.) >> >> Or maybe just a list without default hint, i.e. no asterisk? >> >> $cat /sys/class/drm/card0/device/pp_dpm_sclk >> 0: 550Mhz >> 1: 2200MHz >> $_ >> >> What should the output be? >> >> We want to avoid showing 0, but still show numbers. >> >> Regards, >> Luben >> >>> Thanks, >>> Lijo >>> >>>> $cat /sys/class/drm/card0/device/pp_dpm_sclk >>>> 0: 500Mhz * >>>> 1: 2200Mhz >>>> $_ >>>> >>>> Luben Tuikov (5): >>>> drm/amd/pm: Slight function rename >>>> drm/amd/pm: Rename cur_value to curr_value >>>> drm/amd/pm: Rename freq_values --> freq_value >>>> dpm/amd/pm: Sienna: 0 MHz is not a current clock frequency >>>> dpm/amd/pm: Navi10: 0 MHz is not a current clock frequency >>>> >>>> .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c | 60 +++++++++------ >>>> .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 73 ++++++++++++------- >>>> 2 files changed, 86 insertions(+), 47 deletions(-) >>>> >> >> >> >> >>