On Wed, 8 Sep 2021 at 02:11, Shane Synan <digitalcircuit36...@gmail.com> wrote:
>
> On 8/24/21 7:21 PM, Shane Synan wrote:
> > The fix hasn't been found, but progress has been made!
> >
> > After further testing, I think I've found a way to recreate this issue
> > with just the router itself, no external USB HDD, no Déjà Dup backup
> > over SFTP, and possibly no extra changes beyond a stock NBG6817
> > OpenWRT build (not confirmed as this router runs my home network,
> > including SQM QoS, VLANs with another WiFi AP, etc).
>
> So far, I've attempted all three suggested fixes, but I had trouble
> implementing one and I'm unsure if I tried the other two correctly.
> Additionally, pinning to "performance" for 1.75 GHz does not solve
> the issue either - more on that near the end.
>
> I've put all of my commits into one branch for easier reference:
> https://github.com/openwrt/openwrt/compare/master...digitalcircuit:ft-fix-ipq8065-reset
>
> And I've used my simplified automatic QA script for verification:
> https://github.com/digitalcircuit/openwrt-ipq806x-qa-cpu-reset#readme
>
> (In theory, anyone should be able to reproduce the issue with this
> script on a stock OpenWRT build. I'll still do testing with the
> Déjà Dup SFTP backup workload.)
>
>
> Suggestions and results in order of attempt:
> (Ignore "ipq8065: force CPUs to share DVFS scaling", wrong method.)
>
>
> 1. Raising clock latency (commits with "clock latency" in subject)
>
> I've tried raising the clock-latency-ns in the ipq8065 DTS by 1000000
> nanoseconds, a deliberately excessive value in the hopes of it being
> enough to notice any issues.
>
> I've tried this for...
>
> * 1.4 GHz and 1.75 GHz
>   (ipq8065: raise 1.4 & 1.75 GHz clock latency)
> * All CPU frequencies
>   (previous + ipq8065: raise all clock latency)
> * All CPU frequencies and L2 cache latency
>   (previous + ipq8065: raise L2 cache, CPU core clock latency)
>
> Unfortunately, as noted in the revert commit, this seemed to have no
> impact on the results from the QA script.
>
> I don't know if I've correctly implemented this suggestion.
>
> QA script log on b1870c2 (.tar.xz due to 12.2 MiB uncompressed size):
> https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20case1%20-%202021-08-30%2022-37-50%20-%20r17395-b1870c2530-branch-ft-fix-ipq8065-reset%20-%20date%20segfault%2C%20reboot%20-%20public.tar.xz
>
>
> 2. Run both cores at the same frequency (most promising?)
>
> I tried to do this (ipq806x: Force CPU cores to share frequency), but
> I think I didn't modify the cpufreq driver in the correct way.
>
> As noted in the revert commit, this didn't appear to force CPUs to
> share frequency, whether manually using the performance governor or
> periodically observing the ondemand governor - the CPU cores were at
> different frequencies.
>
> I'll need help figuring out how to implement this in the cpufreq
> driver correctly. It seems promising given that in the past,
> dual-core bursty workloads didn't seem to trigger the crash.
>
> NOTE: Before diving into implementing this, read the conclusion below
> as I've noticed reboots happen without changing CPU frequency as well.
>
> I'm also not sure how to debug the cpufreq driver in general. With
> dynamic debugging, I can turn on messages about the cpufreq governor,
> but I'm not sure of the right way to add dynamic debugging print
> messages to the cpufreq driver.
>
> Example of dynamic debugging:
> echo "file drivers/cpufreq/* =p" > /sys/kernel/debug/dynamic_debug/control
>
> QA script log on 1fdabd9 (.tar.xz due to 4.4 MiB uncompressed size):
> https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20case1%20-%202021-08-31%2020-38-04%20-%20r17397-1fdabd95db-branch-ft-fix-ipq8065-reset%20-%20date%20segfault%2C%20reboot%20-%20public.tar.xz
>
>
> 3. Add forced frequency transitions between 1.0 GHz and 1.75 GHz
>
> I'm not sure if I implemented this correctly. I made a first attempt
> (ipq806x: Add transitions to 1.0 <> 1.4 <> 1.75 GHz), but if the
> frequency transitions happen, they're too fast to observe. And as
> noted above, I'm not yet sure of the right way to add dynamic
> debugging messages.
>
> Running the QA script in "case1" (toggle 800 MHz to 1.75 GHz) still
> crashes.
>
> QA script log on 52f4f77 (.tar.xz due to 471.8 KiB uncompressed size):
> https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20case1%20-%202021-09-07%2019-58-07%20-%20r17399-52f4f77518-branch-ft-fix-ipq8065-reset%20-%20reboot%20-%20public.tar.xz
>
> Separately, I updated the QA script to add a "ramp1" case which
> smoothly ramps the CPU core frequency up/down from 600 MHz to
> 1.75 GHz, stopping at every frequency in between. Unfortunately,
> this still crashes.
>
> Interestingly, the crash again happens when CPU core frequencies are
> distant from each other (1.75 GHz and 800 MHz). This lends credence
> to the idea of locking CPU frequencies together for 1.4 and 1.75 GHz.
>
> QA script log on 11e9380 (.tar.xz due to 5.8 MiB uncompressed size):
> (11e938030d is from https://github.com/openwrt/openwrt/pull/4464 )
> https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20ramp1%20-%202021-09-01%2000-53-25%20-%20r17372-11e938030d-branch-fix-ipq8065-dts-opp-order%20-%20date%20segfault%2C%20reboot%20-%20public.tar.xz
>
>
> Observations and thoughts:
>
> My best guess involves having the CPUs match frequency at 1.4 GHz and
> above. However, using the "performance" governor at 1.75 GHz does
> NOT allow a full Déjà Dup backup to successfully complete - the
> router still hard-reboots after up to 8 hours of the intermittent
> single-core workload with both core frequencies pinned to 1.75 GHz.
>
> There may be a combination of issues - CPU frequency shifting and CPU
> voltage, perhaps?
>
> I may need to revisit raising the CPU voltage. I had increased it by
> 20000 microvolts in the past for all frequencies without success, but
> perhaps it needs to be raised even higher for 1.4 and/or 1.75 GHz..?
>
> Context for CPU voltage tests:
> https://github.com/openwrt/openwrt/compare/openwrt-21.02...digitalcircuit:openwrt-21.02-cpufreq-dtsivolt-cache-fix-opp-order
> (It's in the last two commits; the earlier commits are backporting.)
>
> I'll need to expand the QA test script to provide a simulated
> single-core bursty workload to see if I can make this aspect of the
> issue easier for others to reproduce. I planned to do this before
> sending this email, but other plans got in the way.
>
> A wild guess: instability triggered by having one CPU core draw power
> for a single-core workload while the other CPU core is idle..?
>
>
> With max frequency set to 1.0 GHz, I haven't observed instability
> jumping between 600 MHz and 1.0 GHz.
> Limiting 'scaling_max_freq' to 1.0 GHz is a slow workaround on stock
> firmware for anyone impacted by this in the way I am (e.g. not being
> able to run a complete backup). This is NOT a fix, just sharing in
> case others would have use for a workaround meanwhile.
>
> I'm unsure if I implemented clock latency or CPU transition
> frequencies correctly, and I know I didn't implement CPU frequency
> matching correctly.
>
>
> Once again, thank you for looking into this! I'll continue
> researching and tinkering meanwhile - I've not given up on this saga
> yet :)
>
> Regards,
> Shane Synan
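
For anyone who wants to try the 'scaling_max_freq' workaround described
above, a minimal sketch follows. It assumes the usual cpufreq sysfs layout
under /sys/devices/system/cpu and that the cap is written in kHz (so
1000000 for 1.0 GHz). The function name cap_max_freq and its two optional
arguments are made up for illustration; they are not part of OpenWrt or of
the QA script.

```shell
#!/bin/sh
# Sketch: cap the maximum cpufreq frequency on every CPU (workaround, not a fix).
# cap_max_freq is a hypothetical helper name, not an OpenWrt tool.
# $1 = cpufreq sysfs root (assumed default: /sys/devices/system/cpu)
# $2 = cap in kHz, the unit cpufreq sysfs uses (assumed default: 1000000 = 1.0 GHz)
cap_max_freq() {
    root="${1:-/sys/devices/system/cpu}"
    khz="${2:-1000000}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_max_freq; do
        [ -w "$f" ] || continue   # skip an unmatched glob or a non-writable file
        echo "$khz" > "$f" && echo "capped $f at $khz kHz"
    done
}
```

Run as root on the router, calling cap_max_freq with no arguments applies
the 1.0 GHz cap to every core. The setting lives in sysfs, so it does not
survive a reboot and would have to be reapplied at boot.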
Can you repeat all of the above tests with:
- L2 scaling disabled
- L2 frequency set to max
- the 300 MHz CPU idle frequency never enabled?

It would also be good to check the divider clocks (capture the
regulator_summary output and save it to a file).

_______________________________________________
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel
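
Saving the regulator_summary output to a file, as requested above, could
look roughly like the sketch below. It assumes debugfs is mounted at
/sys/kernel/debug and that the kernel exposes regulator/regulator_summary
there. Including clk/clk_summary as well is my own assumption, based on the
mention of divider clocks. The function name dump_summaries and the default
output path are hypothetical.

```shell
#!/bin/sh
# Sketch: save debugfs summaries to one file for attaching to a report.
# dump_summaries is a hypothetical helper name.
# $1 = debugfs root (assumed default: /sys/kernel/debug)
# $2 = output file (hypothetical default: /tmp/ipq806x-summaries.txt)
dump_summaries() {
    dbg="${1:-/sys/kernel/debug}"
    out="${2:-/tmp/ipq806x-summaries.txt}"
    : > "$out"   # start with an empty output file
    # clk/clk_summary is included as an assumption; only regulator_summary
    # was asked for explicitly.
    for s in regulator/regulator_summary clk/clk_summary; do
        echo "==== $s ====" >> "$out"
        if [ -r "$dbg/$s" ]; then
            cat "$dbg/$s" >> "$out"
        else
            echo "(not readable: is debugfs mounted?)" >> "$out"
        fi
    done
}
```

Calling dump_summaries with no arguments right after boot and again shortly
before a test run would give two snapshots to compare.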