On 8/24/21 7:21 PM, Shane Synan wrote:
> The fix hasn't been found, but progress has been made!
>
> After further testing, I think I've found a way to recreate this issue
> with just the router itself, no external USB HDD, no Déjà Dup backup
> over SFTP, and possibly no extra changes beyond a stock NBG6817
> OpenWRT build (not confirmed as this router runs my home network,
> including SQM QoS, VLANs with another WiFi AP, etc).

So far, I've attempted all three suggested fixes, but I had trouble
implementing one and I'm unsure if I tried the other two correctly.
Additionally, pinning to "performance" for 1.75 GHz does not solve the
issue either - more on that near the end.

I've put all of my commits into one branch for easier reference:
https://github.com/openwrt/openwrt/compare/master...digitalcircuit:ft-fix-ipq8065-reset

And I've used my simplified automatic QA script for verification:
https://github.com/digitalcircuit/openwrt-ipq806x-qa-cpu-reset#readme

(In theory, anyone should be able to reproduce the issue with this
script on a stock OpenWRT build. I'll still do testing with the Déjà
Dup SFTP backup workload.)

Suggestions and results in order of attempt:
(Ignore "ipq8065: force CPUs to share DVFS scaling", wrong method.)

1. Raising clock latency (commits with "clock latency" in subject)

I've tried raising the clock-latency-ns in the ipq8065 DTS by 1000000
nanoseconds, a deliberately excessive value in the hopes of it being
enough to notice any issues. I've tried this for...

* 1.4 GHz and 1.75 GHz
  (ipq8065: raise 1.4 & 1.75 GHz clock latency)
* All CPU frequencies
  (previous + ipq8065: raise all clock latency)
* All CPU frequencies and L2 cache latency
  (previous + ipq8065: raise L2 cache, CPU core clock latency)

Unfortunately, as noted in the revert commit, this seemed to have no
impact on the results from the QA script. I don't know if I've
correctly implemented this suggestion.

QA script log on b1870c2 (.tar.xz due to 12.2 MiB uncompressed size):
https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20case1%20-%202021-08-30%2022-37-50%20-%20r17395-b1870c2530-branch-ft-fix-ipq8065-reset%20-%20date%20segfault%2C%20reboot%20-%20public.tar.xz

2. Run both cores at the same frequency (most promising?)

I tried to do this (ipq806x: Force CPU cores to share frequency), but I
think I didn't modify the cpufreq driver in the correct way. As noted
in the revert commit, this didn't appear to force CPUs to share
frequency, whether manually using the performance governor or
periodically observing the ondemand governor - the CPU cores were at
different frequencies.

I'll need help figuring out how to implement this in the cpufreq driver
correctly. It seems promising given that in the past, dual-core bursty
workloads didn't seem to trigger the crash.

NOTE: Before diving into implementing this, read the conclusion below
as I've noticed reboots happen without changing CPU frequency as well.

I'm also not sure how to debug the cpufreq driver in general. With
dynamic debugging, I can turn on messages about the cpufreq governor,
but I'm not sure of the right way to add dynamic debugging print
messages to the cpufreq driver.

Example of dynamic debugging:

    echo "file drivers/cpufreq/* =p" > /sys/kernel/debug/dynamic_debug/control

QA script log on 1fdabd9 (.tar.xz due to 4.4 MiB uncompressed size):
https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20case1%20-%202021-08-31%2020-38-04%20-%20r17397-1fdabd95db-branch-ft-fix-ipq8065-reset%20-%20date%20segfault%2C%20reboot%20-%20public.tar.xz

3. Add forced frequency transitions between 1.0 GHz and 1.75 GHz

I'm not sure if I implemented this correctly. I made a first attempt
(ipq806x: Add transitions to 1.0 <> 1.4 <> 1.75 GHz), but if the
frequency transitions happen, they're too fast to observe. And as
noted above, I'm not yet sure of the right way to add dynamic debugging
messages.

Running the QA script in "case1" (toggle 800 MHz to 1.75 GHz) still
crashes.

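On transitions being too fast to observe: if the kernel is built with
CONFIG_CPU_FREQ_STAT (an assumption, not checked against the ipq806x
config), the cpufreq statistics in sysfs might show whether any
transitions were counted at all, without having to catch them by
polling. A rough sketch follows; show_cpufreq_stats is a made-up
helper name and the optional directory argument only exists so the
function can be tried against a fake tree.

```shell
#!/bin/sh
# Sketch: print per-CPU cpufreq transition counters and time per frequency.
# Assumes CONFIG_CPU_FREQ_STAT is enabled (not verified for ipq806x builds).
# show_cpufreq_stats is a hypothetical helper name, not part of OpenWRT.

show_cpufreq_stats() {
    # Optional first argument overrides the sysfs CPU directory (for testing).
    cpu_root="${1:-/sys/devices/system/cpu}"
    for stats in "$cpu_root"/cpu[0-9]*/cpufreq/stats; do
        # Skip when the glob did not match or statistics are unavailable.
        [ -d "$stats" ] || continue
        # Reduce ".../cpu0/cpufreq/stats" to "cpu0".
        cpu="${stats#"$cpu_root"/}"
        cpu="${cpu%%/*}"
        echo "$cpu total_trans=$(cat "$stats/total_trans")"
        # Each time_in_state line is "<frequency in kHz> <time spent>".
        sed "s/^/$cpu time_in_state /" "$stats/time_in_state"
    done
}
```

Calling show_cpufreq_stats before and after a QA script run and
comparing total_trans would at least indicate whether the number of
transitions changed; it says nothing about which intermediate
frequencies the driver stepped through internally.
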
QA script log on 52f4f77 (.tar.xz due to 471.8 KiB uncompressed size):
https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20case1%20-%202021-09-07%2019-58-07%20-%20r17399-52f4f77518-branch-ft-fix-ipq8065-reset%20-%20reboot%20-%20public.tar.xz

Separately, I updated the QA script to add a "ramp1" case which
smoothly ramps the CPU core frequency up/down from 600 MHz to 1.75 GHz,
stopping at every frequency in between. Unfortunately, this still
crashes.

Interestingly, the crash again happens when CPU core frequencies are
distant from each other (1.75 GHz and 800 MHz). This lends credence to
the idea of locking CPU frequencies together for 1.4 and 1.75 GHz.

QA script log on 11e9380 (.tar.xz due to 5.8 MiB uncompressed size):
(11e938030d is from https://github.com/openwrt/openwrt/pull/4464 )
https://zorro.casa/sync/Hosting/Utilities/Development/OpenWRT/mailing-list/ipq806x_%20backport%20cpufreq%20changes%20to%205.4/debug-cpufreq%20-%20clock%20default%20test%20ramp1%20-%202021-09-01%2000-53-25%20-%20r17372-11e938030d-branch-fix-ipq8065-dts-opp-order%20-%20date%20segfault%2C%20reboot%20-%20public.tar.xz

Observations and thoughts:

My best guess involves having the CPUs match frequency at 1.4 GHz and
above.

However, using the "performance" governor at 1.75 GHz does NOT allow a
full Déjà Dup backup to successfully complete - the router still
hard-reboots after up to 8 hours of the intermittent single-core
workload with both core frequencies pinned to 1.75 GHz.

There may be a combination of issues - CPU frequency shifting and CPU
voltage, perhaps?

I may need to revisit raising the CPU voltage. I had increased it by
20000 microvolts in the past for all frequencies without success, but
perhaps it needs to be raised even higher for 1.4 and/or 1.75 GHz..?

Context for CPU voltage tests:
https://github.com/openwrt/openwrt/compare/openwrt-21.02...digitalcircuit:openwrt-21.02-cpufreq-dtsivolt-cache-fix-opp-order
(It's in the last two commits; the earlier commits are backporting.)

I'll need to expand the QA test script to provide a simulated
single-core bursty workload to see if I can make this aspect of the
issue easier for others to reproduce. I planned to do this before
sending this email, but other plans got in the way.

A wild guess: instability triggered by having one CPU core draw power
for a single-core workload while the other CPU core is idle..?

With max frequency set to 1.0 GHz, I haven't observed instability
jumping between 600 MHz and 1.0 GHz.

Limiting 'scaling_max_freq' to 1.0 GHz is a slow workaround on stock
firmware for anyone impacted by this in the way I am (e.g. not being
able to run a complete backup). This is NOT a fix, just sharing in
case others would have use for a workaround meanwhile.

I'm unsure if I implemented clock latency or CPU frequency transitions
correctly, and I know I didn't implement CPU frequency matching
correctly.

Once again, thank you for looking into this! I'll continue researching
and tinkering meanwhile - I've not given up on this saga yet :)

Regards,
Shane Synan

_______________________________________________
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel
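
For anyone wanting to try the 'scaling_max_freq' workaround mentioned
above, it could look roughly like the sketch below. This is not taken
from the QA script; limit_max_freq is a made-up helper name, the
optional directory argument is only there for trying it against a fake
tree, and 1000000 is 1.0 GHz expressed in kHz, the unit cpufreq uses
in sysfs.

```shell
#!/bin/sh
# Sketch: cap every CPU core's maximum frequency as a temporary workaround.
# limit_max_freq is a hypothetical helper name, not part of OpenWRT.
# This is NOT a fix for the reboots, only a way to avoid the higher steps.

limit_max_freq() {
    # First argument: maximum frequency in kHz (default 1000000 = 1.0 GHz).
    max_khz="${1:-1000000}"
    # Optional second argument overrides the sysfs CPU directory (for testing).
    cpu_root="${2:-/sys/devices/system/cpu}"
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_max_freq; do
        # Skip when the glob did not match or the file is not writable.
        [ -w "$f" ] || continue
        echo "$max_khz" > "$f"
    done
}
```

Running limit_max_freq with no arguments as root would apply the
1.0 GHz cap to every core. The setting does not survive a reboot, so
it would have to be reapplied at startup.
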