On 5/21/21 3:58 PM, Koen Vandeputte wrote:
On 21.05.21 13:19, Ibrahim Tachijian wrote:
Hello,
We use approximately 10k IPQ40XX devices, and we have noticed that
every time we run "sysupgrade -n" we lose approximately 1% of the
routers in the process.
After further investigation, I'm almost certain that the sysupgrade
process itself is not the culprit, so I put one test router into a
reboot loop.
This is what I do:
Boot the router in a fresh state after a newly installed image.
The image contains a reboot loop: a shell script that runs every
minute.
The shell script tries to run a PHP script that simply echoes "Hello
World". If the PHP script exits normally, we reboot the router.
However, if it exits abnormally, the router stops and does nothing
other than inform me that a bus error prevented PHP from processing
the Hello World script.
With this setup, the router reboots approximately 50 times before it
boots into a faulty state in which, for example, I see bus errors
when I try to run PHP scripts.
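The test loop described above could look roughly like the following POSIX sh sketch (BusyBox ash should run it too). It is not the script actually used in this thread: the PHP binary name `php-cli`, the paths `/root/hello.php` and `/root/reboot-loop.fail`, and the `REBOOT_CMD` override are all hypothetical. It reboots only when PHP prints "Hello World" and exits with status 0; otherwise it saves the exit status, the PHP output and the tail of the kernel log, and stops rebooting.

```shell
#!/bin/sh
# Sketch of a once-a-minute reboot-loop test (hypothetical names and paths).
# Every setting can be overridden through the environment.
PHP_BIN="${PHP_BIN:-php-cli}"                    # assumed PHP CLI binary name
HELLO_PHP="${HELLO_PHP:-/root/hello.php}"        # assumed test script path
FAIL_LOG="${FAIL_LOG:-/root/reboot-loop.fail}"   # written on the first failure
REBOOT_CMD="${REBOOT_CMD:-reboot}"               # overridable for dry runs

# One iteration: reboot if PHP works, otherwise record the failure and stop.
run_once() {
    # A previous iteration already failed: leave the router in its faulty state.
    [ -e "$FAIL_LOG" ] && return 1

    # Capture PHP's output and exit status (a process killed by SIGBUS
    # typically reports 128 + 7 = 135 on ARM Linux).
    out=$("$PHP_BIN" "$HELLO_PHP" 2>&1) && status=0 || status=$?

    if [ "$status" -eq 0 ] && [ "$out" = "Hello World" ]; then
        $REBOOT_CMD
        return 0
    fi

    # Keep enough context to analyse the failure later.
    {
        echo "php exit status: $status"
        echo "php output: $out"
        dmesg | tail -n 50
    } > "$FAIL_LOG" 2>&1
    return 1
}

# Only run automatically when invoked as reboot-loop.sh (e.g. from cron).
case "$0" in
    *reboot-loop.sh) run_once ;;
esac
```

Installed as `reboot-loop.sh` with a crontab entry such as `* * * * * /root/reboot-loop.sh`, it runs one iteration per minute; the marker file is what keeps a failed router from rebooting again, so it should live on persistent storage.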
Looking at dmesg, you can see errors such as:
[10985.209438] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e
[11045.218685] SQUASHFS error: xz decompression failed, data probably corrupt
[11045.218731] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e
[11105.228157] SQUASHFS error: xz decompression failed, data probably corrupt
[11105.228203] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e
or
[26218.687905] SQUASHFS error: Unable to read page, block 1b99a, size 10234
[26221.057472] SQUASHFS error: Unable to read data cache entry [1b99a]
[26221.057551] SQUASHFS error: Unable to read page, block 1b99a, size 10234
[26221.062926] SQUASHFS error: Unable to read data cache entry [1b99a]
[26221.069742] SQUASHFS error: Unable to read page, block 1b99a, size 10234
[26224.460239] SQUASHFS error: Unable to read data cache entry [1b99a]
[26224.460320] SQUASHFS error: Unable to read page, block 1b99a, size 10234
or
[62745.801178] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62773.347234] SQUASHFS error: xz decompression failed, data probably corrupt
[62773.347281] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62790.132661] SQUASHFS error: xz decompression failed, data probably corrupt
[62790.132706] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62790.216746] SQUASHFS error: xz decompression failed, data probably corrupt
[62790.216792] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62800.810525] SQUASHFS error: xz decompression failed, data probably corrupt
[62800.810570] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62828.336267] SQUASHFS error: xz decompression failed, data probably corrupt
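To check whether the errors keep hitting the same block within one boot (as in the logs above, where 0x3a803e and 0x732ae2 repeat), the block values can be pulled out of the kernel log and counted. The helper below is a hypothetical sketch, not part of OpenWrt; its patterns assume the exact message wording quoted above, and it prints block values in hex as the kernel reported them (the "Unable to read page" variant gets a `0x` prefix added so both forms line up).

```shell
#!/bin/sh
# squashfs_bad_blocks (hypothetical helper): read kernel log text on stdin and
# print each block value named in a SQUASHFS read error, followed by how often
# it appeared. The sed patterns match the message wording quoted above.
squashfs_bad_blocks() {
    sed -n \
        -e 's/.*squashfs_read_data failed to read block \(0x[0-9a-f]*\).*/\1/p' \
        -e 's/.*Unable to read page, block \([0-9a-f]*\),.*/0x\1/p' |
        awk '{ count[$1]++ } END { for (b in count) print b, count[b] }' |
        sort
}
```

On the router this would be used as `dmesg | squashfs_bad_blocks`; a single block with a growing count points at one repeatedly failing read rather than widespread corruption.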
Now, you would assume that the squashfs partition is broken, but if
that were the case, a reboot should not help. It does: rebooting the
router after it boots into this faulty state fixes the issue.
So approximately 1-2% of my reboots make the router go into this
faulty state.
I have no idea how to investigate this issue further. For now, my
workaround is to restart the router via a bash script whenever the
script notices bus errors or I/O errors.
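A restart-on-error workaround like the one described could be sketched as a small check run periodically, for example from cron. This is an illustrative sketch, not the poster's script: it only looks for the `SQUASHFS error:` prefix seen in the logs above (not for bus errors in user space), and the syslog tag `fs-watchdog` and the `REBOOT_CMD` override are made up. It sticks to `grep`, `dmesg` and `logger`, which BusyBox provides.

```shell
#!/bin/sh
# Hypothetical watchdog: reboot when the kernel log shows squashfs read errors.
ERR_PATTERN='SQUASHFS error:'        # prefix of the messages quoted in this thread
REBOOT_CMD="${REBOOT_CMD:-reboot}"   # overridable for dry runs

# Succeeds if the kernel log text on stdin contains a squashfs error.
log_has_fs_errors() {
    grep -q "$ERR_PATTERN"
}

# One check: log the reason to syslog and reboot if errors are present.
watchdog_once() {
    if dmesg | log_has_fs_errors; then
        logger -t fs-watchdog "squashfs errors in kernel log, rebooting"
        $REBOOT_CMD
    fi
}

# Only run automatically when invoked as fs-watchdog.sh.
case "$0" in
    *fs-watchdog.sh) watchdog_once ;;
esac
```

Since the kernel log starts empty after each boot, a reboot clears the condition in the transient case reported here. If the flash contents were genuinely corrupt, this would reboot forever, so a real deployment might want to cap the number of consecutive restarts.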
Thanks
The next kernel bump also includes the following patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.10.38&id=2ed1d90162a0c0683ecbe0c4802187fa22d641c3
I think it's worth retrying the tests once the kernel has been bumped.
Koen
My guess is that the error already occurs when reading the flash.
Is your firmware (sysupgrade image) bigger than 16MB?
If so, maybe it has to do with switching to 4-byte address mode...
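For context on the 16MB figure: SPI NOR flash commands with 3-byte addresses can reach at most 2^24 bytes = 16 MiB, so data stored above that offset needs 4-byte addressing (or an equivalent bank-switching scheme). What matters is the absolute flash address, i.e. where the partition starts plus the offset inside it, not only the image size. The helpers below are a hypothetical sketch, and the offsets used in the usage note are made-up numbers, not the layout of any particular IPQ40XX board.

```shell
#!/bin/sh
# 3-byte flash addresses cover 2^24 bytes = 16 MiB.
LIMIT_3BYTE=$((1 << 24))   # 16777216

# beyond_3byte_window PART_OFFSET OFFSET
# Succeeds if flash address PART_OFFSET + OFFSET is at or above 16 MiB.
# Both arguments may be decimal or 0x-prefixed hex.
beyond_3byte_window() {
    addr=$(( $1 + $2 ))
    [ "$addr" -ge "$LIMIT_3BYTE" ]
}

# image_ends_beyond_3byte_window PART_OFFSET FILE
# Succeeds if FILE, written starting at PART_OFFSET, extends past 16 MiB.
image_ends_beyond_3byte_window() {
    size=$(wc -c < "$2") || return 2
    # The last byte of the image sits at PART_OFFSET + size - 1.
    beyond_3byte_window "$1" $(( size - 1 ))
}
```

For example, assuming the reported block values are byte offsets from the start of a squashfs image that begins at a hypothetical flash offset of 0xc00000, `beyond_3byte_window 0xc00000 0x732ae2` succeeds while `beyond_3byte_window 0xc00000 0x3a803e` does not, so only the second error cluster would lie above the 16 MiB boundary in that layout.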
Best,
Vincent
_______________________________________________
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel