On 5/21/21 3:58 PM, Koen Vandeputte wrote:

On 21.05.21 13:19, Ibrahim Tachijian wrote:
Hello,

We use approximately 10k IPQ40XX devices and we have noticed that
every time we run "sysupgrade -n" we lose approximately 1% of the
routers in the process.
After further investigation I'm almost confident that it is not the
sysupgrade process that is the culprit - so what I did was that I put
one test router into a reboot loop.

This is what I do:

Boot the router in a fresh state after a newly installed image.
The image contains a reboot loop that consists of a shell script that
runs every minute.

The shell script tries to run a php-script which simply echoes "Hello
World". If the php-script exits normally then we reboot the router.

However, if the php-script exits abnormally then the router stops and
does nothing other than inform me that there was a bus-error which
made php unable to process the hello world script.
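
A minimal sketch of one step of the loop described above. The function name, the php invocation and the script path are hypothetical and not taken from the actual test image; the two commands are kept in variables so the logic can be dry-run without rebooting anything.

```shell
#!/bin/sh
# Hypothetical reboot-loop step. PHP_CMD, REBOOT_CMD and the script path
# are illustrative assumptions, not the commands used on the real router.
PHP_CMD="${PHP_CMD:-php-cli /root/hello.php}"
REBOOT_CMD="${REBOOT_CMD:-reboot}"

reboot_loop_step() {
    # Run the hello-world script; discard its output, keep only the status.
    if $PHP_CMD >/dev/null 2>&1; then
        # Normal exit: this boot looks healthy, so go around the loop again.
        $REBOOT_CMD
        return 0
    fi
    # Abnormal exit (for example, killed by SIGBUS, which a shell on
    # Linux/ARM typically reports as status 135): stay up for inspection.
    echo "php exited abnormally; leaving the router up" >&2
    return 1
}
```

Calling reboot_loop_step once a minute (from cron, for instance) would reproduce the behaviour described: the router keeps rebooting until the first boot on which php fails.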

When this process runs the router reboots approximately 50 times
before it boots into a state which is faulty where I see bus-errors
when I try to run php scripts for example.


Looking into dmesg you can see some errors such as,

[10985.209438] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e
[11045.218685] SQUASHFS error: xz decompression failed, data probably corrupt
[11045.218731] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e
[11105.228157] SQUASHFS error: xz decompression failed, data probably corrupt
[11105.228203] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e

or

[26218.687905] SQUASHFS error: Unable to read page, block 1b99a, size 10234
[26221.057472] SQUASHFS error: Unable to read data cache entry [1b99a]
[26221.057551] SQUASHFS error: Unable to read page, block 1b99a, size 10234
[26221.062926] SQUASHFS error: Unable to read data cache entry [1b99a]
[26221.069742] SQUASHFS error: Unable to read page, block 1b99a, size 10234
[26224.460239] SQUASHFS error: Unable to read data cache entry [1b99a]
[26224.460320] SQUASHFS error: Unable to read page, block 1b99a, size 10234

or

[62745.801178] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62773.347234] SQUASHFS error: xz decompression failed, data probably corrupt
[62773.347281] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62790.132661] SQUASHFS error: xz decompression failed, data probably corrupt
[62790.132706] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62790.216746] SQUASHFS error: xz decompression failed, data probably corrupt
[62790.216792] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62800.810525] SQUASHFS error: xz decompression failed, data probably corrupt
[62800.810570] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62828.336267] SQUASHFS error: xz decompression failed, data probably corrupt



Now, you would assume that the squashfs partition is broken - but if
this were the case then a reboot should not help. It does.
Rebooting the router after it has booted into this faulty state fixes the issue.

So approximately 1-2% of my reboots make the router go into this faulty state.

I am clueless on how to further investigate this issue. For now my
workaround is restarting the router via a bash script should it
notice there are bus-errors or i/o errors.
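
A rough sketch of what such a workaround might look like, not the actual script in use. The function names are hypothetical, and of the patterns only "SQUASHFS error" comes from the dmesg output quoted above; "Bus error" and "I/O error" are assumptions about how the other failures would show up in a log.

```shell
#!/bin/sh
# Hypothetical watchdog for the workaround. Only "SQUASHFS error" is taken
# from the quoted dmesg output; the other two patterns are assumptions.
ERROR_PATTERN='SQUASHFS error|Bus error|I/O error'

# Reads log text on stdin; succeeds if any line matches ERROR_PATTERN.
has_flash_errors() {
    grep -Eq "$ERROR_PATTERN"
}

# Reboots only when the kernel log contains one of the patterns.
reboot_if_faulty() {
    if dmesg | has_flash_errors; then
        reboot
    fi
}
```

Keeping the matching in has_flash_errors, separate from the reboot, lets the patterns be checked against saved logs before the script is trusted on a live router.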

Thanks

The next kernel bump also includes the following patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.10.38&id=2ed1d90162a0c0683ecbe0c4802187fa22d641c3

I think it's worth a shot to retry the tests once it's bumped.

Koen


My guess is that the error already happens when reading the flash.
Is your firmware (sysupgrade) bigger than 16MB?
So maybe it has to do with switching to 4-address-mode...
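
For reference, the 16MB figure follows from the address width: a 3-byte SPI NOR address covers 2^24 = 16777216 bytes, so any flash offset at or beyond that needs 4-byte addressing. A small sketch of that check follows; the function name is hypothetical, and it tests a single byte offset rather than a whole image.

```shell
#!/bin/sh
# A 3-byte (24-bit) address can reach offsets 0 .. 2^24 - 1.
LIMIT_3BYTE=$((1 << 24))   # 16777216 bytes = 16 MiB

# Succeeds if the given byte offset cannot be expressed in 3 address bytes.
# Hypothetical helper for illustration only.
beyond_3byte_range() {
    [ "$1" -ge "$LIMIT_3BYTE" ]
}
```

To apply it to an image, pass the offset of its last byte on the flash chip (partition start offset plus image size minus one); the image size alone only tells the whole story if the image starts at offset 0.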

Best,

Vincent

_______________________________________________
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel
