Hello,

We run approximately 10k IPQ40XX devices, and we have noticed that
every time we run "sysupgrade -n" we lose approximately 1% of the
routers in the process.
After further investigation I am fairly confident that the sysupgrade
process itself is not the culprit, so I put one test router into a
reboot loop.

This is what I do:

Boot the router in a fresh state after a newly installed image.
The image contains a reboot loop that consists of a shell script that
runs every minute.

The shell script tries to run a php-script which simply echoes "Hello
World". If the php-script exits normally then we reboot the router.

However, if the php-script exits abnormally then the router stops and
does nothing other than inform me that a bus error prevented php from
processing the hello world script.

With this loop running, the router reboots approximately 50 times
before it boots into a faulty state, in which I see bus errors when I
try to run php scripts, for example.
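
The per-minute check described above could look roughly like the
following. This is only a minimal sketch, not the script running on the
test router: the path /root/hello.php, the function name
check_and_reboot and the PHP_BIN, TEST_SCRIPT and REBOOT_CMD variables
are placeholders made up for illustration.

```shell
#!/bin/sh
# Sketch of the per-minute check. All names below are hypothetical
# placeholders, not the script actually running on the test router.
PHP_BIN="${PHP_BIN:-php}"
TEST_SCRIPT="${TEST_SCRIPT:-/root/hello.php}"
REBOOT_CMD="${REBOOT_CMD:-reboot}"

check_and_reboot() {
    # Run the hello-world script and remember how php exited.
    rc=0
    "$PHP_BIN" "$TEST_SCRIPT" >/dev/null 2>&1 || rc=$?

    if [ "$rc" -eq 0 ]; then
        # Normal exit: reboot to start the next cycle.
        $REBOOT_CMD
    else
        # Abnormal exit: stay up so the faulty state can be inspected.
        echo "php exited with status $rc; not rebooting"
        return 1
    fi
}
```

A cron entry that fires every minute would then call check_and_reboot.
The shell reports a process killed by a signal as 128 plus the signal
number, so a bus error should show up here as a status above 128.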


Looking into dmesg you can see errors such as:

[10985.209438] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e
[11045.218685] SQUASHFS error: xz decompression failed, data probably corrupt
[11045.218731] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e
[11105.228157] SQUASHFS error: xz decompression failed, data probably corrupt
[11105.228203] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e

or

[26218.687905] SQUASHFS error: Unable to read page, block 1b99a, size 10234
[26221.057472] SQUASHFS error: Unable to read data cache entry [1b99a]
[26221.057551] SQUASHFS error: Unable to read page, block 1b99a, size 10234
[26221.062926] SQUASHFS error: Unable to read data cache entry [1b99a]
[26221.069742] SQUASHFS error: Unable to read page, block 1b99a, size 10234
[26224.460239] SQUASHFS error: Unable to read data cache entry [1b99a]
[26224.460320] SQUASHFS error: Unable to read page, block 1b99a, size 10234

or

[62745.801178] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62773.347234] SQUASHFS error: xz decompression failed, data probably corrupt
[62773.347281] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62790.132661] SQUASHFS error: xz decompression failed, data probably corrupt
[62790.132706] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62790.216746] SQUASHFS error: xz decompression failed, data probably corrupt
[62790.216792] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62800.810525] SQUASHFS error: xz decompression failed, data probably corrupt
[62800.810570] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62828.336267] SQUASHFS error: xz decompression failed, data probably corrupt
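
In each of the excerpts above it is a single block that fails over and
over. A small filter like the one below can summarise that from the
dmesg output. It is only a sketch: the function name
count_failed_blocks is made up, and it assumes that the block number in
the "Unable to read page" lines is hexadecimal like the others.

```shell
#!/bin/sh
# Reads kernel log text on stdin and prints "<block> <count>" for each
# SquashFS block that failed to read. count_failed_blocks is a
# hypothetical helper name.
count_failed_blocks() {
    # First pattern: "squashfs_read_data failed to read block 0x...".
    # Second pattern: "Unable to read page, block ..., size ...";
    # the block is assumed to be hex, so "0x" is prepended.
    sed -n \
        -e 's/.*SQUASHFS error: squashfs_read_data failed to read block \(0x[0-9a-f]*\).*/\1/p' \
        -e 's/.*SQUASHFS error: Unable to read page, block \([0-9a-f]*\),.*/0x\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It would be used as "dmesg | count_failed_blocks".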



Now, you would assume that the squashfs partition is broken, but if
that were the case then a reboot should not help. It does:
rebooting the router after it has booted into this faulty state fixes
the issue.

So approximately 1-2% of my reboots make the router go into this faulty state.

I am clueless on how to investigate this issue further. For now my
workaround is a bash script that restarts the router should it
notice bus errors or I/O errors.
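
The workaround is roughly along these lines. This is a simplified
sketch written for plain sh rather than the actual script: it only
looks for SQUASHFS errors in dmesg, and the names has_squashfs_errors,
reboot_if_faulty and REBOOT_CMD are placeholders.

```shell
#!/bin/sh
# Hypothetical workaround sketch; REBOOT_CMD is a placeholder.
REBOOT_CMD="${REBOOT_CMD:-reboot}"

# Succeeds if the kernel log text on stdin contains a SquashFS error.
has_squashfs_errors() {
    grep -q 'SQUASHFS error'
}

# Reboots when the current kernel log shows SquashFS read errors.
reboot_if_faulty() {
    if dmesg | has_squashfs_errors; then
        $REBOOT_CMD
    fi
}
```

Calling reboot_if_faulty shortly after boot gets the router out of the
faulty state without manual intervention.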

Thanks


-- 
Ibrahim Tachijian

_______________________________________________
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel
