On 5/23/21 10:21 AM, Ibrahim Tachijian wrote:
Is your firmware (sysupgrade) bigger than 16MB?
No, the sysupgrade file is currently 13MB.
So maybe it has to do with switching to 4-address-mode...
What is this exactly?
The 4-"byte"-address mode is used on 32 MiB flash chips.
We had similar issues with other 32 MiB devices in the past
which were fixed at some point by Felix Fietkau.
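For context on why the 16MB question matters: a 3-byte address can only reach 2^24 bytes = 16 MiB, so anything above that on a 32 MiB chip needs 4-byte addressing. A small sketch of that arithmetic follows. The function name is made up for illustration and is not taken from any driver.

```python
# 3 address bytes = 24 bits, so the highest reachable offset is 2**24 - 1.
THREE_BYTE_LIMIT = 1 << 24  # 16 MiB


def address_bytes_needed(flash_offset: int) -> int:
    """Return how many address bytes are needed to reach flash_offset.

    Hypothetical helper for illustration only.
    """
    if flash_offset < 0:
        raise ValueError("flash offset must be non-negative")
    return 3 if flash_offset < THREE_BYTE_LIMIT else 4


if __name__ == "__main__":
    for offset in (0x00FFFFFF, 0x01000000, 32 * 1024 * 1024 - 1):
        print(hex(offset), address_bytes_needed(offset))
```

A 13MB image does not by itself mean that every read stays below the 16 MiB boundary, since that depends on where the partitions sit on the chip.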
My guess is that the error already happens when reading the flash.
At least we know that the flash is not being written to incorrectly
since after a reboot the flash is intact and does not produce any
errors. It's simply random whether the system boots into this "faulty
state" or not (happens approx 1-2% of the time).
Does anyone know how I can re-read the squashfs partition and verify
its integrity while the system is booted, to see whether I encounter
the squashfs errors?
I'm really at a loss here - no idea where to even look into diagnosing
the issue.
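One possible way to do such a check, sketched under assumptions: read the whole partition several times, hash each pass, and compare the digests with each other (and ideally with a digest taken during a known-good boot). The device path is a placeholder, for example /dev/mtdblockN for whichever partition holds the rootfs on the device in question. This also assumes Python is available on the router; the same idea works with any hashing tool.

```python
import hashlib


def hash_once(path, chunk_size=64 * 1024):
    """Read the file or block device at path once and return its SHA-256."""
    digest = hashlib.sha256()
    with open(path, "rb") as device:
        while True:
            chunk = device.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def hash_repeatedly(path, passes=3):
    """Hash the same device several times; a read error raises OSError."""
    return [hash_once(path) for _ in range(passes)]


def reads_are_stable(digests):
    """True if every pass produced the same digest."""
    return len(set(digests)) == 1
```

Note that repeated reads may be served from the page cache rather than from the flash, so it is probably worth dropping caches between passes (writing 3 to /proc/sys/vm/drop_caches) before trusting a "stable" result.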
I guess the reset line of the flash chip is not held long enough, so
the chip is left in an unclean state. I think the reset duration during
booting needs to be increased. But I don't know the code and can't point
you there. It's just a guess...
On Fri, May 21, 2021 at 6:16 PM Vincent Wiemann
<vincent.wiem...@ironai.com> wrote:
On 5/21/21 3:58 PM, Koen Vandeputte wrote:
On 21.05.21 13:19, Ibrahim Tachijian wrote:
Hello,
We use approximately 10k IPQ40XX devices and we have noticed that
every time we run "sysupgrade -n" we lose approximately 1% of the
routers in the process.
After further investigation I'm almost confident that the sysupgrade
process is not the culprit, so I put one test router into a reboot
loop.
This is what I do:
Boot the router in a fresh state after a newly installed image.
The image contains a reboot loop that consists of a shell script that
runs every minute.
The shell script tries to run a php-script which simply echoes "Hello
World". If the php-script exits normally then we reboot the router.
However, if the php-script exits abnormally then the router stops and
does nothing other than inform me that there was a bus error which
prevented php from processing the hello world script.
When this process runs, the router reboots approximately 50 times
before it boots into a faulty state in which I see bus errors when I
try to run php scripts, for example.
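The decision logic of the test loop described above can be sketched roughly as follows. This is not the actual script from the test router (which is a shell script). The probe command is a placeholder, and the sketch only returns the action instead of rebooting anything.

```python
import subprocess
import sys

# Hypothetical probe command; the real setup runs a php "Hello World" script.
PROBE_CMD = ["php", "hello.php"]


def probe_ok(cmd, expected=b"Hello World", timeout=30):
    """Run the probe; True only if it exits with 0 and prints the expected text.

    A process killed by a signal (such as a bus error) has a negative return
    code in subprocess, so it is treated as a failure here.
    """
    try:
        result = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and expected in result.stdout


def next_action(ok):
    """Healthy boot: reboot and try again. Faulty boot: stop for inspection."""
    return "reboot" if ok else "halt"


if __name__ == "__main__":
    print(next_action(probe_ok(PROBE_CMD)))
```

Stopping on failure instead of rebooting keeps the router in the faulty state, so dmesg can still be inspected.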
Looking into dmesg you can see errors such as:
[10985.209438] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e
[11045.218685] SQUASHFS error: xz decompression failed, data probably corrupt
[11045.218731] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e
[11105.228157] SQUASHFS error: xz decompression failed, data probably corrupt
[11105.228203] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e
or
[26218.687905] SQUASHFS error: Unable to read page, block 1b99a, size 10234
[26221.057472] SQUASHFS error: Unable to read data cache entry [1b99a]
[26221.057551] SQUASHFS error: Unable to read page, block 1b99a, size 10234
[26221.062926] SQUASHFS error: Unable to read data cache entry [1b99a]
[26221.069742] SQUASHFS error: Unable to read page, block 1b99a, size 10234
[26224.460239] SQUASHFS error: Unable to read data cache entry [1b99a]
[26224.460320] SQUASHFS error: Unable to read page, block 1b99a, size 10234
or
[62745.801178] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62773.347234] SQUASHFS error: xz decompression failed, data probably corrupt
[62773.347281] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62790.132661] SQUASHFS error: xz decompression failed, data probably corrupt
[62790.132706] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62790.216746] SQUASHFS error: xz decompression failed, data probably corrupt
[62790.216792] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62800.810525] SQUASHFS error: xz decompression failed, data probably corrupt
[62800.810570] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
[62828.336267] SQUASHFS error: xz decompression failed, data probably corrupt
Now, you would assume that the squashfs-partition is broken - but if
that were the case then a reboot should not help. It does.
Rebooting the router after it boots in this faulty state fixes the issue.
So approximately 1-2% of my reboots make the router go into this
faulty state.
I am clueless on how to further investigate this issue. For now my
workaround is restarting the router via a bash script should it
notice there are bus-errors or i/o errors.
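The detection half of such a workaround could look roughly like the sketch below (the real workaround is a bash script; this is only an illustration). It scans kernel log text for the SQUASHFS messages shown earlier and counts how often each block is reported. The block numbers are assumed to be hexadecimal, and the function names are made up for this example.

```python
import re
from collections import Counter

# Patterns taken from the dmesg excerpts in this thread; \s+ tolerates
# wrapped lines. Block numbers are assumed to be hexadecimal.
BLOCK_PATTERNS = (
    re.compile(r"failed to read block\s+(0x[0-9a-fA-F]+)"),
    re.compile(r"Unable to read page, block\s+([0-9a-fA-F]+)"),
)


def squashfs_error_lines(log_text):
    """Return every log line that carries a SQUASHFS error message."""
    return [line for line in log_text.splitlines() if "SQUASHFS error:" in line]


def failing_blocks(log_text):
    """Count how often each block number appears in read-failure messages."""
    counts = Counter()
    for pattern in BLOCK_PATTERNS:
        for match in pattern.finditer(log_text):
            counts[int(match.group(1), 16)] += 1
    return counts


def should_reboot(log_text):
    """Workaround policy: reboot as soon as any SQUASHFS error is logged."""
    return bool(squashfs_error_lines(log_text))


SAMPLE = (
    "[10985.209438] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e\n"
    "[11045.218685] SQUASHFS error: xz decompression failed, data probably corrupt\n"
    "[11045.218731] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e\n"
)

if __name__ == "__main__":
    print(should_reboot(SAMPLE), dict(failing_blocks(SAMPLE)))
```

Logging the per-block counts before rebooting would also show whether the same block fails within one boot, as it does in the excerpts above.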
Thanks
In the next kernel bump, the following patch is also present:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.10.38&id=2ed1d90162a0c0683ecbe0c4802187fa22d641c3
I think it's worth a shot to retry the tests once it's bumped.
Koen
My guess is that the error already happens when reading the flash.
Is your firmware (sysupgrade) bigger than 16MB?
So maybe it has to do with switching to 4-address-mode...
Best,
Vincent
_______________________________________________
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel
--
Ibrahim Tachijian