Hi Christophe,

Christophe Leroy <christophe.le...@csgroup.eu> wrote on Wed, 23 Jun 2021 11:41:46 +0200:

> On 19/06/2021 at 20:40, Miquel Raynal wrote:
> > Hi Christophe,
> >
> >>>> Now and then I'm using one of the latest kernels (Today is 5.13-rc6),
> >>>> and sometime in one of the 5.x releases, I started to get errors like:
> >>>>
> >>>> [ 5.098265] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [ 5.103859] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 60 bytes from PEB 99:59824, read only 60 bytes, retry
> >>>> [ 5.525843] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [ 5.531571] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [ 5.537490] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 3073 bytes from PEB 107:108976, read only 3073 bytes, retry
> >>>> [ 5.691121] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [ 5.696709] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [ 5.702426] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [ 5.708141] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [ 5.714103] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 3035 bytes from PEB 107:25144, read only 3035 bytes, retry
> >>>> [ 20.523689] random: crng init done
> >>>> [ 21.892130] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [ 21.897730] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 1394 bytes from PEB 116:75776, read only 1394 bytes, retry
> >>>>
> >>>> Most of the time, when the reading of the file fails, I just have to
> >>>> read it once more and it gets read without that error.
> >>>
> >>> It really looks like a regular bitflip happening "sometimes". Is this a
> >>> board which already had a life? What are the usage counters (UBI should
> >>> tell you this) compared to the official endurance of your chip (see the
> >>> datasheet)?
> >>
> >> The board had a peacefull life:
> >>
> >> UBI reports "ubi0: max/mean erase counter: 49/20, WL threshold: 4096"
> >
> > Mmmh. Indeed.
> >
> >>
> >> I have tried with half a dozen of boards and all have the issue.
> >>
> >>>> What am I supposed to do to avoid the ECC weakness warning at
> >>>> startup and to fix that ECC error issue ?
> >>>
> >>> I honestly don't think the errors come from the 5.1x kernels given the
> >>> above logs. If you flash back your old 4.14 I am pretty sure you'll
> >>> have the same errors at some point.
> >>
> >> I don't have any problem like that with 4.14 with any of the board.
> >>
> >> When booting a 4.14 kernel I don't get any problem on the same board.
> >>
> >
> > If you can reliably show that when returning to a 4.14 kernel the ECC
> > weakness disappears, then there is certainly something new. What driver
> > are you using? Maybe you can do a bisection?
>
> Using the GPIO driver, and the NAND chip is a HYNIX.
>
> I can say that the ECC weakness doesn't exist until v5.5 included. The
> weakness appears with v5.6.
>
> I have tried bisection between those two versions and I couldn't end up to a
> reliable result. The closer the v5.5 you go, the more difficult it is to
> reproduce the issue.
>
> So I looked at what was done around the places, and in fact that's mainly
> optimisation in the powerpc code. It seems that the more powerpc is
> optimised, the more the problem occurs.
>
> Looking at the GPIO nand driver, I saw that no-op gpio_nand_dosync()
> function. By adding a memory barrier in that function, the ECC weakness
> disappeared completely.

I see that the 'fix' in gpio_nand_dosync() has only been designed for
ARM platforms. Perhaps it would make sense to have a PPC variant here?

> Not sure what the final solution has to be.

Perhaps PowerPC maintainers can shed some light on these findings?

Thanks,
Miquèl
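
PS: To make the idea of a non-ARM variant a bit more concrete, here is a
rough userspace model, not a patch. It only illustrates where a barrier
would sit relative to the GPIO toggles and the data access. Everything
with a _sketch suffix is hypothetical (the struct, its fields, and both
helpers), and the C11 fence merely stands in for whatever kernel barrier
would turn out to be appropriate on PowerPC, which is exactly the open
question.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver state; not the real struct gpiomtd. */
struct gpiomtd_sketch {
	volatile uint8_t data_window;  /* models the memory-mapped NAND data window */
	volatile int nce_line;         /* models a GPIO control line (chip enable) */
	unsigned long sync_count;      /* counts barriers, for this demo only */
};

/*
 * Sketch of a non-ARM variant: instead of an empty function, issue a
 * full fence so that accesses made before the call are ordered before
 * accesses made after it. Assumption: a full barrier is what is needed;
 * the right kernel primitive is not settled in this thread.
 */
static inline void gpio_nand_dosync_sketch(struct gpiomtd_sketch *g)
{
	atomic_thread_fence(memory_order_seq_cst);
	g->sync_count++;
}

/* Models one cycle: toggle the GPIO, sync, write the data byte, sync. */
static inline uint8_t nand_write_byte_sketch(struct gpiomtd_sketch *g,
					     uint8_t byte)
{
	g->nce_line = 0;             /* assert chip enable (active low) */
	gpio_nand_dosync_sketch(g);  /* GPIO change ordered before the data write */
	g->data_window = byte;
	gpio_nand_dosync_sketch(g);  /* data write ordered before the next GPIO change */
	g->nce_line = 1;             /* release chip enable */
	return g->data_window;
}
```

Calling nand_write_byte_sketch() on a zero-initialised struct performs
two syncs per byte, one on each side of the data access. Whether both
are required on real hardware is something I cannot tell from the logs.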