Hi Christophe,

Christophe Leroy <christophe.le...@csgroup.eu> wrote on Wed, 23 Jun
2021 11:41:46 +0200:

> On 19/06/2021 at 20:40, Miquel Raynal wrote:
> > Hi Christophe,
> >   
> >>>> Now and then I'm using one of the latest kernels (today it is 5.13-rc6),
> >>>> and sometime during the 5.x releases I started to get errors like:
> >>>>
> >>>> [    5.098265] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.103859] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 60 bytes from PEB 99:59824, read only 60 bytes, retry
> >>>> [    5.525843] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.531571] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.537490] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 3073 bytes from PEB 107:108976, read only 3073 bytes, retry
> >>>> [    5.691121] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.696709] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.702426] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.708141] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.714103] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 3035 bytes from PEB 107:25144, read only 3035 bytes, retry
> >>>> [   20.523689] random: crng init done
> >>>> [   21.892130] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [   21.897730] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 1394 bytes from PEB 116:75776, read only 1394 bytes, retry
> >>>>
> >>>> Most of the time, when the reading of the file fails, I just have to 
> >>>> read it once more and it gets read without that error.  
> >>>
> >>> It really looks like a regular bitflip happening "sometimes". Is this a
> >>> board which already had a life? What are the usage counters (UBI should
> >>> tell you this) compared to the official endurance of your chip (see the
> >>> datasheet)?  
> >>
> >> The board had a peaceful life:
> >>
> >> UBI reports "ubi0: max/mean erase counter: 49/20, WL threshold: 4096"  
> > 
> > Mmmh. Indeed.
> >   
> >>
> >> I have tried half a dozen boards and all of them have the issue.
> >>  
> >>>> What am I supposed to do to avoid the ECC weakness warning at
> >>>> startup and to fix that ECC error issue?
> >>>
> >>> I honestly don't think the errors come from the 5.1x kernels given the
> >>> above logs. If you flash back your old 4.14 I am pretty sure you'll
> >>> have the same errors at some point.  
> >>
> >> I don't have any problem like that with 4.14 on any of the boards.
> >>
> >> When booting a 4.14 kernel I don't get any problem on the same board.
> >>  
> > 
> > If you can reliably show that when returning to a 4.14 kernel the ECC
> > weakness disappears, then there is certainly something new. What driver
> > are you using? Maybe you can do a bisection?  
> 
> I'm using the GPIO NAND driver, and the NAND chip is a Hynix.
> 
> I can say that the ECC weakness does not exist up to and including v5.5.
> It appears with v5.6.
> 
> I have tried bisecting between those two versions, but I couldn't reach a
> reliable result. The closer to v5.5 you go, the more difficult it is to
> reproduce the issue.
> 
> So I looked at what changed in that range, and in fact it is mainly
> optimisation of the powerpc code. It seems that the more optimised the
> powerpc code is, the more often the problem occurs.
> 
> Looking at the GPIO NAND driver, I saw that gpio_nand_dosync() is a no-op
> function. After adding a memory barrier in that function, the ECC weakness
> disappeared completely.

I see that the 'fix' in gpio_nand_dosync() has only been designed for
ARM platforms; perhaps it would make sense to have a PPC variant here?

> I'm not sure what the final solution should be.

Perhaps the PowerPC maintainers can shed some light on these findings?

Thanks,
Miquèl
