On 05/15/2013 03:12:21 AM, Anthony Foiani wrote:
> At this point, /dev/sda is pretty much unusable, and I have to do at
> least a reboot to recover. (I don't recall if I had to do a power
> cycle at this point, though.)
>
> I suspect that it is related to erratum eLBC-A001 (from MPC8315E Chip
> Errata, Rev. 3, 09/2011):
>
>   eLBC-A001:
>
>   Simultaneous FCM and GPCM or UPM operation may erroneously trigger
>   bus monitor timeout
>
>   Description: Devices: MPC8315E, MPC8314E
>
>   When the FCM is in the middle of a long transaction, such as NAND
>   erase or write, another transaction on the GPCM or UPM triggers the
>   bus monitor to start immediately for the GPCM or UPM, even though
>   the GPCM or UPM is still waiting for the FCM to finish and has not
>   yet started its transaction. If the bus monitor timeout value is not
>   programmed for a sufficiently large value, the local bus monitor may
>   time out. This timeout corrupts the current NAND Flash operation and
>   terminates the GPCM or UPM operation.
>
>   Impact: Local bus monitor may time out unexpectedly and corrupt the
>   NAND transaction.
>
>   Workaround: Set the local bus monitor timeout value to the maximum
>   by setting LBCR[BMT] = 0 and LBCR[BMTPS] = 0xF.
>
>   Fix plan: No plans to fix
>
> But it seems that erratum is already fixed:
>
>   http://patchwork.ozlabs.org/patch/96339/
>   (git patch d08e44570e)
>
> Am I reading that correctly?

Yes, that erratum has been worked around.
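
For reference, the workaround quoted above boils down to a
read-modify-write of LBCR. Here is a minimal sketch in C of just the bit
manipulation. The mask values and the helper name are my assumptions for
illustration (BMT taken as bits 0x0000ff00 and BMTPS as bits 0x0000000f),
so check them against the reference manual before relying on them:

```c
#include <stdint.h>

/* ASSUMED field masks for the eLBC LBCR register; verify against the
 * reference manual for your part. */
#define LBCR_BMT    0x0000ff00u  /* bus monitor timeout */
#define LBCR_BMTPS  0x0000000fu  /* bus monitor timeout prescaler */

/* Hypothetical helper: return the LBCR value with BMT = 0 and
 * BMTPS = 0xF, leaving every other bit untouched. */
static uint32_t lbcr_max_bus_monitor_timeout(uint32_t lbcr)
{
    lbcr &= ~LBCR_BMT;    /* BMT = 0 */
    lbcr |= LBCR_BMTPS;   /* BMTPS = 0xF */
    return lbcr;
}
```

This only computes the new register value. Actually reading and writing
the register (with the right accessors and byte order) is left out.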

> (I'm already writing only one flash
> sector at a time, but it might be that even a single 0x10000-byte
> sector takes long enough to trigger the issue.)

I don't think this erratum is relevant. Unlike NAND, NOR flash does
not involve holding the localbus for extended periods of time. I also
don't see how it would interact with SATA, which is separate from the
localbus. Are you seeing any errors on the localbus, or just on SATA?

> I also verified that I have the relevant property in my device tree:
>
>     localbus@e0005000 {
>         ...
>         compatible = "fsl,mpc8315-elbc", "fsl,elbc", "simple-bus";
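
On the compatible check: the property is stored as a list of
NUL-terminated strings packed back to back, and what matters is whether
"fsl,elbc" appears as one whole entry. A rough sketch of that test in C,
assuming an exact, case-sensitive comparison is good enough (the function
name is made up for illustration, and this is not the kernel's own
matching code):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return 1 if the packed "compatible" property
 * `prop` (len bytes, entries separated by NUL) contains an entry that
 * is exactly equal to `want`, else 0. */
static int compat_list_contains(const char *prop, size_t len,
                                const char *want)
{
    size_t want_len = strlen(want);
    size_t off = 0;

    while (off < len) {
        /* Find the end of the current entry, bounded by the property. */
        const char *cur = prop + off;
        const char *end = memchr(cur, '\0', len - off);
        size_t n = end ? (size_t)(end - cur) : len - off;

        if (n == want_len && memcmp(cur, want, n) == 0)
            return 1;

        off += n + 1;   /* skip the entry and its terminating NUL */
    }
    return 0;
}
```

With the property from the fragment above, this matches "fsl,elbc" but
not a prefix such as "fsl,elb".
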
>
> So, my questions are:
>
> 1. Is anyone else seeing something like this?
>
> 2. Is there an obvious way for our code to detect that we're in the
>    middle of error recovery, so that we can hold off on writing to
>    the disk until recovery is complete?
>
> 3. Is there any chance that the 1.5Gbps limiting code might have
>    exacerbated the problems?
>
> 4. Should I open a support request with Freescale, or if someone from
>    Freescale is already reading this, could you look to see if anyone
>    else has reported it?

Hopefully Shaohui (our SATA person) can answer these. If you don't get
an answer, go ahead and open an official support request.
-Scott