On Thu, 2020-09-03 at 23:58 +0000, Chris Packham wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you recognize the sender and know the 
> content is safe.
> 
> 
> On 1/09/20 6:14 pm, Nicholas Piggin wrote:
> > Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
> > > On 1/09/20 12:33 am, Heiner Kallweit wrote:
> > > > On 30.08.2020 23:59, Chris Packham wrote:
> > > > > On 31/08/20 9:41 am, Heiner Kallweit wrote:
> > > > > > On 30.08.2020 23:00, Chris Packham wrote:
> > > > > > > On 31/08/20 12:30 am, Nicholas Piggin wrote:
> > > > > > > > Excerpts from Chris Packham's message of August 28, 2020 8:07 
> > > > > > > > am:
> > > > > > > <snip>
> > > > > > > 
> > > > > > > > > > > > > I've also now seen the RX FIFO not empty error on the 
> > > > > > > > > > > > > T2080RDB
> > > > > > > > > > > > > 
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON 
> > > > > > > > > > > > > isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON 
> > > > > > > > > > > > > isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON 
> > > > > > > > > > > > > isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON 
> > > > > > > > > > > > > isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx 
> > > > > > > > > > > > > fifo's aren't empty!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 
> > > > > > > > > > > > > 32
> > > > > > > > > > > > > 
> > > > > > > > > > > > > With my current workaround of emptying the RX FIFO. 
> > > > > > > > > > > > > It seems
> > > > > > > > > > > > > survivable. Interestingly it only ever seems to be 1 
> > > > > > > > > > > > > extra byte in the
> > > > > > > > > > > > > RX FIFO and it seems to be after either a READ_SR or 
> > > > > > > > > > > > > a READ_FSR.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 70
> > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 03
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 00
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON 
> > > > > > > > > > > > > isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx 
> > > > > > > > > > > > > fifo's aren't empty!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 
> > > > > > > > > > > > > 32
> > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 05
> > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 00
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON 
> > > > > > > > > > > > > isn't set!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx 
> > > > > > > > > > > > > fifo's aren't empty!
> > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 
> > > > > > > > > > > > > 32
> > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 05
> > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 00
> > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > > > > > > > > > 
> > > > > > > > > > > > >      From all the Micron SPI-NOR datasheets I've got 
> > > > > > > > > > > > > access to it is
> > > > > > > > > > > > > possible to continually read the SR/FSR. But I've no 
> > > > > > > > > > > > > idea why it
> > > > > > > > > > > > > happens some times and not others.
> > > > > > > > > > > > So I think I've got a reproduction and I think I've 
> > > > > > > > > > > > bisected the problem
> > > > > > > > > > > > to commit 3282a3da25bd ("powerpc/64: Implement soft 
> > > > > > > > > > > > interrupt replay in
> > > > > > > > > > > > C"). My day is just finishing now so I haven't applied 
> > > > > > > > > > > > too much scrutiny
> > > > > > > > > > > > to this result. Given the various rabbit holes I've 
> > > > > > > > > > > > been down on this
> > > > > > > > > > > > issue already I'd take this information with a good 
> > > > > > > > > > > > degree of skepticism.
> > > > > > > > > > > > 
> > > > > > > > > > > OK, so an easy test should be to re-test with a 5.4 
> > > > > > > > > > > kernel.
> > > > > > > > > > > It doesn't have yet the change you're referring to, and 
> > > > > > > > > > > the fsl-espi driver
> > > > > > > > > > > is basically the same as in 5.7 (just two small changes 
> > > > > > > > > > > in 5.7).
> > > > > > > > > > There's 6cc0c16d82f88 and maybe also other interrupt 
> > > > > > > > > > related patches
> > > > > > > > > > around this time that could affect book E, so it's good if 
> > > > > > > > > > that exact
> > > > > > > > > > patch is confirmed.
> > > > > > > > > My confirmation is basically that I can induce the issue in a 
> > > > > > > > > 5.4 kernel
> > > > > > > > > by cherry-picking 3282a3da25bd. I'm also able to "fix" the 
> > > > > > > > > issue in
> > > > > > > > > 5.9-rc2 by reverting that one commit.
> > > > > > > > > 
> > > > > > > > > I both cases it's not exactly a clean cherry-pick/revert so I 
> > > > > > > > > also
> > > > > > > > > confirmed the bisection result by building at 3282a3da25bd 
> > > > > > > > > (which sees
> > > > > > > > > the issue) and the commit just before (which does not).
> > > > > > > > Thanks for testing, that confirms it well.
> > > > > > > > 
> > > > > > > > [snip patch]
> > > > > > > > 
> > > > > > > > > I still saw the issue with this change applied. 
> > > > > > > > > PPC_IRQ_SOFT_MASK_DEBUG
> > > > > > > > > didn't report anything (either with or without the change 
> > > > > > > > > above).
> > > > > > > > Okay, it was a bit of a shot in the dark. I still can't see what
> > > > > > > > else has changed.
> > > > > > > > 
> > > > > > > > What would cause this, a lost interrupt? A spurious interrupt? 
> > > > > > > > Or
> > > > > > > > higher interrupt latency?
> > > > > > > > 
> > > > > > > > I don't think the patch should cause significantly worse 
> > > > > > > > latency,
> > > > > > > > (it's supposed to be a bit better if anything because it 
> > > > > > > > doesn't set
> > > > > > > > up the full interrupt frame). But it's possible.
> > > > > > > My working theory is that the SPI_DON indication is all about the 
> > > > > > > TX
> > > > > > > direction an now that the interrupts are faster we're hitting an 
> > > > > > > error
> > > > > > > because there is still RX activity going on. Heiner disagrees 
> > > > > > > with my
> > > > > > > interpretation of the SPI_DON indication and the fact that it 
> > > > > > > doesn't
> > > > > > > happen every time does throw doubt on it.
> > > > > > > 
> > > > > > It's right that the eSPI spec can be interpreted that SPI_DON 
> > > > > > refers to
> > > > > > TX only. However this wouldn't really make sense, because also for 
> > > > > > RX
> > > > > > we program the frame length, and therefore want to be notified once 
> > > > > > the
> > > > > > full frame was received. Also practical experience shows that 
> > > > > > SPI_DON
> > > > > > is set also after RX-only transfers.
> > > > > > Typical SPI NOR use case is that you write read command + start 
> > > > > > address,
> > > > > > followed by a longer read. If the TX-only interpretation would be 
> > > > > > right,
> > > > > > we'd always end up with SPI_DON not being set.
> > > > > > 
> > > > > > > I can't really explain the extra RX byte in the fifo. We know how 
> > > > > > > many
> > > > > > > bytes to expect and we pull that many from the fifo so it's not 
> > > > > > > as if
> > > > > > > we're missing an interrupt causing us to skip the last byte. I've 
> > > > > > > been
> > > > > > > looking for some kind of off-by-one calculation but again if it 
> > > > > > > were
> > > > > > > something like that it'd happen all the time.
> > > > > > > 
> > > > > > Maybe it helps to know what value this extra byte in the FIFO has. 
> > > > > > Is it:
> > > > > > - a duplicate of the last read byte
> > > > > > - or the next byte (at <end address> + 1)
> > > > > > - or a fixed value, e.g. always 0x00 or 0xff
> > > > > The values were up thread a bit but I'll repeat them here
> > > > > 
> > > > > fsl_espi ffe110000.spi: tx 70
> > > > > fsl_espi ffe110000.spi: rx 03
> > > > > fsl_espi ffe110000.spi: Extra RX 00
> > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > fsl_espi ffe110000.spi: tx 05
> > > > > fsl_espi ffe110000.spi: rx 00
> > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
> > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
> > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> > > > > fsl_espi ffe110000.spi: tx 05
> > > > > fsl_espi ffe110000.spi: rx 00
> > > > > fsl_espi ffe110000.spi: Extra RX 03
> > > > > 
> > > > > 
> > > > > The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them 
> > > > > with
> > > > > either a READ_SR or a READ_FSR. Never a data read.
> > > > > 
> > > > Just remembered something about SPIE_DON:
> > > > Transfers are always full duplex, therefore in case of a read the chip
> > > > sends dummy zero's. Having said that in case of a read SPIE_DON means
> > > > that the last dummy zero was shifted out.
> > > > 
> > > > READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte 
> > > > in.
> > > > So the issue may have a dependency on the length of the transfer.
> > > > However I see no good explanation so far. You can try adding a delay of
> > > > a few miroseconds between the following to commands in fsl_espi_bufs().
> > > > 
> > > >     fsl_espi_write_reg(espi, ESPI_SPIM, mask);
> > > > 
> > > >     /* Prevent filling the fifo from getting interrupted */
> > > >     spin_lock_irq(&espi->lock);
> > > > 
> > > > Maybe enabling interrupts and seeing the SPIE_DON interrupt are too 
> > > > close.
> > > I think this might be heading in the right direction. Playing about with
> > > a delay does seem to make the two symptoms less likely. Although I have
> > > to set it quite high (i.e. msleep(100)) to completely avoid any
> > > possibility of seeing either message.
> > The patch might replay the interrupt a little bit faster, but it would
> > be a few microseconds at most I think (just from improved code).
> > 
> > Would you be able to ftrace the interrupt handler function and see if you
> > can see a difference in number or timing of interrupts? I'm at a bit of
> > a loss.
> 
> I tried ftrace but I really wasn't sure what I was looking for.
> Capturing a "bad" case was pretty tricky. But I think I've identified a
> fix (I'll send it as a proper patch shortly). The gist is
> 
> diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
> index 7e7c92cafdbb..cb120b68c0e2 100644
> --- a/drivers/spi/spi-fsl-espi.c
> +++ b/drivers/spi/spi-fsl-espi.c
> @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi
> *espi, u32 events)
>   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
>   {
>          struct fsl_espi *espi = context_data;
> -       u32 events;
> +       u32 events, mask;
> 
>          spin_lock(&espi->lock);
> 
>          /* Get interrupt events(tx/rx) */
>          events = fsl_espi_read_reg(espi, ESPI_SPIE);
> -       if (!events) {
> +       mask = fsl_espi_read_reg(espi, ESPI_SPIM);
> +       if (!(events & mask)) {
>                  spin_unlock(&espi->lock);
>                  return IRQ_NONE;
>          }
> 
> The SPIE register contains the TXCNT so events is pretty much always
> going to have something set. By checking events against what we've
> actually requested interrupts for we don't see any spurious events.
> 
> I've tested this on the T2080RDB and on our custom hardware and it seems
> to resolve the problem.
> 

I looked at the fsl_espi_irq() too and noticed that clearing of the IRQ events
are after processing TX/RX. That looks a bit odd to me.

  Jocke

Reply via email to