Hello,

On Wed, Feb 19, 2025 at 10:29:45PM +0000, Jonathan McDowell wrote:
> On Wed, Jan 29, 2025 at 04:27:15PM +0100, Michal Suchánek wrote:
> > Hello,
> > 
> > there is a problem report that booting a specific type of system about
> > 0.1% of the time encrypted volume (using a PCR to release the key) fails
> > to unlock because of TPM operation timeout.
> > 
> > Minimizing the test case failed so far.
> > 
> > For example, booting into text mode as opposed to graphical desktop
> > makes the problem unreproducible.
> > 
> > The test is done with a frankenkernel that has TPM drivers about on par
> > with Linux 6.4 but using actual Linux 6.4 the problem is not
> > reproducible, either.
> > 
> > However, given the problem takes up to a day to reproduce I do not have
> > much confidence in the negative results.
> 
> Michal, can you possibly try the below and see if it helps out? There
> seems to be a timing bug introduced in 6.4+ that I think might be
> related, and matches up with some of our internal metrics that showed an
> increase in timeouts in 6.4 onwards.

Thanks for looking into this

> 
> commit 79041fba797d0fe907e227012767f56dd93fac32
> Author: Jonathan McDowell <nood...@meta.com>
> Date:   Wed Feb 19 16:20:44 2025 -0600
> 
>     tpm, tpm_tis: Fix timeout handling when waiting for TPM status
>     
>     The change to only use interrupts to handle supported status changes,
>     then switch to polling for the rest, inverted the status test and sleep
>     such that we can end up sleeping beyond our timeout and not actually
>     checking the status. This can result in spurious TPM timeouts,
>     especially on a more loaded system. Fix by switching the order back so
>     we sleep *then* check. We've done a up front check when we enter the
>     function so this won't cause an additional delay when the status is
>     already what we're looking for.
>     
>     Cc: sta...@vger.kernel.org # v6.4+
>     Fixes: e87fcf0dc2b4 ("tpm, tpm_tis: Only handle supported interrupts")
>     Signed-off-by: Jonathan McDowell <nood...@meta.com>
> 
> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> index fdef214b9f6b..167d71747666 100644
> --- a/drivers/char/tpm/tpm_tis_core.c
> +++ b/drivers/char/tpm/tpm_tis_core.c
> @@ -114,11 +114,11 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 
> mask,
>               return 0;
>       /* process status changes without irq support */
>       do {
> +             usleep_range(priv->timeout_min,
> +                          priv->timeout_max);

What would be the priv->timeout_min and priv->timeout_max here?

Note that there are timeouts that are 200ms, and are overblown by 2s.

If the 200ms timeout relies on the sleep during the wait for the timeout
being much longer than the timeout itself then the timeout is arguably
bogus regardless of this change helping.

Thanks

Michal

>               status = chip->ops->status(chip);
>               if ((status & mask) == mask)
>                       return 0;
> -             usleep_range(priv->timeout_min,
> -                          priv->timeout_max);
>       } while (time_before(jiffies, stop));
>       return -ETIME;
>  }

Reply via email to