[PATCH] x86/hpet: Read HPET directly if panic in progress

2024-05-27 Thread Tony W Wang-oc
When the system clocksource is HPET, a CPU executing read_hpet() may be
interrupted by an exception whose handler ends up calling panic(). This
can leave read_hpet() spinning forever:

CPU x                            CPU x (#MC exception on the same CPU)
read_hpet()
  arch_spin_trylock(&hpet.lock)
  [CPU x got the hpet.lock]      #MCE happened
                                 do_machine_check()
                                   mce_panic()
                                     panic()
                                       kmsg_dump()
                                         pstore_dump()
                                           pstore_record_init()
                                             ktime_get_real_fast_ns()
                                               read_hpet()
                                               [dead loops]

To avoid this deadlock, read the HPET directly if a panic is in progress.

Signed-off-by: Tony W Wang-oc 
---
 arch/x86/kernel/hpet.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index c96ae8fee95e..ecadd0698d6a 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
 	if (in_nmi())
 		return (u64)hpet_readl(HPET_COUNTER);
 
+	/*
+	 * Read HPET directly if panic in progress.
+	 */
+	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
+		return (u64)hpet_readl(HPET_COUNTER);
+
 	/*
 	 * Read the current state of the lock and HPET value atomically.
 	 */
-- 
2.25.1
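For context on why a stuck lock holder live-locks every other caller: the
existing read path serializes HPET reads with a small trylock/publish scheme.
The sketch below is a condensed paraphrase of that path in
arch/x86/kernel/hpet.c, not verbatim code; details differ between kernel
versions.

/*
 * Condensed paraphrase of the lock-protected HPET read path. A reader
 * either takes hpet.lock, reads the counter and publishes the value, or
 * spins until the current holder publishes a newer value. A holder that
 * panics while holding the lock never publishes anything, so every later
 * caller spins forever in the contended loop.
 */
union hpet_lock {
	struct {
		arch_spinlock_t lock;
		u32 value;
	};
	u64 lockval;
};

static union hpet_lock hpet __cacheline_aligned = {
	{ .lock = __ARCH_SPIN_LOCK_UNLOCKED, },
};

static u64 read_hpet(struct clocksource *cs)
{
	unsigned long flags;
	union hpet_lock old, new;

	if (in_nmi())
		return (u64)hpet_readl(HPET_COUNTER);

	/* Snapshot lock state and last published value atomically. */
	old.lockval = READ_ONCE(hpet.lockval);
	if (arch_spin_is_locked(&old.lock))
		goto contended;

	local_irq_save(flags);
	if (arch_spin_trylock(&hpet.lock)) {
		new.value = hpet_readl(HPET_COUNTER);
		WRITE_ONCE(hpet.value, new.value);	/* publish for spinners */
		arch_spin_unlock(&hpet.lock);
		local_irq_restore(flags);
		return (u64)new.value;
	}
	local_irq_restore(flags);

contended:
	/* Wait until the lock holder publishes an updated value. */
	do {
		cpu_relax();
		new.lockval = READ_ONCE(hpet.lockval);
	} while ((new.value == old.value) && arch_spin_is_locked(&new.lock));

	return (u64)new.value;
}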




Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

2024-05-28 Thread Tony W Wang-oc




On 2024/5/29 06:12, Thomas Gleixner wrote:




On Tue, May 28 2024 at 07:18, Dave Hansen wrote:

On 5/27/24 23:38, Tony W Wang-oc wrote:
...> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c

index c96ae8fee95e..ecadd0698d6a 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
 	if (in_nmi())
 		return (u64)hpet_readl(HPET_COUNTER);
 
+	/*
+	 * Read HPET directly if panic in progress.
+	 */
+	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
+		return (u64)hpet_readl(HPET_COUNTER);
+


There is literally one other piece of the code in the kernel doing
something similar: the printk() implementation.  There's no other
clocksource or timekeeping code that does this on any architecture.

Why doesn't this problem apply to any other clock sources?


In principle it applies to any clocksource which needs a spinlock to
serialize access. HPET is not the only insanity here.

Think about i8253 :)

Most real clocksources, like TSC and the majority of the preferred clock
sources on other architectures, are perfectly fine. They just read and are
done.


Why should the problem be fixed in the clock sources themselves?  Why
doesn't printk() deadlock on systems using the HPET?


Because regular printk()s are deferred to irq work when in NMI and
similar contexts, but that obviously does not apply to panic
situations. Also NMI is treated special even in the HPET code. See
below.


In other words, I think we should fix pstore to be more like printk
rather than hacking around this in each clock source.


pstore is perfectly fine. It uses an NMI-safe time accessor function
which is then tripping over the HPET lock. That's really a HPET-specific
problem.

Though what I read out of the changelog is that the MCE hits the same
CPU 'x' which holds the lock. But that's fairy tale material as you can
see in the patch above:

 if (in_nmi())
 return (u64)hpet_readl(HPET_COUNTER);

For that particular case the dead lock, which would actually be a live
lock, cannot happen because in kernel MCEs are NMI class exceptions and
therefore in_nmi() evaluates to true and that new voodoo can't be
reached at all.

Now there are two other scenarios which really can make that happen:

  1) A non-NMI class exception within the lock held region

 CPU A
 acquire(hpet_lock);
 ... <- #PF, #GP, #DE, ... -> panic()

 If any of that happens within that lock held section then the live
 lock on the hpet_lock is the least of your worries. Seriously, I
 don't care about this at all.

  2) The actual scenario is:

 CPU A                     CPU B
 lock(hpet_lock)
                           MCE hits user space
                           ...
                           exc_machine_check_user()
                             irqentry_enter_from_user_mode(regs);

 irqentry_enter_from_user_mode() obviously does not mark the
 exception as NMI class, so in_nmi() evaluates to false. That would
 actually deadlock if CPU A does not make progress and release
 hpet_lock.

 Sounds unlikely to happen, right? But in reality it can because of
 MCE broadcast. Assume that both CPUs go into MCE:

 CPU A                                       CPU B
 lock(hpet_lock)
                                             exc_machine_check_user()
                                               irqentry_enter_from_user_mode();
 exc_machine_check_kernel()                    do_machine_check()
   irqentry_nmi_enter();                         mce_panic()
   do_machine_check()                              if (atomic_inc_return(&mce_panicked) > 1)
     mce_panic()                                       wait_for_panic();  <- Not taken
       if (atomic_inc_return(&mce_panicked) > 1)
           wait_for_panic();  <- Taken
                                                   ...
                                                   hpet_read()

 -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
obviously can't release the lock.



Because the MCE handler calls printk() before calling panic(), a
printk() deadlock may happen in this scenario:


CPU A                              CPU B
printk()
  lock(console_owner_lock)
                                   exc_machine_check_user()
                                     irqentry_enter_from_user_mode()
exc_machine_check_kernel()           do_machine_check()
  irqentry_nmi_enter()                 mce_panic()
  do_machine_check()                     print_mce()  #A
    mce_panic()                          ...
      wait_for_panic()                   panic()

The printk() deadlock happens at #A because in_nmi() evaluates to false
on CPU B and CPU B has not yet entered panic() at #A.
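
(For reference: the condition the patch checks is the same one printk itself
uses to detect an in-progress panic. A paraphrased sketch of that helper, as
found in kernel/printk/printk.c in recent kernels:)

/*
 * Paraphrased from kernel/printk/printk.c (recent kernels): printk's own
 * panic detection keys on the same panic_cpu state that the proposed
 * HPET change open-codes.
 */
static bool panic_in_progress(void)
{
	return unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID);
}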


Would upgrading the user-space MCE handler to NMI class context be preferred?

Sincerely
TonyWWang-oc


So the proposed patch makes sense t

Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

2024-05-28 Thread Tony W Wang-oc




On 2024/5/29 12:39, Tony W Wang-oc wrote:



On 2024/5/29 06:12, Thomas Gleixner wrote:




On Tue, May 28 2024 at 07:18, Dave Hansen wrote:

On 5/27/24 23:38, Tony W Wang-oc wrote:
...> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c

index c96ae8fee95e..ecadd0698d6a 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
 	if (in_nmi())
 		return (u64)hpet_readl(HPET_COUNTER);
 
+	/*
+	 * Read HPET directly if panic in progress.
+	 */
+	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
+		return (u64)hpet_readl(HPET_COUNTER);
+


There is literally one other piece of the code in the kernel doing
something similar: the printk() implementation.  There's no other
clocksource or timekeeping code that does this on any architecture.

Why doesn't this problem apply to any other clock sources?


In principle it applies to any clocksource which needs a spinlock to
serialize access. HPET is not the only insanity here.

Think about i8253 :)

Most real clocksources, like TSC and the majority of the preferred clock
sources on other architectures, are perfectly fine. They just read and are
done.


Why should the problem be fixed in the clock sources themselves?  Why
doesn't printk() deadlock on systems using the HPET?


Because regular printk()s are deferred to irq work when in NMI and
similar contexts, but that obviously does not apply to panic
situations. Also NMI is treated special even in the HPET code. See
below.


In other words, I think we should fix pstore to be more like printk
rather than hacking around this in each clock source.


pstore is perfectly fine. It uses an NMI-safe time accessor function
which is then tripping over the HPET lock. That's really a HPET-specific
problem.

Though what I read out of the changelog is that the MCE hits the same
CPU 'x' which holds the lock. But that's fairy tale material as you can
see in the patch above:

 if (in_nmi())
 return (u64)hpet_readl(HPET_COUNTER);

For that particular case the dead lock, which would actually be a live
lock, cannot happen because in kernel MCEs are NMI class exceptions and
therefore in_nmi() evaluates to true and that new voodoo can't be
reached at all.

Now there are two other scenarios which really can make that happen:

  1) A non-NMI class exception within the lock held region

 CPU A
 acquire(hpet_lock);
 ... <- #PF, #GP, #DE, ... -> panic()

 If any of that happens within that lock held section then the live
 lock on the hpet_lock is the least of your worries. Seriously, I
 don't care about this at all.



Actually, this scenario is what this patch is trying to solve.

We encountered the hpet_lock deadlock from the call path of the MCE handler,
and the same deadlock can happen whenever another exception handler, such as
#PF or #GP, ends up calling panic(). So read_hpet() should avoid the deadlock
if a panic is in progress.

Sincerely
TonyWWang-oc
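
(A note on why the panic_cpu check also covers this case: panic() records the
panicking CPU before it reaches kmsg_dump()/pstore, so by the time the panic
path re-enters read_hpet(), the proposed early return fires even on the CPU
that still holds hpet.lock. Below is a condensed paraphrase of the start of
panic() from kernel/panic.c; the exact form varies across kernel versions,
and older kernels use atomic_cmpxchg() here.)

void panic(const char *fmt, ...)
{
	int old_cpu, this_cpu;

	local_irq_disable();

	/*
	 * Only one CPU is allowed to execute the panic code. panic_cpu is
	 * set here, well before kmsg_dump()/pstore run, which is what the
	 * proposed read_hpet() check keys on.
	 */
	this_cpu = raw_smp_processor_id();
	old_cpu  = PANIC_CPU_INVALID;
	if (!atomic_try_cmpxchg(&panic_cpu, &old_cpu, this_cpu) &&
	    old_cpu != this_cpu)
		panic_smp_self_stop();

	/*
	 * console_verbose(), bust_spinlocks(), dumping, kmsg_dump() and the
	 * rest of the panic path follow here (omitted).
	 */
}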


  2) The actual scenario is:

 CPU A                     CPU B
 lock(hpet_lock)
                           MCE hits user space
                           ...
                           exc_machine_check_user()
                             irqentry_enter_from_user_mode(regs);

 irqentry_enter_from_user_mode() obviously does not mark the
 exception as NMI class, so in_nmi() evaluates to false. That would
 actually deadlock if CPU A does not make progress and release
 hpet_lock.

 Sounds unlikely to happen, right? But in reality it can because of
 MCE broadcast. Assume that both CPUs go into MCE:

 CPU A                                       CPU B
 lock(hpet_lock)
                                             exc_machine_check_user()
                                               irqentry_enter_from_user_mode();
 exc_machine_check_kernel()                    do_machine_check()
   irqentry_nmi_enter();                         mce_panic()
   do_machine_check()                              if (atomic_inc_return(&mce_panicked) > 1)
     mce_panic()                                       wait_for_panic();  <- Not taken
       if (atomic_inc_return(&mce_panicked) > 1)
           wait_for_panic();  <- Taken
                                                   ...
                                                   hpet_read()

 -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
    obviously can't release the lock.



Because the MCE handler calls printk() before calling panic(), a
printk() deadlock may happen in this scenario:


CPU A                              CPU B
printk()
  lock(console_owner_lock)
                                   exc_machine_check_user()
                                     irqentry_enter_from_user_mode()
exc_machine_check_kernel()           do_machine_check()
  irqentry_nmi_enter()

Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

2024-05-29 Thread Tony W Wang-oc




On 2024/5/29 15:45, Thomas Gleixner wrote:




On Wed, May 29 2024 at 14:44, Tony W Wang-oc wrote:

Actually, this scenario is what this patch is trying to solve.

We encountered hpet_lock deadlock from the call path of the MCE handler,
and this hpet_lock deadlock scenario may happen when others exceptions'
handler like #PF/#GP... to call the panic. So read_hpet should avoid
deadlock if panic in progress.


That's not what your changelog says.


Yes.
The example flow I gave with the MCE handler in my changelog is misleading.



Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

2024-06-04 Thread Tony W Wang-oc




On 2024/5/29 15:42, Thomas Gleixner wrote:




Linus!

On Tue, May 28 2024 at 16:22, Linus Torvalds wrote:

On Tue, 28 May 2024 at 15:12, Thomas Gleixner  wrote:
I see the smiley, but yeah, I don't think we really care about it.


Indeed. But the same problem exists on other architectures as
well. drivers/clocksource alone has 4 examples aside from i8253.


   1) Should we provide a panic mode read callback for clocksources which
  are affected by this?


The current patch under discussion may be ugly, but looks workable.
Local ugliness isn't necessarily a show-stopper.

So if the HPET is the *only* case which has this situation, I vote for
just doing the ugly thing.

Now, if *other* cases exist, and can't be worked around in similar
ways, then that argues for a more "proper" fix.

And no, I don't think i8253 is a strong enough argument. I don't
actually believe you can realistically find a machine that doesn't
have HPET or the TSC and really falls back on the i8253 any more. And
if you *do* find hw like that, is it SMP-capable? And can you find
somebody who cares?


Probably not.


   2) Is it correct to claim that an MCE which hits user space and ends up in
  mce_panic() is still just a regular exception, or should we upgrade to
  NMI class context when we enter mce_panic(), or even go as far as to
  upgrade to NMI class context for any panic() invocation?




After an MCE has occurred, it is possible for the MCE handler to execute
add_taint() without panicking, for example when fake_panic is configured.


So the approach in the patch above does not seem to cover the printk()
deadlock caused by add_taint() in the MCE handler when an MCE occurs in
user space.


Sincerely
TonyWWang-oc


I do think that an NMI in user space should be considered mostly just
a normal exception. From a kernel perspective, the NMI'ness just
doesn't matter.


That's correct. I don't want to change that at all especially not for
recoverable MCEs.


That said, I find your suggestion of making 'panic()' just basically
act as an NMI context intriguing. And cleaner than the
atomic_read(&panic_cpu) thing.

Are there any other situations than this odd HPET thing where that
would change semantics?


I need to go and stare at this some more.

Thanks,

 tglx
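
(For reference on what "upgrading to NMI class context" would mean
mechanically: in_nmi() only checks the NMI bits of the per-CPU preempt count,
and the nmi_enter()/nmi_exit() helpers set and clear those bits. The sketch
below paraphrases the relevant definitions from include/linux/preempt.h and
include/linux/hardirq.h, heavily simplified; it illustrates the idea being
discussed, not merged code.)

/* Simplified paraphrase: in_nmi() is purely a preempt-count check ... */
#define nmi_count()	(preempt_count() & NMI_MASK)
#define in_nmi()	(nmi_count())

/*
 * ... and entering NMI context just adds the corresponding bits (the real
 * helpers also handle lockdep, RCU and arch hooks). Marking the panic path
 * as NMI class would therefore make every existing in_nmi() fast path,
 * including the one already in read_hpet(), trigger without touching
 * individual clocksources.
 */
#define __nmi_enter()	__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET)
#define __nmi_exit()	__preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET)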




Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

2024-06-05 Thread Tony W Wang-oc




On 2024/6/5 17:36, Borislav Petkov wrote:




On Wed, Jun 05, 2024 at 02:23:32PM +0800, Tony W Wang-oc wrote:

After an MCE has occurred, it is possible for the MCE handler to execute
add_taint() without panicking, for example when fake_panic is configured.


fake_panic is an ancient debugging leftover. If you set it, you get what
you deserve.



It may also happen without setting fake_panic, for example when an MCE
error of the UCNA/SRAO type occurs?


Sincerely
TonyWWang-oc


--
Regards/Gruss,
 Boris.

https://people.kernel.org/tglx/notes-about-netiquette




Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

2024-06-05 Thread Tony W Wang-oc




On 2024/6/5 19:33, Borislav Petkov wrote:




On Wed, Jun 05, 2024 at 06:10:07PM +0800, Tony W Wang-oc wrote:

It may also happen without setting fake_panic, for example when an MCE
error of the UCNA/SRAO type occurs?


Which types exactly do you mean when you're looking at the severities[]
array in severity.c?

And what scenario are you talking about?

To get an #MC exception and detect only UCNA/SRAO errors? Can that even
happen on any hardware?



Yes, I mean an #MC exception happens and detects only SRAO errors like
the ones below:


	MCESEV(
		AO, "Action optional: memory scrubbing error",
		SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK, MCI_STATUS_UC|MCACOD_SCRUB)
		),
	MCESEV(
		AO, "Action optional: last level cache writeback error",
		SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
		),

I think these errors are actually encountered on some platforms that
support reporting these types of errors via #MC.


Sincerely
TonyWWang-oc
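
(For readers unfamiliar with the severity table quoted above: each MCESEV()
entry is matched against the raw MCi_STATUS value, and MASK(m, r) means the
entry applies when (status & m) == r. Below is a simplified, paraphrased
sketch of the matching logic in arch/x86/kernel/cpu/mce/severity.c; the
struct and function names are illustrative, and the real code also checks
execution context, SER capability and more.)

struct severity {
	u64 mask;		/* which MCi_STATUS bits to look at        */
	u64 result;		/* required value of those bits            */
	unsigned char sev;	/* resulting severity, e.g. AO for the two */
				/* "Action optional" entries quoted above  */
};

/* Illustrative lookup (simplified): the first matching table entry wins. */
static int severity_lookup(const struct severity *tbl, int n, u64 status)
{
	int i;

	for (i = 0; i < n; i++) {
		if ((status & tbl[i].mask) == tbl[i].result)
			return tbl[i].sev;
	}
	return -1;	/* the real table ends with a catch-all entry */
}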

--
Regards/Gruss,
 Boris.

https://people.kernel.org/tglx/notes-about-netiquette




Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

2024-06-06 Thread Tony W Wang-oc




On 2024/6/5 23:51, Luck, Tony wrote:





Which types exactly do you mean when you're looking at the severities[]
array in severity.c?

And what scenario are you talking about?

To get an #MC exception and detect only UCNA/SRAO errors? Can that even
happen on any hardware?



Yes, I mean an #MC exception happens and detects only SRAO errors like
the ones below:

	MCESEV(
		AO, "Action optional: memory scrubbing error",
		SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK, MCI_STATUS_UC|MCACOD_SCRUB)
		),
	MCESEV(
		AO, "Action optional: last level cache writeback error",
		SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
		),

I think these errors are actually encountered on some platforms that
support reporting these types of errors via #MC.


Intel servers from Nehalem through Cascade Lake reported memory controller
patrol scrub uncorrected error with #MC and SRAO signature.

Icelake and newer use CMCI with a UCNA signature.



I have a question: does Intel use #MC to report UCNA errors?

Sincerely
TonyWWang-oc



Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

2024-07-01 Thread Tony W Wang-oc




On 2024/5/29 06:12, Thomas Gleixner wrote:




On Tue, May 28 2024 at 07:18, Dave Hansen wrote:

On 5/27/24 23:38, Tony W Wang-oc wrote:
...> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c

index c96ae8fee95e..ecadd0698d6a 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
 	if (in_nmi())
 		return (u64)hpet_readl(HPET_COUNTER);
 
+	/*
+	 * Read HPET directly if panic in progress.
+	 */
+	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
+		return (u64)hpet_readl(HPET_COUNTER);
+


There is literally one other piece of the code in the kernel doing
something similar: the printk() implementation.  There's no other
clocksource or timekeeping code that does this on any architecture.

Why doesn't this problem apply to any other clock sources?


In principle it applies to any clocksource which needs a spinlock to
serialize access. HPET is not the only insanity here.

Think about i8253 :)

Most real clocksources, like TSC and the majority of the preferred clock
sources on other architectures, are perfectly fine. They just read and are
done.


Why should the problem be fixed in the clock sources themselves?  Why
doesn't printk() deadlock on systems using the HPET?


Because regular printk()s are deferred to irq work when in NMI and
similar contexts, but that obviously does not apply to panic
situations. Also NMI is treated special even in the HPET code. See
below.


In other words, I think we should fix pstore to be more like printk
rather than hacking around this in each clock source.


pstore is perfectly fine. It uses an NMI-safe time accessor function
which is then tripping over the HPET lock. That's really a HPET-specific
problem.

Though what I read out of the changelog is that the MCE hits the same
CPU 'x' which holds the lock. But that's fairy tale material as you can
see in the patch above:

 if (in_nmi())
 return (u64)hpet_readl(HPET_COUNTER);

For that particular case the dead lock, which would actually be a live
lock, cannot happen because in kernel MCEs are NMI class exceptions and
therefore in_nmi() evaluates to true and that new voodoo can't be
reached at all.

Now there are two other scenarios which really can make that happen:

  1) A non-NMI class exception within the lock held region

 CPU A
 acquire(hpet_lock);
 ... <- #PF, #GP, #DE, ... -> panic()

 If any of that happens within that lock held section then the live
 lock on the hpet_lock is the least of your worries. Seriously, I
 don't care about this at all.

  2) The actual scenario is:

 CPU A                     CPU B
 lock(hpet_lock)
                           MCE hits user space
                           ...
                           exc_machine_check_user()
                             irqentry_enter_from_user_mode(regs);

 irqentry_enter_from_user_mode() obviously does not mark the
 exception as NMI class, so in_nmi() evaluates to false. That would
 actually deadlock if CPU A does not make progress and release
 hpet_lock.

 Sounds unlikely to happen, right? But in reality it can because of
 MCE broadcast. Assume that both CPUs go into MCE:

 CPU A                                       CPU B
 lock(hpet_lock)
                                             exc_machine_check_user()
                                               irqentry_enter_from_user_mode();
 exc_machine_check_kernel()                    do_machine_check()
   irqentry_nmi_enter();                         mce_panic()
   do_machine_check()                              if (atomic_inc_return(&mce_panicked) > 1)
     mce_panic()                                       wait_for_panic();  <- Not taken
       if (atomic_inc_return(&mce_panicked) > 1)
           wait_for_panic();  <- Taken
                                                   ...
                                                   hpet_read()

 -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
obviously can't release the lock.



For this scenario, an experiment was designed for printk():
a. Install a driver module that repeatedly sends IPIs to multiple cores to
   execute printk().
b. Run a user-level test tool such as stream on all cores.
c. Trigger an MCE hardware error.
During burn-in tests a-c, the following case was reproduced:
During burnin tests a-c, reproduce the following case:

CPU A                              CPU B
printk()
  console_owner
                                   exc_machine_check_user()
                                     irqentry_enter_from_user_mode()
exc_machine_check_kernel()           do_machine_check()
  irqentry_nmi_enter()                 mce_panic()
  do_machine_check()                     print_mce()
                                           ...
  ...                                      while (console_waiter)