On 04-Nov-2021, at 11:25 AM, Michael Ellerman <m...@ellerman.id.au> wrote:

Nathan Lynch <nath...@linux.ibm.com> writes:
Nicholas Piggin <npig...@gmail.com> writes:
Excerpts from Michael Ellerman's message of October 29, 2021 11:15 pm:
Nicholas Piggin <npig...@gmail.com> writes:
Excerpts from Athira Rajeev's message of October 29, 2021 1:05 pm:
@@ -631,12 +632,18 @@ static int pseries_migrate_partition(u64 handle)
 	if (ret)
 		return ret;
 
+	/* Disable PMU before suspend */
+	on_each_cpu(&mobility_pmu_disable, NULL, 0);

Why was this moved out of stop machine and to an IPI?

My concern would be, what are the other CPUs doing at this time? Is it 
possible they could take interrupts and schedule? Could that mess up the
perf state here?

pseries_migrate_partition() is called directly from migration_store(),
the sysfs store function, which can be called concurrently by
different CPUs.

It's also potentially called from rtas_syscall_dispatch_ibm_suspend_me(),
from sys_rtas(), again with no locking.

So we could have two CPUs calling into here at the same time, which
might not crash, but is unlikely to work well.

I think the lack of locking might have been OK in the past because only
one CPU will successfully get the other CPUs to call do_join() in
pseries_suspend(). But I could be wrong.

Anyway, now that we're mutating the PMU state before suspending we need
to be more careful. So I think we need a lock around the whole
sequence.

Regardless of the outcome here, I generally agree that some serialization
should be imposed on this path. The way the platform works (and some
extra measures by the drmgr utility) means this code isn't entered
concurrently in normal operation, but it is possible to make that happen
if you are root.

Yeah I agree it's unlikely to be a problem in practice.

A file-static mutex should be OK.

Ack.
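For illustration, something along these lines (a minimal sketch: the mutex
name and the do_migration() placeholder are mine, not from the actual patch,
and the real body of pseries_migrate_partition() is elided):

	static DEFINE_MUTEX(pseries_migration_lock);

	static int pseries_migrate_partition(u64 handle)
	{
		int ret;

		/* Serialize callers from migration_store() and sys_rtas(). */
		mutex_lock(&pseries_migration_lock);
		ret = do_migration(handle);	/* existing suspend/join/resume sequence */
		mutex_unlock(&pseries_migration_lock);

		return ret;
	}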

My concern is still that we wouldn't necessarily have the other CPUs
under control at that point, even if we serialize the migrate path.
They could take interrupts and possibly call into the perf subsystem
after mobility_pmu_disable() (e.g., via a syscall or context switch),
which might mess things up.

I think stop_machine is a reasonable place for this code in this case.
It's low-level work: disabling a hardware facility and saving off its
registers.
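To make that concrete, here is a rough sketch of keeping the PMU
save/restore inside the stop-machine phase. Only stop_machine() is the
real API here; do_join(), do_low_level_suspend() and the mobility_pmu_*
hooks are illustrative placeholders, not the actual mobility.c code:

	/* Runs on every online CPU with interrupts disabled. */
	static int do_join(void *arg)
	{
		int ret;

		mobility_pmu_disable(NULL);		/* freeze and save this CPU's counters */
		ret = do_low_level_suspend(arg);	/* placeholder for the join/suspend step */
		mobility_pmu_enable(NULL);		/* restore and unfreeze this CPU's counters */

		return ret;
	}

	static int pseries_suspend(u64 handle)
	{
		return stop_machine(do_join, &handle, cpu_online_mask);
	}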

That makes sense, but I can't help feeling concerned still. For this to
be safe, power_pmu_enable() and power_pmu_disable() must never sleep or
re-enable interrupts or send IPIs. I don't see anything obviously unsafe
right now, but is that already part of their contract? Is there much
risk they could change in the future to violate those constraints?

That aside, with the proposed change it seems like we would be hacking
around a more generic perf/PMU limitation in a powerpc-specific way. I see
the same behavior on x86 across suspend/resume:

# perf stat -a -e cache-misses -I 1000 & sleep 2 ; systemctl suspend ; sleep 20 ; kill $(jobs -p)
[1] 189806
#           time             counts unit events
    1.000501710          9,983,649      cache-misses
    2.002620321         14,131,072      cache-misses
    3.004579071         23,010,971      cache-misses
    9.971854783 140,737,491,680,853      cache-misses
   10.982669250                  0      cache-misses
   11.984660498                  0      cache-misses
   12.986648392                  0      cache-misses
   13.988561766                  0      cache-misses
   14.992670615                  0      cache-misses
   15.994938111                  0      cache-misses
   16.996703952                  0      cache-misses
   17.999092812                  0      cache-misses
   19.000602677                  0      cache-misses
   20.003272216                  0      cache-misses
   21.004770295                  0      cache-misses
# uname -r
5.13.19-100.fc33.x86_64

That is interesting.

Athira, I guess we should bring that to the perf maintainers and see if
there's any interest in solving the issue in a generic fashion.

Sure, I will work on a proposal to start a discussion with the community.

Thanks
Athira 

cheers
