On Wed, May 22, 2019 at 02:23:47PM -0400, Rik van Riel wrote:
> On Wed, 2019-05-22 at 16:49 +0200, Peter Zijlstra wrote:
> > On Wed, May 22, 2019 at 03:37:11PM +0100, Andrew Murray wrote:
> > > > Is perhaps the problem that on_each_cpu_cond() uses
> > > > cpu_online_mask without protection?
> > >
> > > Does this prevent racing with a CPU going offline? I guess this
> > > prevents the warning at the expense of a lock - but it is only
> > > beneficial in the unlikely path. (In the likely path this prevents
> > > new CPUs going offline, but we don't care, because we don't WARN if
> > > they aren't there when we attempt to call functions.)
> > >
> > > At least this is my limited understanding.
> >
> > Hmm.. I don't think it could matter; we only use the mask with
> > preemption disabled, which would already block offline, since
> > offlining uses stop_machine().
> >
> > So the patch is a no-op.
> >
> > What's the WARN you see? TLB invalidation should pass mm_cpumask(),
> > which similarly should not contain offline CPUs, I'm thinking.
>
> Does the TLB invalidation code have anything in it
> to prevent it from racing with the CPU offline code?
>
> In other words, could we end up with the TLB
> invalidation code building its bitmask, getting
> interrupted (e.g. hypervisor preemption, NMI),
> and not sending out the IPI to that bitmask of
> CPUs until after one of the CPUs in the bitmap
> has been offlined?
One possibility would be if cpu-offline didn't remove the bit from mm_cpumask() because the CPU had entered lazy state earlier, or something like that. Then mm_cpumask() would contain an offline CPU and the WARN could trigger. I don't _think_ we do that, but I didn't check.