Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

Ingo Molnar Fri, 01 May 2015 11:40:55 -0700

* Rik van Riel <[email protected]> wrote:

> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel <[email protected]> wrote:
> > 
> >>> I can understand people running hard-RT workloads not wanting to 
> >>> see the overhead of a timer tick or a scheduler tick with variable 
> >>> (and occasionally heavy) work done in IRQ context, but the jitter 
> >>> caused by a single trivial IPI with constant work should be very, 
> >>> very low and constant.
> >>
> >> Not if the realtime workload is running inside a KVM guest.
> > 
> > I don't buy this:
> > 
> >> At that point an IPI, either on the host or in the guest, involves a 
> >> full VMEXIT & VMENTER cycle.
> > 
> > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> > usec on recent hardware, and I bet it will get better with time.
> > 
> > I'm not aware of any hard-RT workload that cannot take 1 usec 
> > latencies.
> 
> Now think about doing this kind of IPI from inside a guest, to 
> another VCPU on the same guest.
> 
> Now you are looking at VMEXIT/VMENTER on the first VCPU,


Does it matter? It's not the hard-RT CPU, and this is a slowpath of 
synchronize_rcu().

> plus the cost of the IPI on the host, plus the cost of the emulation 
> layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI 
> work, and possibly a second VMEXIT/VMENTER for IPI completion.

Only the VMEXIT/VMENTER on the second VCPU matters to RT latencies.

> I suspect it would be better to do RCU callback offload in some 
> other way.

Well, it's not just about callback offload: it's about the basic 
synchronization guarantee of synchronize_rcu(), namely that all RCU 
read-side critical sections that were in progress when it was called 
have finished executing by the time the call returns.

So even if a nohz-full CPU never actually queues a callback, it needs 
to stop using resources that a synchronize_rcu() caller expects it to 
stop using.

We can do that only if we know, in an SMP-coherent way, that the 
remote CPU is not in an rcu_read_lock() section.

Sending an IPI is one way to achieve that.

Or we could do that in the syscall path with a single store of a 
constant flag to a location in the task struct. We have a number of 
natural flags that get written on syscall entry, such as:

        pushq_cfi $__USER_DS                    /* pt_regs->ss */

That goes to a constant location on the kernel stack. On return from 
system calls we could write 0 to that location.
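
As a rough user-space model of that idea (not kernel code), the sketch 
below uses C11 atomics. The struct cpu_state, its ss_slot member (a 
stand-in for the pt_regs->ss location), the MODEL_USER_DS constant and 
the model_*() helpers are all hypothetical names invented for this 
illustration; any nonzero constant would do for the model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative nonzero constant; stands in for the value pushed on entry. */
#define MODEL_USER_DS UINT64_C(0x2b)

/* Hypothetical per-CPU state: models the fixed kernel-stack slot. */
struct cpu_state {
	_Atomic uint64_t ss_slot;	/* 0: in user space, nonzero: in kernel */
};

/* Syscall entry: the slot receives a nonzero constant as a side effect. */
static inline void model_syscall_entry(struct cpu_state *c)
{
	atomic_store_explicit(&c->ss_slot, MODEL_USER_DS,
			      memory_order_relaxed);
}

/* Used by an observer on another CPU: is the target in kernel mode? */
static inline int model_in_kernel(struct cpu_state *c)
{
	return atomic_load_explicit(&c->ss_slot, memory_order_acquire) != 0;
}

/* Return to user space: a single store of 0 to the same slot. */
static inline void model_return_to_user(struct cpu_state *c)
{
	assert(model_in_kernel(c));	/* we can only return from kernel mode */
	/* Release: earlier kernel-mode accesses are ordered before the 0. */
	atomic_store_explicit(&c->ss_slot, 0, memory_order_release);
}
```

On x86 the release store and the acquire load in this model compile 
to plain MOV instructions, so the model adds no locked operation to 
either path.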

So the CPU running synchronize_rcu() would have to read this location 
on the remote CPU. There are two cases:

 - If it's 0, then it has observed a quiescent state on that CPU. (It 
   does not need atomic operations anymore, as we'd only observe the 
   value and MESI coherency takes care of it.)

 - If it's not 0 then the remote CPU is not executing user-space code 
   and we can install (remotely) a TIF_NOHZ flag in it and expect it 
   to process it either on return to user-space or on a context 
   switch.
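
The two cases can be sketched as a small user-space model in C11. 
Everything here is an assumption made for illustration: struct 
remote_cpu, its nohz_flag member (a stand-in for TIF_NOHZ), the 
qs_reports counter and the function names are hypothetical, not 
kernel interfaces. The sketch deliberately ignores the window in which 
the flag gets set just after the return path has already checked it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the nohz-full CPU being observed. */
struct remote_cpu {
	_Atomic uint64_t ss_slot;	/* 0: user space, nonzero: kernel */
	atomic_bool nohz_flag;		/* stand-in for TIF_NOHZ */
	_Atomic unsigned int qs_reports; /* quiescent states reported late */
};

enum qs_result { QS_OBSERVED, QS_DEFERRED };

/* Requester side: what the CPU waiting for a grace period would do. */
static inline enum qs_result requester_check(struct remote_cpu *c)
{
	/* Case 1: slot is 0, the target is in user space: quiescent. */
	if (atomic_load_explicit(&c->ss_slot, memory_order_acquire) == 0)
		return QS_OBSERVED;

	/* Case 2: target is in the kernel, ask it to report on exit. */
	atomic_store_explicit(&c->nohz_flag, true, memory_order_release);
	return QS_DEFERRED;
}

/* Target side: kernel entry writes a nonzero constant to the slot. */
static inline void remote_enter_kernel(struct remote_cpu *c)
{
	atomic_store_explicit(&c->ss_slot, UINT64_C(0x2b),
			      memory_order_relaxed);
}

/* Target side: slow path only when flagged, then the single store of 0. */
static inline void remote_return_to_user(struct remote_cpu *c)
{
	assert(atomic_load_explicit(&c->ss_slot, memory_order_relaxed) != 0);

	if (atomic_load_explicit(&c->nohz_flag, memory_order_acquire)) {
		atomic_store_explicit(&c->nohz_flag, false,
				      memory_order_relaxed);
		atomic_fetch_add_explicit(&c->qs_reports, 1,
					  memory_order_relaxed);
	}
	atomic_store_explicit(&c->ss_slot, 0, memory_order_release);
}
```

In this model the common return path is one flag load plus one store; 
the counter update only runs when a requester has set the flag.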

This approach, unless I'm missing something, reduces the overhead to a 
single store to a hot cacheline on return-to-userspace - an instruction 
which, if we place it well, might be close to zero cost. No syscall 
entry cost. Slow-return cost only in the (rare) case of someone using 
synchronize_rcu().

Hm?

Thanks,

        Ingo
