* Rik van Riel <r...@redhat.com> wrote:

> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel <r...@redhat.com> wrote:
> > 
> >>> I can understand people running hard-RT workloads not wanting to 
> >>> see the overhead of a timer tick or a scheduler tick with variable 
> >>> (and occasionally heavy) work done in IRQ context, but the jitter 
> >>> caused by a single trivial IPI with constant work should be very, 
> >>> very low and constant.
> >> 
> >> Not if the realtime workload is running inside a KVM guest.
> > 
> > I don't buy this:
> > 
> >> At that point an IPI, either on the host or in the guest, involves a 
> >> full VMEXIT & VMENTER cycle.
> > 
> > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> > usec on recent hardware, and I bet it will get better with time.
> > 
> > I'm not aware of any hard-RT workload that cannot take 1 usec 
> > latencies.
> 
> Now think about doing this kind of IPI from inside a guest, to
> another VCPU on the same guest.
> 
> Now you are looking at VMEXIT/VMENTER on the first VCPU,

Does it matter? It's not the hard-RT CPU, and this is a slowpath of 
synchronize_rcu().

> plus the cost of the IPI on the host, plus the cost of the emulation 
> layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI 
> work, and possibly a second VMEXIT/VMENTER for IPI completion.

Only the VMEXIT/VMENTER on the second VCPU matters to RT latencies.

> I suspect it would be better to do RCU callback offload in some 
> other way.

Well, it's not just about callback offload, but it's about the basic 
synchronization guarantee of synchronize_rcu(): that all RCU read-side 
critical sections have finished executing after the call returns.

So even if a nohz-full CPU never actually queues a callback, it needs 
to stop using resources that a synchronize_rcu() caller expects it to 
stop using.

We can do that only if we know it in an SMP-coherent way that the 
remote CPU is not in an rcu_read_lock() section.

Sending an IPI is one way to achieve that.

Or we could do that in the syscall path with a single store of a 
constant flag to a location in the task struct. We have a number of 
natural flags that get written on syscall entry, such as:

	pushq_cfi $__USER_DS			/* pt_regs->ss */

That goes to a constant location on the kernel stack. On return from 
system calls we could write 0 to that location.

So the remote CPU would have to do a read of this location. There are 
two cases:

 - If it's 0, then it has observed quiescent state on that CPU. (It 
   does not have to be atomics anymore, as we'd only observe the value 
   and MESI coherency takes care of it.)

 - If it's not 0 then the remote CPU is not executing user-space code 
   and we can install (remotely) a TIF_NOHZ flag in it and expect it 
   to process it either on return to user-space or on a context 
   switch.

This way, unless I'm missing something, reduces the overhead to a 
single store to a hot cacheline on return-to-userspace - which 
instruction if we place it well might as well be close to zero cost. 
No syscall entry cost. 
Slow-return cost only in the (rare) case of someone using 
synchronize_rcu().

Hm?

Thanks,

	Ingo
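
For illustration only, the two cases above could be modelled in plain userspace C roughly as follows. This is a sequential sketch, not kernel code: `struct cpu_state`, `FAKE_USER_DS`, `syscall_entry()`, `syscall_return()` and `remote_check()` are hypothetical names invented here, the `in_kernel` field merely stands in for the pt_regs->ss slot, and `nohz_flag` stands in for a remotely installed TIF_NOHZ. It also uses C11 atomics for simplicity, where the proposal itself relies on plain stores and cache coherency.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Any nonzero constant will do; it stands in for $__USER_DS. */
#define FAKE_USER_DS 0x2bu

/* Hypothetical per-CPU state, not a real kernel structure. */
struct cpu_state {
	atomic_uint in_kernel;    /* models pt_regs->ss: nonzero while in the kernel */
	atomic_bool nohz_flag;    /* models a remotely installed TIF_NOHZ */
	unsigned long qs_reports; /* quiescent states reported by the CPU itself */
};

enum qs_result {
	QS_OBSERVED, /* remote side saw 0: the CPU was in user space */
	QS_DEFERRED  /* remote side saw nonzero: flag installed, CPU reports later */
};

/* Syscall entry: the constant is written anyway, so no extra cost. */
static void syscall_entry(struct cpu_state *c)
{
	atomic_store(&c->in_kernel, FAKE_USER_DS);
}

/* Return to user space: slow path only if a remote CPU asked for it. */
static void syscall_return(struct cpu_state *c)
{
	if (atomic_exchange(&c->nohz_flag, false))
		c->qs_reports++;         /* process the deferred request */
	atomic_store(&c->in_kernel, 0);  /* the single store on the fast path */
}

/* What a synchronize_rcu() caller on another CPU would do. */
static enum qs_result remote_check(struct cpu_state *c)
{
	if (atomic_load(&c->in_kernel) == 0)
		return QS_OBSERVED;
	atomic_store(&c->nohz_flag, true);
	return QS_DEFERRED;
}
```

Calling remote_check() on a freshly zeroed `struct cpu_state` gives QS_OBSERVED; after syscall_entry() it gives QS_DEFERRED, and the following syscall_return() bumps `qs_reports` once. The sketch deliberately ignores the window between the flag check and the final store in syscall_return(); a real implementation would have to order those two steps against the remote side.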