On Tue, Jan 29, 2019 at 06:32:13PM -0800, Alexei Starovoitov wrote:
> On Tue, Jan 29, 2019 at 10:16:54AM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 28, 2019 at 01:56:24PM -0800, Alexei Starovoitov wrote:
> > > On Mon, Jan 28, 2019 at 10:24:08AM +0100, Peter Zijlstra wrote:
> > > >
> > > > Ah, but the loop won't be in the BPF program itself. The BPF program
> > > > would only have had the BPF_SPIN_LOCK instruction, the JIT then emits
> > > > code similar to queued_spin_lock()/queued_spin_unlock() (or calls to
> > > > out-of-line versions of them).
> > >
> > > As I said we considered exactly that, and such an approach has a lot of
> > > downsides compared with the helper approach.
> > > Pretty much every time a new feature is added we're evaluating whether
> > > it should be a new instruction or a new helper. 99% of the time we go
> > > with a new helper.
> >
> > Ah; it seems I'm confused on helper vs instruction. As in, I've no idea
> > what a helper is.
>
> A bpf helper is a normal kernel function that can be called from a bpf
> program. In assembler it's a direct function call.

Ah, it is what is otherwise known as a standard library.
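For illustration, this is roughly what that call model looks like from the
program side; only a sketch, assuming the bpf_spin_lock()/bpf_spin_unlock()
interface from this series plus libbpf-style map definitions, and the map,
struct, section and function names below are made up:

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	char LICENSE[] SEC("license") = "GPL";

	struct val {
		struct bpf_spin_lock lock;	/* lock lives inside the map value */
		long counter;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_ARRAY);
		__uint(max_entries, 1);
		__type(key, __u32);
		__type(value, struct val);
	} counters SEC(".maps");

	SEC("tc")
	int bump(struct __sk_buff *skb)
	{
		__u32 key = 0;
		struct val *v;

		v = bpf_map_lookup_elem(&counters, &key);
		if (!v)
			return 0;

		bpf_spin_lock(&v->lock);	/* from the program's view: a plain call */
		v->counter++;
		bpf_spin_unlock(&v->lock);

		return 0;
	}

The point being: the program only ever sees an ordinary function call, so
the verifier/JIT is free to turn it into whatever locking implementation
the kernel wants.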
> > > > There isn't anything that mandates the JIT uses the exact same
> > > > locking routines the interpreter does, is there?
> > >
> > > Sure. This bpf_spin_lock() helper can be optimized whichever way the
> > > kernel wants.
> > > Like the bpf_map_lookup_elem() call is _inlined_ by the verifier for
> > > certain map types.
> > > JITs don't even need to do anything. It looks like a function call
> > > from the bpf prog's point of view, but in JITed code it is a sequence
> > > of native instructions.
> > >
> > > Say tomorrow we find out that
> > >   bpf_prog->bpf_spin_lock()->queued_spin_lock()
> > > takes too much time, then we can inline the fast path of
> > > queued_spin_lock directly into the bpf prog and save the function
> > > call cost.
> >
> > OK, so then the JIT can optimize helpers. Would it not make sense to
> > have the simple test-and-set spinlock in the generic code and have the
> > JITs use arch_spinlock_t where appropriate?
>
> I think that's pretty much the same as what I have with qspinlock.
> Instead of taking a risk on how JIT writers implement the bpf_spin_lock
> optimization, I'm using qspinlock on architectures that are known to
> support it.

I see the argument for it...

> So instead of starting with a dumb test-and-set there will be the faster
> qspinlock from the start on x86, arm64 and a few other archs.
> Those are the archs we care about the most anyway. Other archs can take
> time to optimize it (if optimizations are necessary at all).
> In general, hacking JITs is much harder and more error prone than
> changing the core and adding helpers. Hence we avoid touching the JITs
> as much as possible.

So archs/JITs are not trivially able to override those helper functions?
Because, for example, ARM (32bit) doesn't do qspinlock, but its
arch_spinlock_t is _much_ better than a TAS lock.

> Like the map_lookup inlining optimization, we do it only when the JIT is
> on. And we do it purely in the generic core. See array_map_gen_lookup().
> We generate bpf instructions only to feed them into the JITs so they can
> replace them with native asm. That is much easier to implement correctly
> than if we were doing inlining in every JIT independently.

That seems prudent and fairly painful at the same time. I see why you
would want to share as much of it as possible, but being restricted to
BPF instructions seems very limiting.

Anyway, I'll not press the issue; but I do think that arch specific
helpers would make sense here.
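To make the arch-specific idea concrete, something along these lines is
what I have in mind; a sketch only, not a proposal patch. The
__bpf_spin_lock()/__bpf_spin_unlock() names and the CONFIG_BPF_ARCH_SPINLOCK
opt-in symbol are illustrative (CONFIG_QUEUED_SPINLOCKS is the existing
qspinlock config): use arch_spinlock_t where it fits in the 4-byte lock
word, and only fall back to a dumb test-and-set where it doesn't.

	#include <linux/atomic.h>
	#include <linux/build_bug.h>
	#include <linux/spinlock.h>
	#include <linux/bpf.h>		/* struct bpf_spin_lock: a single __u32 */

	#if defined(CONFIG_QUEUED_SPINLOCKS) || defined(CONFIG_BPF_ARCH_SPINLOCK)

	static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
	{
		arch_spinlock_t *l = (void *)lock;

		/* the arch lock must fit the 4-byte bpf lock word */
		BUILD_BUG_ON(sizeof(*l) != sizeof(__u32));
		arch_spin_lock(l);
	}

	static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
	{
		arch_spinlock_t *l = (void *)lock;

		arch_spin_unlock(l);
	}

	#else /* generic fallback: plain test-and-set on the 4-byte word */

	static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
	{
		atomic_t *l = (void *)lock;

		do {
			atomic_cond_read_relaxed(l, !VAL);	/* wait until it looks free */
		} while (atomic_xchg(l, 1));			/* then try to take it */
	}

	static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
	{
		atomic_t *l = (void *)lock;

		atomic_set_release(l, 0);
	}

	#endif

With something like that, ARM (32bit) could opt in with its ticket lock
instead of being stuck with the test-and-set fallback.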