On Fri, May 24, 2019 at 05:51 PM CEST, John Fastabend wrote: > Jakub Sitnicki wrote: >> >> Now that those pesky crashes are gone, we plan to look into drops when >> doing echo with sockmap. Marek tried running echo-sockmap [1] with >> latest bpf-next (plus mentioned crash fixes) and reports that not all >> data bounces back: >> >> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c >> 971832 >> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c >> 867352 >> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c >> 952648 >> >> I'm tring to turn echo-sockmap into a selftest but as you can probably >> guess over loopback all works fine. >> > > Right, sockmap when used from recvmsg with redirect is lossy. This > was a design choice I made that apparently caught a few people > by surprise. The original rationale for this was when doing a > multiplex operation, e.g. single ingress socket to many egress > sockets blocking would cause head of line blocking on all > sockets. To resolve this I simply dropped the packet and then allow > the flow to continue. This pushes the logic up to the application > to do retries, etc. when this happens. FWIW userspace proxies I > tested also had similar points where they fell over and dropped > packets. In hind sight though it probably would have made more > sense to make this behavior opt-in vs the default. But, the > use case I was solving at the time I wrote this could handle > drops and was actually a NxM sockets with N ingress sockets and M > egress sockets so head of line blocking was a real problem. > > Adding a flag to turn this into a blocking op has been on my > todo list for awhile. Especially when sockmap is being used as > a single ingress to single egress socket then blocking vs dropping > makes much more sense. > > The place to look is in sk_psock_verdict_apply() in __SK_REDIRECT > case there is a few checks and notice we can fallthrough to a > kfree_skb(skb). This is most likely the drops you are hitting. > Maybe annotate it with a dbg statement to check. > > To fix this we could have a flag to _not_ drop but enqueue the > packet regardless of the test or hold it until space is > available. I even think sk_psock_strp_read could push back > on the stream parser which would eventually push back via TCP > and get the behavior you want. > > Also, I have a couple items on my TODO list that I'll eventually > get to. First we run without stream parsers in some Cilium > use cases. I'll push some patches to allow this in the next > months or so. This avoids the annoying stream parser prog that > simply returns skb->len. This is mostly an optimizations. A > larger change I want to make at some point is to remove the > backlog workqueue altogether. Originally it was added for > simplicity but actually causes some latency spikes when > looking at 99+ percentiles. It really doesn't need to be > there it was a hold over from some original architecture that > got pushed upstream. If you have time and want to let me know > if you would like to tackle removing it.
This is great stuff. Thanks for explaining the sockmap's design and decisions behind it. The opt-in blocking mode idea is spot on. I imagine we'll get back to sockmap once we have a replacement for TPROXY figured out (unrelated to sockmap). Let's sync then. -Jakub