Kuniyuki Iwashima wrote:
> From: "Jiayuan Chen" <jiayuan.c...@linux.dev>
> Date: Thu, 01 May 2025 06:22:17 +0000
> > 2025/5/1 12:42, "Kuniyuki Iwashima" <kun...@amazon.com> wrote:
> > > From: Jiayuan Chen <jiayuan.c...@linux.dev>
> > > Date: Thu, 1 May 2025 11:51:08 +0800
> > > > For some services we are using "established-over-unconnected" model.
> > > >
> > > > '''
> > > > // create unconnected socket and 'listen()'
> > > > srv_fd = socket(AF_INET, SOCK_DGRAM)
> > > > setsockopt(srv_fd, SO_REUSEPORT)
> > > > bind(srv_fd, SERVER_ADDR, SERVER_PORT)
> > > >
> > > > // 'accept()'
> > > > data, client_addr = recvmsg(srv_fd)
> > > >
> > > > // create a connected socket for this request
> > > > cli_fd = socket(AF_INET, SOCK_DGRAM)
> > > > setsockopt(cli_fd, SO_REUSEPORT)
> > > > bind(cli_fd, SERVER_ADDR, SERVER_PORT)
> > > > connect(cli, client_addr)
> > > > ...
> > > > // do handshake with cli_fd
> > > > '''
> > > >
> > > > This programming pattern simulates accept() using UDP, creating a new
> > > > socket for each client request. The server can then use separate sockets
> > > > to handle client requests, avoiding the need to use a single UDP socket
> > > > for I/O transmission.
> > > >
> > > > But there is a race condition between the bind() and connect() of the
> > > > connected socket:
> > > > We might receive unexpected packets belonging to the unconnected socket
> > > > before connect() is executed, which is not what we need.
> > > > (Of course, before connect(), the unconnected socket will also receive
> > > > packets from the connected socket, which is easily resolved because
> > > > upper-layer protocols typically require explicit boundaries, and we
> > > > receive a complete packet before creating a connected socket.)
> > > >
> > > > Before this patch, the connected socket had to filter requests at recvmsg
> > > > time, acting as a dispatcher to some extent. With this patch, we can
> > > > consider the bind and connect operations to be atomic.
> > >
> > > SO_ATTACH_REUSEPORT_EBPF is what you want.
> > >
> > > The socket won't receive any packets until the socket is added to
> > > the BPF map.
> > >
> > > No need to reinvent a subset of BPF functionalities.
> >
> > I think this feature is for selecting one socket, not filtering out certain
> > sockets.
> >
> > Does this mean that I need to first capture all sockets bound to the same
> > port, and then if the kernel selects a socket that I don't want to receive
> > packets on, I'll need to implement an algorithm in the BPF program to
> > choose another socket from the ones I've captured, in order to avoid
> > returning that socket?
>
> Right.
>
> If you want a set of sockets to listen on the port, you can implement
> as such with BPF; register the sockets to the BPF map, and if kernel pick
> up other sockets and triggers the BPF prog, just return one of the
> registerd sk.
>
> Even when you have connect()ed sockets on the same port, kernel will
> fall back to the normal scoring to find the best one, and it's not a
> problem as the last 'result' is one selected by BPF or a connected sk,
> and the packet won't be routed to not-yet-registered unconnected sk.
>
> > This looks like it completely bypasses the kernel's built-in scoring
> > logic. Or is expanding BPF_PROG_TYPE_SK_REUSEPORT to have filtering
> > capabilities also an acceptable solution?
Reuseport BPF exists because we want to avoid having to keep adding
custom rules in C for each scenario.

In this case, I did wonder whether it is possible to keep the standard
reuseport algorithm in reuseport_select_sock_by_hash from picking the
soon-to-be-connected socket. Setting SO_INCOMING_CPU to a CPU on which
no packets arrive will lower its priority relative to other sockets.
It's a bit of a hack, but should work?