Re: per-FIB socket binding

Mark Johnston Mon, 13 Jan 2025 11:00:14 -0800

On Sun, Jan 12, 2025 at 07:17:48AM +0000, Paul Vixie wrote:
> On Saturday, January 11, 2025 4:51:07 PM UTC Mark Johnston wrote:
> > On Sat, Jan 11, 2025 at 06:25:22AM +0000, Paul Vixie wrote:
> > > ... the SYN|ACK will always use the FIB from the interface where
> > > the SYN arrived (this is in tcp_syncache.c).
> > 
> > This isn't clear to me.  The initial SYN will create a syncache entry in the
> > 
> >   if (tp->t_state == TCPS_LISTEN && SOLISTENING(so))
> > 
> > case in tcp_input_with_port().  In this case we define inc.inc_fibnum =
> > so->so_fibnum, i.e., the FIB of the listening socket.  Then,
> > syncache_add() copies inc to sc_inc, so sc_inc.inc_fibnum for the
> > syncache entry comes from the listening socket, and syncache_respond()
> > sets the SYN|ACK mbuf FIB with M_SETFIB(m, sc->sc_inc.inc_fibnum).  What
> > am I missing?  (Yes, I should actually do an experiement to check the
> > behaviour.)
> 
> this is exactly my understanding of that code. the FIB for the SYN|ACK will 
> be 
> the one from the interface who received the SYN.


What I'm saying is that today, the FIB for the SYN|ACK will be that of
the listening socket, not that of the receiving interface.
sc->sc_inc.inc_fibnum comes from the listening socket, not the SYN.

I did a little test with a pair of epairs and a VNET jail to confirm
this.  If there is a listening socket on 0.0.0.0:8080 in FIB 0, and a
SYN to x.x.x.x:8080 arrives from FIB 1, the SYN|ACK will be sent using
FIB 0's routing table.

I'm not claiming that this behaviour is correct or desireable, just that
it's what happens today.  With your patch applied, this behaviour is
clearly wrong.

> it's only later after we 
> receive the ACK that completes the 3-way handshake that we will begin to use 
> the FIB of the socket. in terms of your earlier question, this means if we're 
> going to drop a SYN because it came in on the wrong interface (some form of 
> uRPF) then the firewall module (PF or IPFW) will be doing that to the SYN 
> well 
> before a syncache entry is ever created. this seems to imply that the changes 
> i'm proposing won't be able to permit connections to complete that would have 
> not have completed -- which you correctly predicted could be surprising.
> 
> the kernel implementations of uRPF-like features are unaffected.
> 
>                 /*
>                  * net.inet.ip.rfc1122_strong_es: the address matches, verify
>                  * that the packet arrived via the correct interface.
>                  */
>                 if (__predict_false(strong_es && ia->ia_ifp != ifp)) {
>                         IPSTAT_INC(ips_badaddr);
>                         goto bad;
>                 }
> 
> we're not changing the inbound packet's FIB.
> 
>                 /*
>                  * net.inet.ip.source_address_validation: drop incoming
>                  * packets that pretend to be ours.
>                  */
>                 if (V_ip_sav && !(ifp->if_flags & IFF_LOOPBACK) &&
>                     __predict_false(in_localip_fib(ip->ip_src, ifp->if_fib))) 
>                                         
>       {
>                         IPSTAT_INC(ips_badaddr);
>                         goto bad;
>                 }
> 
> unless i misunderstand, this only rejects packets if the source address is 
> the 
> same as one of the addresses of the interface it arrives on, whereas the 
> documentation made me expect that a source address which was any address of 
> any interface would be the trigger. i think my proposal doesn't change this.
> 
> therefore my focus was on the verrevpath, versrcreach, and antispoof features 
> of IPFW; and the antispoof feature of PF. in those tools, the inbound SYN 
> would be stopped before it ever reached the SYN cache.
> 
> what i then observed is that the only change resulting from my proposal would 
> be to the transmission of subsequent segments, which would use the 
> interface's 
> FIB (if nonzero) rather than the socket's FIB (if zero). if upstream uRPF 
> would have dropped these segments because they're coming from the wrong 
> place, 
> then indeed failure would result, and even FIN and RST could not be 
> transmitted, which would mean the socket would die a slow death by eventual 
> timeout. my proposal would prevent this, but i think any surprise in this 
> case 
> would be a welcome one. if actual segments could not be transmitted because 
> the socket's FIB did not contain a route back to the connection's source, 
> then 
> failure would be more immediate but the resulting surprise (if any) no less 
> positive.
> 
> i hope that if that analysis is correct no-one will demand a socket option or 
> sysctl. to be effective at fixing path symmetry for multihoming, this logic 
> has to be the default.
> 
> ---
> 
> having finished with accept(), i'm now testing the bind() and bindat() 
> changes, which ought to be effective for any UDP responder who uses a 
> separate 
> socket for each interface address it speaks through, which certainly includes 
> all NTP and DNS servers i'm aware of, and may also help QUIC.
> 
> i've begun to wish for a setfib_fd(2) which would let sshd promote the FIB 
> from an inbound connection to be process-wide (after fork() before exec().) 
> that way "netstat -rn" would show the routing table actually being used by 
> the 
> stdin and stdout of that shell. or perhaps we could just rename SO_SETFIB to 
> be SO_FIB and allow getsockopt() to return the FIB? (i'm not sure why it was 
> "set only" in the current implementation.) opinions welcomed, as before. 

I think the second option is fine, given that we have setfib(2) already:

diff --git a/lib/libsys/getsockopt.2 b/lib/libsys/getsockopt.2
index 15e4b76311d8..56f408b7c918 100644
--- a/lib/libsys/getsockopt.2
+++ b/lib/libsys/getsockopt.2
@@ -175,7 +175,8 @@ for the socket
 .It Dv SO_PROTOTYPE Ta "SunOS alias for the Linux SO_PROTOCOL (get only)"
 .It Dv SO_ERROR Ta "get and clear error on the socket (get only)"
 .It Dv SO_RERROR Ta "enables receive error reporting"
-.It Dv SO_SETFIB Ta "set the associated FIB (routing table) for the socket 
(set only)"
+.It Dv SO_FIB Ta "get or set the associated FIB (routing table) for the socket"
+.It Dv SO_SETFIB Ta "set the associated FIB (routing table) for the socket"
 .El
 .Pp
 The following options are recognized in
@@ -359,6 +360,8 @@ or with the error
 if no data were received.
 .Pp
 .Dv SO_SETFIB
+and
+.Dv SO_FIB
 can be used to over-ride the default FIB (routing table) for the given socket.
 The value must be from 0 to one less than the number returned from
 the sysctl
diff --git a/sys/kern/uipc_socket.c b/sys/kern/uipc_socket.c
index 61bd8a610353..3d1fb645c94d 100644
--- a/sys/kern/uipc_socket.c
+++ b/sys/kern/uipc_socket.c
@@ -4105,6 +4105,12 @@ sogetopt(struct socket *so, struct sockopt *sopt)
                        error = sooptcopyout(sopt, &optval, sizeof optval);
                        break;
 
+               case SO_FIB:
+                       SOCK_LOCK(so);
+                       optval = so->so_fibnum;
+                       SOCK_UNLOCK(so);
+                       goto integer;
+
                case SO_DOMAIN:
                        optval = so->so_proto->pr_domain->dom_family;
                        goto integer;
diff --git a/sys/sys/socket.h b/sys/sys/socket.h
index cc82d97e9dd5..55c5826744ed 100644
--- a/sys/sys/socket.h
+++ b/sys/sys/socket.h
@@ -165,6 +165,7 @@ typedef     __uintptr_t     uintptr_t;
 #define        SO_LISTENQLEN   0x1012          /* socket's complete queue 
length */
 #define        SO_LISTENINCQLEN        0x1013  /* socket's incomplete queue 
length */
 #define        SO_SETFIB       0x1014          /* use this FIB to route */
+#define        SO_FIB          SO_SETFIB       /* alias for getsockopt(2) */
 #define        SO_USER_COOKIE  0x1015          /* user cookie (dummynet etc.) 
*/
 #define        SO_PROTOCOL     0x1016          /* get socket protocol (Linux 
name) */
 #define        SO_PROTOTYPE    SO_PROTOCOL     /* alias for SO_PROTOCOL (SunOS 
name) */

Re: per-FIB socket binding

Reply via email to