On Mon, Feb 18, 2013 at 15:45 +0100, Michael wrote:
> Hi all,
> 
> after having a somewhat weird problem for a while now I hope someone can
> help me. _Sorry_ for the really lengthy mail but it is kind of complex
> to describe.
> 
> dmesg and other information can be found at the end.
> 
> The problem in short:
> Server keeps crashing hard (even ddb won't respond) after a more or less
> random time when using a GRE tunnel inside IPsec (transport mode).
> 
> Elaboration:
> The setup consists of 3 OpenBSD boxes. One running OpenBSD 5.1 and the
> other two OpenBSD 5.2 (upgrading from 5.1 to 5.2 didn't fix the issue).
> Each box is directly connected to the internet with a public IP in a
> different physical location (1x US, 2x DE).
> 
> All 3 boxes are connected with an IPsec tunnel, like a triangle (a<->b,
> b<->c, a<->c). Inside the IPsec tunnel is a GRE tunnel with OSPF on top
> for dynamic routing.
> 
> Two of those systems got a softraid0 crypto partition running, the third
> one doesn't. (More on why that might be important later).
> 
> When all 3 boxes are powered up everything is working perfectly fine,
> but after some random interval (can be minutes, can be days) one or two
> of the boxes crash, showing the ddb console but not letting me type
> anything in.
> 
> When the 2 boxes are rebooted, the game starts anew.
> 
> In case only one of the boxes crashed, I can by now predict a 99% chance
> that the second one will crash shortly after the first one was fully
> rebooted.
> 
> Now, it is only ever 1 or 2 boxes that crash and so far it never has
> been the box WITHOUT the softraid0 crypto volume.
> 
> Out of curiosity I also created a crypto volume on the third box and put
> it to some use (squid cache partition) and sure enough, now the third box
> sometimes crashed too.
> 
> When doing some tests (with only having two systems using a crypto
> partition) I also noticed that there are no crashes at all if there is
> only a single IPsec tunnel active between two of the boxes (one box with
> crypto partition, the other without) and GRE encrypted inside and the
> other GRE tunnels are unencrypted.
> 
> To not play around too much with the production systems I tried to
> replicate the issue with 3 VirtualBox VM and the latest OpenBSD 5.3
> snapshot, but VirtualBox instantly throws a GURU MEDITATION ERROR
> whenever I try to push a file (1 MB is enough) over an encrypted GRE
> tunnel using scp or netcat from one machine to the other. When I turn off
> IPsec, the transfer works, no crashing.
> 
> I only have console access to two of the boxes and whenever a system
> crashes, it displays a very short message which is always a little
> different, the only consistent part is the mentioning of "Stopped at
> __mp_lock". That system is running OpenBSD 5.1, bsd.mp.
> 
> I hope someone has an idea what might be going on.
> 
> Thanks,
> Michael
> 
> PS: If someone wants to play around with my three VirtualBox test VMs
> you can download them here:
> 
> https://ssl.bsdhost.eu/owncloud/public.php?service=files&t=17b7472546546a617a3358ef9d953a4c
> 

hi,

there appears to be some spls missing in net/if_gre.c code.
netinet/ip_gre.c looks almost sane (gre_input should be
called at splsoftnet from ip_input), yet gre_usrreq calls
rip_usrreq that doesn't do splsoftnet itself (yuck!).
reminds me of the recent raw_usrreq change.  anyways, both
diffs are attached.  please try them and see if they help
out.
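
for anyone not used to the spl idiom, here is a rough userland sketch
of the raise/call/restore pattern the diffs below add.  it is only a
model, not kernel code: the model_* names, the cur_ipl variable and the
IPL numbering are all made up for illustration.

```c
#include <assert.h>

/* Hypothetical model of interrupt priority levels; values are made up. */
enum { IPL_NONE = 0, IPL_SOFTNET = 1, IPL_NET = 2 };

static int cur_ipl = IPL_NONE;

/* Raise the level (never lower it) and return the previous one. */
static int
model_splraise(int level)
{
	int old = cur_ipl;

	if (level > cur_ipl)
		cur_ipl = level;
	return (old);
}

/* Restore a level that was saved by model_splraise(). */
static void
model_splx(int old)
{
	assert(old >= IPL_NONE && old <= IPL_NET);
	cur_ipl = old;
}

/* Stand-in for gre_link_state(): returns 1 if it ran at IPL_NET or above. */
static int
model_link_state(void)
{
	return (cur_ipl >= IPL_NET);
}

/* Same shape as the patched gre_keepalive(): raise, call, restore. */
static int
model_keepalive(int timeout_enabled)
{
	int s, ok;

	if (!timeout_enabled)
		return (-1);	/* returns before raising, nothing to restore */

	s = model_splraise(IPL_NET);
	ok = model_link_state();
	model_splx(s);
	return (ok);
}
```

the point is that every path that raises the level has to restore it
before returning, which is why the rip_usrreq diff adds a splx() in
front of the early return as well as at the end of the switch.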

cheers

diff --git sys/net/if_gre.c sys/net/if_gre.c
index 7a9eeee..84f0f0e 100644
--- sys/net/if_gre.c
+++ sys/net/if_gre.c
@@ -679,12 +679,15 @@ void
 gre_keepalive(void *arg)
 {
        struct gre_softc *sc = arg;
+       int s;
 
        if (!sc->sc_ka_timout)
                return;
 
        sc->sc_ka_state = GRE_STATE_DOWN;
+       s = splnet();
        gre_link_state(sc);
+       splx(s);
 }
 
 void
@@ -747,6 +750,8 @@ gre_send_keepalive(void *arg)
 void
 gre_recv_keepalive(struct gre_softc *sc)
 {
+       int s;
+
        if (!sc->sc_ka_timout)
                return;
 
@@ -762,7 +767,9 @@ gre_recv_keepalive(struct gre_softc *sc)
        case GRE_STATE_HOLD:
                if (--sc->sc_ka_holdcnt < 1) {
                        sc->sc_ka_state = GRE_STATE_UP;
+                       s = splnet();
                        gre_link_state(sc);
+                       splx(s);
                }
                break;
        case GRE_STATE_UP:
diff --git sys/netinet/raw_ip.c sys/netinet/raw_ip.c
index 61285a8..050529f 100644
--- sys/netinet/raw_ip.c
+++ sys/netinet/raw_ip.c
@@ -396,7 +396,7 @@ int
 rip_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf *nam,
     struct mbuf *control, struct proc *p)
 {
-       int error = 0;
+       int s, error = 0;
        struct inpcb *inp = sotoinpcb(so);
 #ifdef MROUTING
        extern struct socket *ip_mrouter;
@@ -410,6 +410,7 @@ rip_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf *nam,
                goto release;
        }
 
+       s = splsoftnet();
        switch (req) {
 
        case PRU_ATTACH:
@@ -532,6 +533,7 @@ rip_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf *nam,
                /*
                 * stat: don't bother with a blocksize.
                 */
+               splx(s);
                return (0);
 
        /*
@@ -556,6 +558,7 @@ rip_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf *nam,
        default:
                panic("rip_usrreq");
        }
+       splx(s);
 release:
        if (m != NULL)
                m_freem(m);
