On Mon, Feb 18, 2013 at 15:45 +0100, Michael wrote:
> Hi all,
>
> after having a somewhat weird problem for a while now I hope someone can
> help me. _Sorry_ for the really lengthy mail but it is kind of complex
> to describe.
>
> dmesg and other information can be found at the end.
>
> The problem in short:
> Server keeps crashing hard (even ddb won't respond) after a more or less
> random time when using a GRE tunnel inside IPsec (transport mode).
>
> Elaboration:
> The setup consists of 3 OpenBSD boxes. One running OpenBSD 5.1 and the
> other two OpenBSD 5.2 (upgrading from 5.1 to 5.2 didn't fix the issue).
> Each box is directly connected to the internet with a public IP in a
> different physical location (1x US, 2x DE).
>
> All 3 boxes are connected with an IPsec tunnel, like a triangle (a<->b,
> b<->c, a<->c). Inside the IPsec tunnel is a GRE tunnel with OSPF on top
> for dynamic routing.
>
> Two of those systems have a softraid0 crypto partition running, the third
> one doesn't. (More on why that might be important later).
>
> When all 3 boxes are powered up everything is working perfectly fine,
> but after some random interval (can be minutes, can be days) one or two
> of the boxes crash, showing the ddb console but not letting me type
> anything in.
>
> When the 2 boxes are rebooted, the game starts anew.
>
> In case only one of the boxes crashed, I can by now predict a 99% chance
> that the second one will crash shortly after the first one was fully
> rebooted.
>
> Now, it is only ever 1 or 2 boxes that crash and so far it never has
> been the box WITHOUT the softraid0 crypto volume.
>
> Out of curiosity I also created a crypto volume on the third box and put
> it to some use (squid cache partition) and sure enough, now the third box
> sometimes crashed too.
>
> When doing some tests (with only two systems using a crypto
> partition) I also noticed that there are no crashes at all if there is
> only a single IPsec tunnel active between two of the boxes (one box with
> crypto partition, the other without) with GRE encrypted inside, and the
> other GRE tunnels are unencrypted.
>
> To not play around too much with the production systems I tried to
> replicate the issue with 3 VirtualBox VMs and the latest OpenBSD 5.3
> snapshot, but VirtualBox instantly throws a GURU MEDITATION ERROR
> whenever I try to push a file (1 MB is enough) over an encrypted GRE
> tunnel using scp or netcat from one machine to the other. When I turn off
> IPsec, the transfer works, no crashing.
>
> I only have console access to two of the boxes and whenever a system
> crashes, it displays a very short message which is always a little
> different, the only consistent part is the mention of "Stopped at
> __mp_lock". That system is running OpenBSD 5.1, bsd.mp.
>
> I hope someone has an idea what might be going on.
>
> Thanks,
> Michael
>
> PS: If someone wants to play around with my three VirtualBox test VMs
> you can download them here:
>
> https://ssl.bsdhost.eu/owncloud/public.php?service=files&t=17b7472546546a617a3358ef9d953a4c
>

hi,

there appear to be some spls missing in the net/if_gre.c code.
netinet/ip_gre.c looks almost sane (gre_input should be called
at splsoftnet from ip_input), yet gre_usrreq calls rip_usrreq
that doesn't do splsoftnet itself (yuck!). reminds me of the
recent raw_usrreq change.

anyways, both diffs are attached. please try them and see if
they help out.

cheers

diff --git sys/net/if_gre.c sys/net/if_gre.c
index 7a9eeee..84f0f0e 100644
--- sys/net/if_gre.c
+++ sys/net/if_gre.c
@@ -679,12 +679,15 @@ void
 gre_keepalive(void *arg)
 {
 	struct gre_softc *sc = arg;
+	int s;
 
 	if (!sc->sc_ka_timout)
 		return;
 
 	sc->sc_ka_state = GRE_STATE_DOWN;
+	s = splnet();
 	gre_link_state(sc);
+	splx(s);
 }
 
 void
@@ -747,6 +750,8 @@ gre_send_keepalive(void *arg)
 void
 gre_recv_keepalive(struct gre_softc *sc)
 {
+	int s;
+
 	if (!sc->sc_ka_timout)
 		return;
 
@@ -762,7 +767,9 @@ gre_recv_keepalive(struct gre_softc *sc)
 	case GRE_STATE_HOLD:
 		if (--sc->sc_ka_holdcnt < 1) {
 			sc->sc_ka_state = GRE_STATE_UP;
+			s = splnet();
 			gre_link_state(sc);
+			splx(s);
 		}
 		break;
 	case GRE_STATE_UP:
diff --git sys/netinet/raw_ip.c sys/netinet/raw_ip.c
index 61285a8..050529f 100644
--- sys/netinet/raw_ip.c
+++ sys/netinet/raw_ip.c
@@ -396,7 +396,7 @@ int
 rip_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf *nam,
     struct mbuf *control, struct proc *p)
 {
-	int error = 0;
+	int s, error = 0;
 	struct inpcb *inp = sotoinpcb(so);
 #ifdef MROUTING
 	extern struct socket *ip_mrouter;
@@ -410,6 +410,7 @@ rip_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf *nam,
 		goto release;
 	}
 
+	s = splsoftnet();
 	switch (req) {
 
 	case PRU_ATTACH:
@@ -532,6 +533,7 @@ rip_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf *nam,
 		/*
 		 * stat: don't bother with a blocksize.
 		 */
+		splx(s);
 		return (0);
 
 	/*
@@ -556,6 +558,7 @@ rip_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf *nam,
 	default:
 		panic("rip_usrreq");
 	}
+	splx(s);
 release:
 	if (m != NULL)
 		m_freem(m);
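
for anyone not familiar with spl(9): the pattern both diffs add (raise the level, do the work, restore with splx) can be sketched in userspace roughly like this. it is only a toy model and not kernel code; every sim_* name and the three levels are made up for illustration, and a plain integer stands in for the interrupt priority level.

```c
#include <assert.h>

/* Hypothetical priority levels for the sketch; a real kernel has more. */
enum sim_ipl { SIM_IPL_NONE = 0, SIM_IPL_SOFTNET = 1, SIM_IPL_NET = 2 };

/* Current simulated interrupt priority level. */
static int sim_cur_ipl = SIM_IPL_NONE;

/* Raise the level to at least `want` and return the previous level. */
static int
sim_splraise(int want)
{
	int old = sim_cur_ipl;

	if (want > sim_cur_ipl)
		sim_cur_ipl = want;
	return (old);
}

/* Restore a level previously returned by sim_splraise(). */
static void
sim_splx(int old)
{
	sim_cur_ipl = old;
}

/*
 * Stand-in for a routine that expects to run at SIM_IPL_NET or above.
 * Returns 1 if it was called with enough protection, 0 otherwise.
 */
static int
sim_link_state(void)
{
	return (sim_cur_ipl >= SIM_IPL_NET);
}

/* Caller following the pattern from the diff: raise, call, restore. */
static int
sim_keepalive_protected(void)
{
	int s, ok;

	s = sim_splraise(SIM_IPL_NET);
	ok = sim_link_state();
	sim_splx(s);
	return (ok);
}

/* Caller without the raise/restore pair, as in the unpatched code path. */
static int
sim_keepalive_unprotected(void)
{
	return (sim_link_state());
}

/* Small self-check that exercises both callers. */
static void
sim_demo(void)
{
	assert(sim_keepalive_unprotected() == 0);
	assert(sim_keepalive_protected() == 1);
	/* The level is back where it started after the protected call. */
	assert(sim_cur_ipl == SIM_IPL_NONE);
}
```

calling sim_demo() from a main() runs the checks. the point of saving and restoring the old level instead of hardcoding one is that the same function stays correct when the caller already runs at an equal or higher level.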