On Fri, Jun 08, 2018 at 10:41:37AM +0200, Kristian Evensen wrote:
> Hi,
>
> On Wed, Jun 6, 2018 at 6:03 PM, Tobias Hommel <netdev-l...@genoetigt.de> wrote:
> > Sorry no progress until now, I currently do not get time to have a
> > deeper look into that. We're back to 4.1.6 right now.
>
> Thanks for letting me know. In the project I am currently involved in,
> we unfortunately don't have the option of reverting the kernel, so we
> are finding ways to live with the error. We have been looking into the
> error a bit more, and have made the following observations:
>
> * First of all, as discussed earlier in the thread, the error is
> triggered by dst_orig being NULL. Our current work-around is just to
> return from xfrm_lookup if dst_orig is NULL and this seems to work
> fine, the error doesn't happen that often (in our use-cases at least).
> * The machine we use for testing (and where we first saw the error) is
> used as initiator.

The machine where I encountered the bug is a "roadwarrior gateway", so it
only serves as a responder.

> * When we compare the logs from Strongswan with the ones from the
> kernel, it seems that the error is typically triggered when a tunnel
> is torn down/about to come up. We need quite a lot of tunnels for
> the error to trigger, usually around 30+. I guess this might point to
> some race or some condition not being met when packets are
> sent/received.
> * We see the error much more frequently when hardware encryption is
> enabled.
> * Yesterday, we upgraded the kernel from 4.14.34 to 4.14.48, and the
> error happens much less frequently. I see that 4.14.48 includes
> several IPsec fixes (for example the previously mentioned ("xfrm: Fix
> a race in the xdst pcpu cache.")).
>
> BR,
> Kristian
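
For anyone following along, the work-around from the first bullet (return
early when dst_orig is NULL) can be sketched in plain user-space C. This is
not the kernel code: struct dst_entry is a reduced stand-in, do_lookup and
guarded_lookup are hypothetical names, and returning NULL from the guard is
an assumption, since the thread does not say what the real work-around
returns.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, hypothetical stand-in for the kernel's struct dst_entry. */
struct dst_entry {
    int use_count;
};

/* Hypothetical stand-in for the part of the lookup that dereferences
 * dst_orig; with a NULL pointer this is where the crash would occur. */
struct dst_entry *do_lookup(struct dst_entry *dst_orig)
{
    dst_orig->use_count += 1;
    return dst_orig;
}

/* Guarded variant: bail out before dst_orig is dereferenced.
 * Returning NULL is an assumption made for this sketch only. */
struct dst_entry *guarded_lookup(struct dst_entry *dst_orig)
{
    if (dst_orig == NULL)
        return NULL;
    return do_lookup(dst_orig);
}

/* Small self-check exercising both the guarded and the normal path. */
void demo(void)
{
    struct dst_entry d = { 0 };

    assert(guarded_lookup(NULL) == NULL); /* guard taken, no dereference */
    assert(guarded_lookup(&d) == &d);     /* valid entry is passed through */
    assert(d.use_count == 1);             /* do_lookup ran exactly once */
}
```

Such a guard only avoids the NULL dereference; it says nothing about why
dst_orig ends up NULL in the first place.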