Hi,
  Alexey and Herbert - thanks for the replies.

Alexey wrote:
> > similar things on a router - it has died 3 or 4 times (over a period
> > of a few months) with such an error with very little traffic passing
> > through it and a stream of the 'dst cache overflow' errors on the screen.
>
> Actually, it is quite unusual. The problems with garbage collection
> are all transient, essentially you do not see anything bad but
> some annoying messages.
>
> If the machine dies... Well, it cannot be the reason of death.
> I would even suspect that "dst cache overflow" was not reason of death,
> but rather a consequence.
>
> If the death means just a loss of network connectivity, it could
> mean that you experience a _true_ (not related to gc problems)
> dst cache overflow i.e. it happens because some part of kernel leaks
> dst cache entries. It is the first thing to check, see below.

I think the machine was still alive, but all it does is route, so there
wasn't too much to tell. It had certainly stopped routing (most?)
traffic about 10 hours before I got to it and was still very ill - so
it isn't a transient thing.

The router handles outgoing traffic and routing between two small
subnets (probably 200-ish IPs or so on each); it doesn't open any
connections itself and it isn't directly on the outside world.

One thought: the day before it fell into this state there had been a
minor screw-up on one of the networks, where someone mispatched two
subnets together (one of which was one of the ones connected to this
box). That may have caused a lot of arping and general unhappiness, but
it all seemed to resolve itself. I don't think similar problems had
happened before the previous failures.

> > a patch by Denis Lunev that is currently in one of the 2.6.13-pre's
> > ('Fix too aggressive backoff in dst garbage collection' git commit number
> > f0098f7863f814a5adc0b9cb271605d063cad7fa )
>
> It will not help, it is a transient problem.

OK.

> Plus run "ip route ls cache" periodically.

OK, I'll add that to some monitoring.
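
A minimal sketch of what such a periodic check could look like, in
Python. It counts the live entries reported by "ip route ls cache" and
subtracts them from the total entry count in /proc/net/stat/rt_cache,
which is assumed here to be the file rtstat reads. A gap that keeps
growing between samples would point at leaked dst entries. The function
names are illustrative, and the two output-format assumptions stated in
the docstrings should be checked against the kernel and iproute2
versions actually in use.

```python
import subprocess


def count_live_entries(ip_output: str) -> int:
    """Count cache entries in `ip route ls cache` output.

    Assumption: each entry starts on an unindented line and its
    attribute lines (the ones beginning with "cache") are indented.
    """
    return sum(
        1
        for line in ip_output.splitlines()
        if line.strip() and not line[0].isspace()
    )


def total_entries(rt_cache_text: str) -> int:
    """Read the 'entries' column from /proc/net/stat/rt_cache text.

    Assumption: line 1 is a header, later lines are per-CPU rows of
    hex values, and 'entries' is a global count repeated on each row.
    """
    lines = rt_cache_text.splitlines()
    header = lines[0].split()
    first_row = lines[1].split()
    return int(first_row[header.index("entries")], 16)


def sample_gap() -> int:
    """Return (all entries) - (live entries) on the running system."""
    ip_output = subprocess.run(
        ["ip", "route", "ls", "cache"],
        capture_output=True, text=True, check=True,
    ).stdout
    with open("/proc/net/stat/rt_cache") as f:
        rt_cache_text = f.read()
    return total_entries(rt_cache_text) - count_live_entries(ip_output)
```

Calling sample_gap() from cron every few minutes and logging the result
with a timestamp should be enough to see whether the gap creeps upward.
The two reads are not atomic, so small fluctuations are expected; only
a steady trend is interesting.
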

> The first thing, which you should watch is difference between
> number of entries shown by "ip route ls cache" (alive entries)
> and rtstat (it shows all, including lost ones).
>
> If the difference gradually grows with time, we definitely see a leakage.

OK.

> <explanation of route.c and dst.c>

Thanks for that explanation - it helps somewhat - one thing I was
confused by was why the timer mechanism for the garbage collection
was so elaborate; why does it do all that back off stuff and adjusting
itself? Why not just run at some fixed rate?

* Herbert Xu ([EMAIL PROTECTED]) wrote:
> Alexey Kuznetsov <[EMAIL PROTECTED]> wrote:
> >
> > Really bad overflow happens when lots of entries remain in use, because
> > someone forgot to release the references to dst cache entries.
> > It is the first thing to check.
>
> Yes. I once had a situation where a buggy user-land program held
> many sockets open each of which had ancient packets stuck in their
> receive queues. The result was a lot of dst entries hanging around.

Nod - I don't think it is that in this case because the machine
doesn't open any connections itself.

> In such cases checking /proc/slabinfo could be useful.

But I will try and remember that next time it goes or add it to the
monitoring scripts.

Thank you for your suggestions; if I'm unlucky you'll see a question
from me (with some more debug) in a month or two if it does it again!

Dave
-- 
 -----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K| Happy  \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/