On Sat, 10 Apr 2010 11:13:11 -0400, micah anderson <mi...@debian.org> wrote:
> On Sat, 10 Apr 2010 12:17:51 +0100, Ben Hutchings <b...@decadent.org.uk> wrote:
> > On Fri, 2010-04-09 at 23:38 -0400, micah anderson wrote:
> > > On Sat, 10 Apr 2010 01:48:24 +0100, Ben Hutchings <b...@decadent.org.uk> wrote:
> > > > On Thu, 2010-04-08 at 12:41 -0400, micah anderson wrote:
> > > > > On 2010-04-08, micah anderson wrote:
> > > > > > On Wed, 2010-04-07 at 11:52 -0400, Micah Anderson wrote:
> > > > > > > Package: linux-image-2.6.32-2-amd64
> > > > > > > Version: 2.6.32-8~bpo50+1
> > > > > > > Severity: important
> > > > > > >
> > > > > > > I'm running a tor exit node on a kvm instance, it runs for a little
> > > > > > > while (between an hour and 3 days), doing 30-40mbit/sec and then
> > > > > > > suddenly 'swapper: page allocation failure' happens, and the entire
> > > > > > > networking stack of the kvm instance is dead. It stops responding on
> > > > > > > the net completely. No ping in or out, no traffic can be observed
> > > > > > > using tcpdump, the counters on the interface no longer change
> > > > > > > (although the interface stays up).
> > > > > > [...]
> > > > > >
> > > > > > It sounds like there might be a memory leak. Please send the contents
> > > > > > of /proc/meminfo and /proc/slabinfo from a 'normal' state and the
> > > > > > broken state.
> > > > >
> > > > > I noticed this time when it crashed something different that I had not
> > > > > seen in previous 2.6.30/2.6.26 kernels:
> > > > >
> > > > > [ 7962.841287] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
> > > > > [ 7962.841287] cache: kmalloc-1024, object size: 1024, buffer size: 1024, default order: 1, min order: 0
> > > > > [ 7962.841287] node 0: slabs: 606, objs: 4544, free: 0
> > > > >
> > > > > and then the normal:
> > > > > [ 7963.102476] swapper: page allocation failure. order:0, mode:0x4020
> > > > > [ 7963.105743] Pid: 0, comm: swapper Not tainted 2.6.32-bpo.2-amd64 #1
> > > > > [ 7963.106418] Call Trace:
> > > > > [ 7963.106418] <IRQ> [<ffffffff810b947d>] ? __alloc_pages_nodemask+0x55b/0x5ce
> > > > > etc.
> > > > >
> > > > > As requested here is a normal state /proc/meminfo and /proc/slabinfo.
> > > > > See below for the broken state
> > > > [...]
> > > >
> > > > There's no sign of a memory leak and there's actually much more free
> > > > memory in the broken state, perhaps because any network servers have
> > > > lost all their clients and freed session state. My guess is that the
> > > > driver just doesn't handle allocation failure gracefully. Which network
> > > > driver are you using in the guest?
> > >
> > > I started with virtio, but had a hunch that maybe switching to e100e
> > > might be more stable, but sadly both produce the same results.
> > [...]
> >
> > There's no such thing as e100e - Linux has e100, e1000 and e1000e
> > drivers; QEMU only emulates e1000. Please run lsmod inside the guest to
> > check what's really being used.
>
> Indeed... it looks like regardless if I have specified 'e100e' in the
> domain.xml, its ignoring that failure and providing me with:
>
> virtio_net 10433 0
> virtio 3277 5 virtio_rng,virtio_balloon,virtio_net,virtio_blk,virtio_pci
>
> The host is running:
>
> e1000e 109487 0
>
> So i guess I can try that...

Ok, I've been testing this for a couple of weeks now, and I can now say
with confidence that the virtio net driver seems to be the culprit. When
I run with the e1000e driver, I do not get this page allocation failure
at all. So that is a good work-around, but not a solution.

It seems as if Red Hat encountered and fixed this bug back in January:

https://bugzilla.redhat.com/show_bug.cgi?id=554078

micah
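
For anyone following along who wants to confirm which NIC driver a guest has really loaded (as opposed to what was requested in domain.xml), here is a minimal sketch that scans lsmod output for a few driver module names. The function name loaded_net_drivers and the candidate list are made up for illustration; the sample text mirrors the lsmod lines quoted above, with the usual lsmod header line added as an assumption.

```python
def loaded_net_drivers(lsmod_text, candidates=("virtio_net", "e1000", "e1000e", "e100")):
    """Return the candidate driver modules that appear in lsmod output.

    `candidates` is an illustrative list, not an exhaustive set of NIC drivers.
    """
    loaded = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header.
        if not fields or fields[0] == "Module":
            continue
        # The first column of each lsmod row is the module name.
        loaded.add(fields[0])
    return [name for name in candidates if name in loaded]


# Sample modelled on the guest output quoted in this thread.
sample = """\
Module                  Size  Used by
virtio_net             10433  0
virtio                  3277  5 virtio_rng,virtio_balloon,virtio_net,virtio_blk,virtio_pci
"""

print(loaded_net_drivers(sample))
```

In a real guest the text could come from running lsmod and capturing its output; matching on the exact module name in the first column avoids false hits from the "Used by" column, where virtio_net also appears.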