On 01/14/2017 02:48 AM, Jakub Libosvar wrote:
> recently I noticed we got oom-killer in action in one of our jobs [1].
> Any other ideas?

I spent quite a while chasing down similar things with centos a while
back, so I do have some ideas :)  The symptom is probably that mysql
gets chosen by the OOM killer, but it's unlikely to be mysql's fault;
it's just big and a good target.

If the system is going offline, I added the ability to turn on the
netconsole in devstack-gate with [1].  As the comment mentions, you can
put little tests that stream data into /dev/kmsg and they will
generally get off the host, even if ssh has been killed (a tiny example
of such a probe is at the end of this mail).  I found this very useful
for getting the initial oops data; I've used it several times for other
gate oopses, including other kernel issues we've seen.

To start pinning down what is really consuming the memory, the first
thing I did was write a peak-memory usage tracker that gave me stats on
memory growth during the devstack run [2] (the gist of it is sketched
at the end of this mail too).  You have to enable this with
"enable_service peakmem_tracker".  This gives you the big picture of
where the memory is going.

At this point you should have a rough idea of the real cause, and
you're going to want to start dumping /proc/<pid>/smaps of target
processes to get an idea of where the memory they're allocating is
going, or at the very least what libraries might be involved (a small
smaps aggregator is sketched at the end as well).

The next step is going to depend on what you need to target ...  If
it's python, it can get a bit tricky to see where the memory is going,
but there are a number of approaches.  At the time, despite it being
mostly unmaintained, I had some success with guppy [3].  In my case,
for example, I managed to hook into swift's wsgi startup and run that
under guppy, which let me get some heap stats.  From my notes [4] that
looked something like

---
import signal
import sys

from guppy import hpy
# the rest of this mirrors swift's bin/swift-object-server startup
from swift.common.utils import parse_options
from swift.common.wsgi import run_wsgi
from swift.obj import server


def handler(signum, frame):
    # dump guppy's heap statistics when we get SIGUSR1
    with open('/tmp/heap.txt', 'w+') as f:
        f.write("testing\n")
        hp = hpy()
        f.write(str(hp.heap()))


if __name__ == '__main__':
    conf_file, options = parse_options()
    signal.signal(signal.SIGUSR1, handler)
    sys.exit(run_wsgi(conf_file, 'object-server',
                      global_conf_callback=server.global_conf_callback,
                      **options))
---

There are of course other tools, from gdb to malloc tracers, etc.  But
that was enough that I could try different things and compare the heap
usage (kick the handler with "kill -USR1 <pid>" and read
/tmp/heap.txt).

Once you've got the smoking gun ... well, then the hard work of fixing
it starts :)  In my case it was pycparser, and we came up with a good
solution [5].

Hopefully those are some useful tips ... #openstack-infra can of course
help with holding vms etc. as required.

-i

[1] http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/devstack-vm-gate-wrap.sh#n438
[2] https://git.openstack.org/cgit/openstack-dev/devstack/tree/tools/peakmem_tracker.sh
[3] https://pypi.python.org/pypi/guppy/
[4] https://etherpad.openstack.org/p/oom-in-rax-centos7-CI-job
[5] https://github.com/eliben/pycparser/issues/72
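P.S. A few rough sketches of the above, in python; treat them as the
shape of the idea rather than finished tools.  First, the kind of
"little test" that streams to /dev/kmsg: a single write lands in the
kernel ring buffer and so goes out the netconsole even when ssh is
already dead ("memprobe" is just a made-up tag here, and you need root
to write to /dev/kmsg):

---
# grab MemAvailable and push it into the kernel ring buffer, where
# netconsole will carry it off the host
with open('/proc/meminfo') as mi:
    avail = [l.strip() for l in mi if l.startswith('MemAvailable:')][0]
with open('/dev/kmsg', 'w') as kmsg:
    # each write() must be a single line / record
    kmsg.write('memprobe: %s\n' % avail)
---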
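Similarly, the gist of the peak-memory tracker (the real thing is the
shell script at [2]; the poll interval and output file here are
arbitrary): poll /proc/meminfo, and whenever available memory hits a
new low, snapshot the process table so you can see who was big at the
worst moment so far:

---
import subprocess
import time


def mem_available_kb():
    # current MemAvailable in kB from /proc/meminfo
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1])


low_water = None
while True:
    avail = mem_available_kb()
    if low_water is None or avail < low_water:
        low_water = avail
        # process table, biggest RSS first, at the low-water mark
        top = subprocess.check_output(
            ['ps', '--sort=-rss', '-eo', 'rss,pid,args'])
        with open('/tmp/peakmem.txt', 'w') as f:
            f.write('MemAvailable low water: %d kB\n' % avail)
            f.write(top.decode())
    time.sleep(5)
---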
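And for the smaps step, something along these lines (not from my
original notes) aggregates Pss per mapping for one process, which tells
you which libraries or anonymous regions the memory is charged to; run
it as root or as the owner of the process.  Summing Pss rather than Rss
avoids double-counting pages shared between workers:

---
import collections
import sys


def pss_by_mapping(pid):
    # sum the Pss of every mapping in /proc/<pid>/smaps, keyed by the
    # backing file (or "[anon]" for anonymous mappings)
    sizes = collections.defaultdict(int)
    name = '[anon]'
    with open('/proc/%s/smaps' % pid) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            if fields[0].endswith(':'):
                # attribute line, e.g. "Pss:  1234 kB"
                if fields[0] == 'Pss:':
                    sizes[name] += int(fields[1])
            else:
                # mapping header: "addr perms offset dev inode [path]"
                name = fields[5] if len(fields) > 5 else '[anon]'
    return sizes


if __name__ == '__main__':
    sizes = pss_by_mapping(sys.argv[1])
    for name, kb in sorted(sizes.items(), key=lambda kv: -kv[1])[:20]:
        print('%8d kB  %s' % (kb, name))
---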