Sage Enhancement Proposal - Valgrind & Sage integration: Hunting memory leaks
Forward: Please excuse the rambling - I have meant to take more time to write this up but had meetings all day and need to finish a couple other work related things. Certain points may have been made more than once and others which I should have been made were omitted by accident. Feel free to chip in corrections or ask about details - I plan to move this documents (shall the idea get embraced) into the wiki. If there is enough interest I am also willing to do a presentation/demo about the subject at Sage Days 6. This document might benefit from further proofreading, all spelling errors are mine. -------------- I. The Problem -------------- As bug #274 clearly shows Sage has a problem with memory leaks inside the pyrex code. Those bugs come in various flavors (I am using the valgrind terminology): - lost memory - indirectly lost memory - missing deallocations I am not going to go into details about that at the moment, but plan to add proper explanations once I start moving this to the wiki. Another interesting issue has been bug #454 (not resolved yet) which I will discuss below in detail. It illustrates a problem found when running "ModularSymbols(501,2).decomposition(3)" -------------------- II. Sage vs. Valgrind -------------------- Valgrind Sage is not as staight forward as one would hope. Sage's Cython code sits underneath python and the python interpreter itself has some issues when it is being valgrinded. There is currently no way to limit the valgrinding to the Cython code, so we have to devise some technics to work around that. 1) The diff-approach: Start sage with the valgrind option and quit immediately. Then start sage with the valgrind option again and run some piece of code. Quit sage and compare the two logs. This requires some experience to spot the interesting bits of the log. The log itself is also quite big (on my Linux x86-64 box with sage 2.8.2pre the start & quit log of sage is about 5.5MB. If you hit some additional problems it might easily be another couple megabytes An example is ticket #454 (slightly edited because the commands were issued as one single line and I would like to avoid wrapping): sage: get_memory_usage() m = ModularSymbols(501,2).decomposition(3) del m ModularSymbols_clear_cache() get_memory_usage() 159.16015625 185.04296875 sage: get_memory_usage() m = ModularSymbols(501,2).decomposition(3) del m ModularSymbols_clear_cache() get_memory_usage() 185.04296875 190.796875 sage: get_memory_usage() m = ModularSymbols(501,2).decomposition(3) del m ModularSymbols_clear_cache() get_memory_usage() 190.796875 195.05859375 A valgrind run results in the following output (just the result): ==14358== LEAK SUMMARY: ==14358== definitely lost: 3,297,622 bytes in 194,267 blocks. ==14358== indirectly lost: 703,345 bytes in 4,603 blocks. ==14358== possibly lost: 387,022 bytes in 984 blocks. ==14358== still reachable: 153,105,982 bytes in 1,028,503 blocks. ==14358== suppressed: 0 bytes in 0 blocks. Compared to the empty run you see that there are some problems: ==2609== LEAK SUMMARY: ==2609== definitely lost: 2,500 bytes in 1 blocks. ==2609== possibly lost: 276,902 bytes in 769 blocks. ==2609== still reachable: 130,182,544 bytes in 159,833 blocks. ==2609== suppressed: 0 bytes in 0 blocks. 2) Finding meta leaks vs. finding specific leaks: A meta leak is a leak caused by an algorithm which can be compose of many specific leaks. It is usually easier to spot specific leaks because the code paths should be less complicated and therefore easier to deal with from the standpoint of complexity. Take the following code from ticket #274: K = GF(2^16, 'a') print type(K) R.<x> = K[] print type(R) s = K.random_element() def leak(n): m = get_memory_usage() for i in range(n): _ = R([1]) print get_memory_usage() - m leak(10000) The code illustrates a memory leak specific to a rather low level datastructure. ----------------------- III. Suggested Approach ----------------------- a) Measure the amount of memory used. Run your favorite piece of code repeatedly and print out the amount of memory consumed after each iteration. If it increases repeatedly open a ticket in the bug tracker. b) Method (a) can be used on algorithms as well as more simple code snippets. Tickets #454 and #274 illustrate those two different cases. c) Over time all subsystems and data structures should have some generic code that checks if repeated execution of the same command leaks memory. This is obviously a dream that might never come true, but at least one can hope. It is very important in my opinion that Sage cames as close to memory leak free as humanly possible because otherwise it might be difficult to run certain types of code. If Sage crashes due to a memory leak people will get a bad impression. The freeness of (major) memory leaks should also become a part of the review process. Correctness should come before features. d) There needs to be better documentation and also some tutorial on how to run sage in valgrind mode as well as interpret the result and most importantly how to avoid introducing memory leaks in the first place. This document is intended as a starting point to do that. If you look at many commercial as well as open source software projects you will be surprised how many of them don't run cleanly under valgrind. Sage should not be one of those programs. While I certainly don't wield any power in the Sage project to make people work on certain issues I hope that others will joins me on this journey. I am not very familiar with the pyrex code in Sage but somewhat an Idiot Savant with valgrind, so I hope that together we can get rid of those pesky memory leaks over time. ------------------------------------- IV. Detailed Analysis of ticket #454 ------------------------------------- The ModularSymbols(501,2).decomposition(3) command uses lots of linear algebra commands of which several do have leaks and other problems. So while the valgrind session reveals several interesting problems it is a good idea to decompose the meta leak into several specific leaks. More about this see (IV) For the complete log see http://sage.math.washington.edu/home/mabshoff/sage-2.8.1-ticket-454.valgrind (caution 6.5MB) Specifically this valgrind run reveals problems in the following areas: - with the dense matrix code: ==14358== 1,409,576 bytes in 176,197 blocks are definitely lost in loss record 7,571 of 7,579 ==14358== at 0x4A05809: malloc (vg_replace_malloc.c:149) ==14358== by 0x95A2697: __gmpz_init (in /tmp/Work2/sage-2.8.1/ sage-2.8.1/local/lib/libgmp.so.3.4.1) ==14358== by 0x206BC4EF: __pyx_f_20matrix_integer_dense_20Matrix_integer_dense__zero_out_matrix (matrix_integer_dense.c:4622) ==14358== by 0x206EA824: __pyx_f_20matrix_integer_dense_20Matrix_integer_dense___init__ (matrix_integer_dense.c:3740) ==14358== by 0x459FB1: type_call (typeobject.c:436) ==14358== by 0x4156B2: PyObject_Call (abstract.c:1860) ==14358== by 0x480413: PyEval_EvalFrameEx (ceval.c:3775) ==14358== by 0x48627F: PyEval_EvalCodeEx (ceval.c:2831) ==14358== by 0x484243: PyEval_EvalFrameEx (ceval.c:3660) ==14358== by 0x48627F: PyEval_EvalCodeEx (ceval.c:2831) ==14358== by 0x4CFA40: function_call (funcobject.c:517) ==14358== by 0x4156B2: PyObject_Call (abstract.c:1860) - with the sparse matrix code: ==14358== 65,552 (63,520 direct, 2,032 indirect) bytes in 12 blocks are definitely lost in loss record 7,511 of 7,579 ==14358== at 0x4A05809: malloc (vg_replace_malloc.c:149) ==14358== by 0x2018411A: __pyx_tp_new_18matrix_modn_sparse_Matrix_modn_sparse (matrix_modn_sparse.c:2793) ==14358== by 0x45A59A: tp_new_wrapper (typeobject.c:4022) ==14358== by 0x4156B2: PyObject_Call (abstract.c:1860) ==14358== by 0x47D801: PyEval_CallObjectWithKeywords (ceval.c:3433) ==14358== by 0x20D2F169: __pyx_f_21matrix_integer_sparse_21Matrix_integer_sparse__mod_int_c (matrix_integer_sparse.c:6968) ==14358== by 0x20D2B7CB: __pyx_f_21matrix_integer_sparse_21Matrix_integer_sparse__mod_int (matrix_integer_sparse.c:6883) ==14358== by 0x4156B2: PyObject_Call (abstract.c:1860) ==14358== by 0x47D801: PyEval_CallObjectWithKeywords (ceval.c:3433) ==14358== by 0x27BB59AA: __pyx_f_4misc_matrix_rational_echelon_form_multimodular (misc.c:11850) ==14358== by 0x4156B2: PyObject_Call (abstract.c:1860) ==14358== by 0x47D801: PyEval_CallObjectWithKeywords (ceval.c:3433) - with the LinBox Glue code: ==14358== 132,536 (3,640 direct, 128,896 indirect) bytes in 13 blocks are definitely lost in loss record 7,482 of 7,579 ==14358== at 0x4A06019: operator new(unsigned long) (vg_replace_malloc.c:167) ==14358== by 0x1FCDA467: LinBox::MatrixHom::BlasBlackboxMAP<LinBox::Modular<double>, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> >, LinBox::MatrixContainerCategory::Container>::operator() (LinBox::BlasBlackbox<LinBox::Modular<double> >*&, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > const&, LinBox::Modular<double> const&, LinBox::MatrixContainerCategory::Container) (in /tmp/Work2/sage-2.8.1/ sage-2.8.1/local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FCDA685: void LinBox::MatrixHom::map<LinBox::Modular<double>, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > >(LinBox::BlasBlackbox<LinBox::Modular<double> >*&, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > const&, LinBox::Modular<double> const&) (in /tmp/Work2/sage-2.8.1/sage-2.8.1/ local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FD28331: LinBox::GivPolynomial<Integer>& LinBox::cia<LinBox::GivPolynomial<Integer>, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > >(LinBox::GivPolynomial<Integer>&, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > const&, LinBox::BlasEliminationTraits const&) (in /tmp/Work2/sage-2.8.1/ sage-2.8.1/local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FD28B97: LinBox::GivPolynomial<Integer>& LinBox::charpoly<LinBox::GivPolynomial<Integer>, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > >(LinBox::GivPolynomial<Integer>&, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > const&, LinBox::RingCategories::IntegerTag const&, LinBox::BlasEliminationTraits const&) (in /tmp/Work2/sage-2.8.1/ sage-2.8.1/local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FD28BD9: LinBox::GivPolynomial<Integer>& LinBox::charpoly<LinBox::GivPolynomial<Integer>, LinBox::UnparametricField<Integer>, LinBox::RingCategories::IntegerTag>(LinBox::GivPolynomial<Integer>&, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > const&, LinBox::RingCategories::IntegerTag const&, LinBox::HybridSpecifier const&) (in /tmp/Work2/sage-2.8.1/sage-2.8.1/local/lib/ liblinboxwrap.so.0.0.0) ==14358== by 0x1FD28C0D: LinBox::GivPolynomial<Integer>& LinBox::charpoly<LinBox::DenseMatrix<LinBox::UnparametricField<Integer> >, LinBox::GivPolynomial<Integer>, LinBox::HybridSpecifier>(LinBox::GivPolynomial<Integer>&, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > const&, LinBox::HybridSpecifier const&) (in /tmp/Work2/sage-2.8.1/sage-2.8.1/ local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FD28C39: LinBox::GivPolynomial<Integer>& LinBox::charpoly<LinBox::DenseMatrix<LinBox::UnparametricField<Integer> >, LinBox::GivPolynomial<Integer> >(LinBox::GivPolynomial<Integer>&, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > const&) (in / tmp/Work2/sage-2.8.1/sage-2.8.1/local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FC73567: linbox_integer_dense_charpoly (in /tmp/ Work2/sage-2.8.1/sage-2.8.1/local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1F98F30C: __pyx_f_6linbox_20Linbox_integer_dense__poly(_object*, _object*, _object*) (linbox.cpp:1121) ==14358== by 0x4156B2: PyObject_Call (abstract.c:1860) ==14358== by 0x47D801: PyEval_CallObjectWithKeywords (ceval.c:3433) In LinBox itself, even though I have never seen this in LinBox itself (yet): ==14358== 380,688 bytes in 27 blocks are indirectly lost in loss record 7,539 of 7,579 ==14358== at 0x4A06019: operator new(unsigned long) (vg_replace_malloc.c:167) ==14358== by 0x1FC9C896: __gnu_cxx::new_allocator<double>::allocate(unsigned long, void const*) (in /tmp/Work2/sage-2.8.1/sage-2.8.1/local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FC9C8BE: std::_Vector_base<double, std::allocator<double> >::_M_allocate(unsigned long) (in /tmp/Work2/ sage-2.8.1/sage-2.8.1/local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FC9C8F0: std::_Vector_base<double, std::allocator<double> >::_Vector_base(unsigned long, std::allocator<double> const&) (in /tmp/Work2/sage-2.8.1/sage-2.8.1/ local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FC9C975: std::vector<double, std::allocator<double> >::vector(unsigned long, double const&, std::allocator<double> const&) (in /tmp/Work2/sage-2.8.1/sage-2.8.1/local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FC9CA47: LinBox::DenseMatrixBase<double>::DenseMatrixBase(unsigned long, unsigned long) (in /tmp/Work2/sage-2.8.1/sage-2.8.1/local/lib/ liblinboxwrap.so.0.0.0) ==14358== by 0x1FC9CC07: LinBox::BlasMatrix<double>::BlasMatrix(int, int) (in /tmp/Work2/ sage-2.8.1/sage-2.8.1/local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FCCBDF9: LinBox::BlasBlackbox<LinBox::Modular<double> >::BlasBlackbox(LinBox::Modular<double> const&, unsigned long, unsigned long) (in /tmp/Work2/sage-2.8.1/sage-2.8.1/local/lib/ liblinboxwrap.so.0.0.0) ==14358== by 0x1FCDA487: LinBox::MatrixHom::BlasBlackboxMAP<LinBox::Modular<double>, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> >, LinBox::MatrixContainerCategory::Container>::operator() (LinBox::BlasBlackbox<LinBox::Modular<double> >*&, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > const&, LinBox::Modular<double> const&, LinBox::MatrixContainerCategory::Container) (in /tmp/Work2/sage-2.8.1/ sage-2.8.1/local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FCDA685: void LinBox::MatrixHom::map<LinBox::Modular<double>, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > >(LinBox::BlasBlackbox<LinBox::Modular<double> >*&, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > const&, LinBox::Modular<double> const&) (in /tmp/Work2/sage-2.8.1/sage-2.8.1/ local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FD28331: LinBox::GivPolynomial<Integer>& LinBox::cia<LinBox::GivPolynomial<Integer>, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > >(LinBox::GivPolynomial<Integer>&, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > const&, LinBox::BlasEliminationTraits const&) (in /tmp/Work2/sage-2.8.1/ sage-2.8.1/local/lib/liblinboxwrap.so.0.0.0) ==14358== by 0x1FD28B97: LinBox::GivPolynomial<Integer>& LinBox::charpoly<LinBox::GivPolynomial<Integer>, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > >(LinBox::GivPolynomial<Integer>&, LinBox::DenseMatrix<LinBox::UnparametricField<Integer> > const&, LinBox::RingCategories::IntegerTag const&, LinBox::BlasEliminationTraits const&) (in /tmp/Work2/sage-2.8.1/ sage-2.8.1/local/lib/liblinboxwrap.so.0.0.0) ==14358== and so on. You can see all the gory details in the log. Thoughts? Cheers, Michael --~--~---------~--~----~------------~-------~--~----~ To post to this group, send email to sage-devel@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/sage-devel URLs: http://sage.scipy.org/sage/ and http://modular.math.washington.edu/sage/ -~----------~----~----~----~------~----~------~--~---