Georgi, sorry for misspelling your name. - Marc

On Tuesday, September 3, 2024 at 12:56:49 PM UTC-5 Marc Culler wrote:

That is a pretty interesting exchange; it indicates that this memory leak has been known to exist for 6 years, and the remedy is also discussed there. Details and answers to questions asked in this thread follow.

To answer Dima's question first, the Pari documentation that I am referring to is the User's Guide to the PARI library (libpari.pdf). Now for the details ...

The Pari GEN constructed by pari.ellinit is what Pari calls a "lazy vector". These are discussed in section 12.3, which begins by saying: "The functions in this section are used to implement ell structures and analogous objects, which are vectors some of whose components are initialized to dummy values, later computed on demand." The only example known to me of a lazy vector is the GEN constructed by the ellinit function.

A "Technical Note" in that section contains more detail: "In the current implementation, S is a t_VEC with d + 1 entries. The first d components are ordinary t_GEN entries, which you can read or assign to in the customary way. But the last component gel(S, d + 1), a t_VEC of length n initialized to zerovec(n), must be handled in a special way: you should never access or modify its components directly, only through the API we are about to describe."

You can see an example of this behavior in Sage as follows:

sage: import cypari2
sage: pari = cypari2.Pari()
sage: e = pari.ellinit([11, 0]); e
[0, 0, 0, 11, 0, 0, 22, 0, -121, -528, 0, -85184, 1728, Vecsmall([1]), [Vecsmall([64, -1])], [0, 0, 0, 0, 0, 0, 0, 0]]

The last entry of e is a t_VEC containing 8 references to the PARI zero GEN, created by obj_init. The ellrootno function fills in this vector, replacing the zero placeholders by pointers to Pari GENs on the heap. To see this in Sage:

sage: pari.ellrootno(e)
1
sage: e
[0, 0, 0, 11, 0, 0, 22, 0, -121, -528, 0, -85184, 1728, Vecsmall([1]), [Vecsmall([64, -1])], [0, 0, 0, 0, 0, [3872, 4, [2, 5; 11, 2], [[5, 3, 0, 2], [2, 3, 0, 2]]], [1, Vecsmall([-1, 1])], [[2, 11]~]]]

To answer Giorgi's question, one can observe that, while ellrootno *usually* fills in the lazy components, sometimes it doesn't need to:

sage: e = pari.ellinit([2**70, 0]); e
[0, 0, 0, 1180591620717411303424, 0, 0, 2361183241434822606848, 0, -1393796574908163946345982392040522594123776, -56668397794435742564352, 0, -105312291668557186697918027683670432318895095400549111254310977536, 1728, Vecsmall([1]), [Vecsmall([64, -1])], [0, 0, 0, 0, 0, 0, 0, 0]]
sage: pari.ellrootno(e)
1
sage: e
[0, 0, 0, 1180591620717411303424, 0, 0, 2361183241434822606848, 0, -1393796574908163946345982392040522594123776, -56668397794435742564352, 0, -105312291668557186697918027683670432318895095400549111254310977536, 1728, Vecsmall([1]), [Vecsmall([64, -1])], [0, 0, 0, 0, 0, [32, 4, Mat([2, 5]), [[5, -7, 0, 4]]], [1, Vecsmall([-1])], [[2]~, [131072, 0, 0, 0], [0, 0, 0, 4, 0, 0, 8, 0, -16, -192, 0, -4096, 1728, Vecsmall([1]), [Vecsmall([64, -1])], [0, 0, 0, 0, 0, 0, 0, 0]]]]]

(Note that the vector in the last entry is still all zeroes.)

In any case, the heap GENs added lazily by ellrootno cause the memory leak. If the PARI GEN managed by the Python Gen e is passed to gunclone (as happens in the Gen.__dealloc__ method), those added heap GENs are not freed.

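One way to watch this happening from Python is to bracket the calls with pari.getheap(). The following is only a sketch, using cypari2 as in the session above; it assumes the first entry returned by getheap is the object count on the PARI heap, as in the getheap output quoted further down in this thread:

import gc
import cypari2

pari = cypari2.Pari()

def heap_count():
    # pari.getheap() returns a pair; the first entry counts objects on the
    # PARI heap, the second their total size
    return int(pari.getheap()[0])

base = heap_count()

e = pari.ellinit([11, 0])
pari.ellrootno(e)   # side effect: clones new GENs onto the heap and stores
                    # them in the lazy slots of e
del e               # drop the wrapper; its own GEN is cleaned up by the
                    # wrapper's normal memory management
gc.collect()

# ...but the GENs parked in the lazy slots are never freed, so the heap stays
# above the baseline (about three extra objects per curve in the runs quoted
# further down in this thread)
print(heap_count() - base)
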
However, the documented API addresses this issue via the function obj_free:

"void obj_free(GEN S) destroys all clones stored in the n tagged components, and replace them by the initial value 0. The regular entries of S are unaffected, and S remains a valid object. This is used to avoid memory leaks."

Thus the fix for this memory leak (now implemented in CyPari) is to track which Python Gen objects are managing "lazy vector" GENs, and to modify the Gen.__dealloc__ method to make it call obj_free before calling gunclone for those objects. Unfortunately, if obj_free is passed a GEN which is not a lazy vector, i.e. was not initialized by obj_init, then a segfault results. So the tracking is essential. In CyPari I added a cdef field to the Gen class which is used to track whether the GEN is a lazy vector. I had to modify the auto_gen code to allow for setting that flag for all such GENs. (Currently that means only the GENs produced by calling ellinit. I couldn't find any others, although obj_init does seem to be called by rnfinit as well.)

- Marc

On Monday, September 2, 2024 at 12:13:43 PM UTC-5 dim...@gmail.com wrote:

It appears that all these gunclone_deep() etc. aren't documented in any proper way (and this is a root cause of this; the same applies to memory leaks we have in the libsingular interface). Can you post a link to the docs you are reading?

I can only find an email exchange from 2018 where similar issues are discussed. It seems that gunclone_deep should be used sparingly. The last message, from Jeroen Demeyer, in that thread says:

> On 2018-08-20 18:14, Karim Belabas wrote:
>> In libpari code, it's usually easy to explicitly call a destructor [e.g. obj_free] any time a function such as 'ellinit' was used
>
> It's not easy if the list of functions "such as 'ellinit'" is not known.

<https://pari.math.u-bordeaux.fr/archives/pari-dev-1808/msg00012.html>

On 2 September 2024 03:30:14 BST, Marc Culler <marc....@gmail.com> wrote:

Unsurprisingly, changing gunclone to gunclone_deep in Gen.__dealloc__ is not a full fix. It stops this leak, but it also causes a crash when running the doctests. On macOS the crash error is "freeing a pointer which was not allocated". On Linux the error is "corrupted size vs. prev_size in fastbins".

I won't force this list to participate in the full debugging experience. The basic problem has been identified. The remedy is far from clear.

- Marc

On Sun, Sep 1, 2024, 5:37 PM Marc Culler <marc....@gmail.com> wrote:

This is now fixed in cypari. The fix consisted of replacing gunclone by gunclone_deep in the Gen.__dealloc__ method. The same change would probably help in cypari2, although I would not expect it to fix the other memory leaks reported in issue #112.

Here is the explanation. The Pari GEN returned by pari.ellinit is of type t_VEC with entries of various types, including t_INT and t_VECSMALL. The function pari.ellrootno returns a Python int, but it has side effects. It also inserts items into the t_VEC GEN representing the elliptic curve. That t_VEC GEN is managed by a Python Gen object, and that Python Gen object would pass the t_VEC GEN to gunclone in its __dealloc__ method. However, that was not sufficient to cause the new items which were inserted into the t_VEC GEN by pari.ellrootno to also be uncloned. Consequently, those extra GENs remained on the Pari heap forever. Once the Python object was gone, there was no event which would cause them to be destroyed.

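Since two independent reference counts are involved (Python's and the PARI heap's), a toy model in plain Python may help fix the picture. Nothing below touches PARI; every class and name is purely illustrative, standing in for the roles played by the heap clones, the lazy slots, obj_free, and the wrapper's __dealloc__:

class HeapGen:
    """Stands in for a cloned GEN on the PARI heap, with its own refcount."""
    live = set()                 # the simulated PARI heap

    def __init__(self, label):
        self.label = label
        self.refcount = 1
        HeapGen.live.add(self)

    def gunclone(self):
        # decrement the clone's refcount; free it when the count reaches zero
        self.refcount -= 1
        if self.refcount == 0:
            HeapGen.live.discard(self)


class EllStructure(HeapGen):
    """A 'lazy vector': the hidden last component holds slots filled on demand."""

    def __init__(self):
        super().__init__("ell structure")
        self.lazy_slots = []

    def ellrootno(self):
        # side effect: clone a new GEN onto the heap and park it in a lazy slot
        self.lazy_slots.append(HeapGen("lazy component"))
        return 1

    def obj_free(self):
        # analogue of PARI's obj_free: unclone the lazily added components and
        # reset the slots, leaving the structure itself valid
        for component in self.lazy_slots:
            component.gunclone()
        self.lazy_slots = []


class Gen:
    """Stands in for the Python wrapper; __del__ plays the role of __dealloc__."""

    def __init__(self, g, lazy_fix=False):
        self.g = g
        self.lazy_fix = lazy_fix

    def __del__(self):
        if self.lazy_fix:
            self.g.obj_free()    # the fix: release the lazy components first
        self.g.gunclone()        # plain gunclone never looks inside the slots


e = Gen(EllStructure())
e.g.ellrootno()
del e
print(len(HeapGen.live))         # 1: the lazily added component is stranded

HeapGen.live.clear()             # reset the simulated heap

e = Gen(EllStructure(), lazy_fix=True)
e.g.ellrootno()
del e
print(len(HeapGen.live))         # 0: nothing is left behind
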
If you read the Pari documentation you will "learn" that gunclone_deep is only useful in the context of the peculiar memory model used by gp. Well, apparently it is also useful in the context of the peculiar memory model used by cypari and cypari2.

I tested the cypari fix on Ubuntu Linux using the test function described earlier in this thread. With 10**8 iterations it never used over 8.4 MB of memory.

- Marc

On Sunday, September 1, 2024 at 12:07:26 PM UTC-6 Marc Culler wrote:

My experiments with cypari so far indicate the opposite of what I expected. (This is with one small but significant bug fix that I found along the way.) The main observations are:

(1) Without the call to pari.ellrootno there is no memory leak. The call to pari.getheap returns the same value before and after creating many elliptic curves in a for loop.

(2) Calling pari.ellrootno(e) in a loop, with the same elliptic curve e, also does not increase the Pari heap size.

(3) With both the call to pari.ellinit and the call to pari.ellrootno in the loop, every Python Gen object created in the loop gets deallocated and its Pari GEN gets uncloned with reference count 1, meaning it is removed from the heap. Nonetheless, the heap grows considerably during the loop.

I conclude that Pari is creating and cloning its own GENs (i.e. GENs not managed by any Python object) during the call to ellrootno and is not uncloning all of those GENs. But it only does that with a newly created curve.

- Marc

On Sunday, September 1, 2024 at 10:24:33 AM UTC-6 Marc Culler wrote:

I would say that the getheap behavior is a symptom of the same memory management bug(s). Also, it is not as simple as you suggest. It is not always true that every call to getheap leaves something on the heap:

>>> from cypari import pari
>>> pari.getheap()
[4, 41]
>>> pari.getheap()
[5, 58]
>>> pari.getheap()
[5, 58]
>>> pari.getheap()
[5, 58]

It is also not true that the cypari and cypari2 behaviors are equivalent, although they are similar. (Running the test loop with 10**5 iterations does not use anything close to 28 GB with cypari.) I believe that some, but definitely not all, of the memory management issues, namely the most glaring ones described in issue #112, were improved in cypari by removing the code which tries to keep wrapped Pari GENs on the stack instead of moving them to the heap.

The behavior where Pari GENs go on the heap and never get removed is intertwined with Python memory management. Python and the Pari heap use independent reference counting schemes. The __dealloc__ method of a Python Gen object calls gunclone to reduce the reference count of the Pari heap GEN which is referenced by the Python Gen object. However, as demonstrated in issue #112, cypari2 was creating other internal Python objects which held references to a Python Gen wrapping a vector or matrix entry. That prevented those Python Gens from being destroyed and therefore prevented the Pari GENs wrapped by those Python Gens from being removed from the Pari heap.

Apart from the initial call to gclone when a Python Gen is created, I haven't found any code within cypari which could increase the reference count of a Pari GEN on the Pari heap. Unless Pari sometimes increases that reference count, this would suggest that the core issue lies with Python reference counts. Evidently Python Gen objects are not being dealloc'ed because other Python objects hold references to them, and this is preventing Pari from removing GENs from its heap.

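A short script along the lines of observations (1)-(3) above makes the contrast easy to reproduce. This is a sketch only, using cypari as in the snippets above; cypari2 shows the same qualitative behavior, as the getheap numbers quoted below indicate, and the small per-call getheap artifact mentioned later in the thread only perturbs the counts by a few units:

from cypari import pari

# (1): ellinit alone in a loop; the heap count should not grow with the
# number of curves created
for a in range(1, 1001):
    e = pari.ellinit([a, 0])
print(pari.getheap())

# (2): the same curve, many ellrootno calls; the lazy slots are filled once,
# so the heap should stay essentially flat
e = pari.ellinit([11, 0])
for _ in range(1000):
    pari.ellrootno(e)
print(pari.getheap())

# (3): a fresh curve per iteration; each one leaves its lazily added GENs
# behind after its wrapper Gen is deallocated, roughly three heap objects
# per curve in the runs quoted below
for a in range(1, 1001):
    e = pari.ellinit([a, 0])
    pari.ellrootno(e)
print(pari.getheap())
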
On Saturday, August 31, 2024 at 1:11:45 PM UTC-5 dim...@gmail.com wrote:

On Sat, Aug 31, 2024 at 4:35 AM Marc Culler <marc....@gmail.com> wrote:
> As Dima says, and as the issue he mentions supports, the current cypari2 code which attempts to keep Pari Gens on the Pari stack as much as possible is badly broken. There are many situations where Python Gen objects cannot be garbage-collected after being destroyed. I am sure that is a big part of this problem. But I don't think it is the whole story.
>
> CyPari has returned to the older design which moves the Pari Gen wrapped by a Python Gen to the Pari heap when the Python object is created. This eliminates the leaks reported in cypari2 issue #112. But in this context, I am seeing 12 GB of memory (including several gigabytes of swap) in use after I do the following in ipython:
>
> In [1]: from cypari import *
> In [2]: def test(N):
>    ...:     for a in range(1, N):
>    ...:         e = pari.ellinit([a, 0])
>    ...:         m = pari.ellrootno(e)
> In [3]: %time test(10**5)
> CPU times: user 699 ms, sys: 38.3 ms, total: 737 ms
> Wall time: 757 ms
> In [4]: %time test(10**6)
> CPU times: user 7.47 s, sys: 392 ms, total: 7.86 s
> Wall time: 7.93 s
> In [5]: %time test(10**7)
> CPU times: user 1min 41s, sys: 6.62 s, total: 1min 47s
> Wall time: 1min 49s

The picture is very similar with cypari2. (With cypari2 one has to run

pari = Pari()

after the

from cypari2 import *

but otherwise it's the same code.)

You can inspect the Pari heap by calling pari.getheap(). Each iteration of the loop in test() leaves about 3 new objects on the heap, so after 10^5 calls you get around 300000 objects there (and with 10^6 calls, around 3 million are added).

The following is with cypari:

In [4]: %time test(10**5)
CPU times: user 1.46 s, sys: 85.8 ms, total: 1.55 s
Wall time: 1.55 s
In [5]: pari.getheap()
Out[5]: [300001, 14394782]
In [6]: %time test(10**6)
CPU times: user 14.9 s, sys: 756 ms, total: 15.7 s
Wall time: 15.7 s
In [7]: pari.getheap()
Out[7]: [3299999, 163655656]

With cypari2, similar:

In [9]: pari.getheap()  # 10^5
Out[9]: [299969, 14392931]

In [12]: pari.getheap()  # 10^6
Out[12]: [3299662, 163635286]

And gc.collect() does not do anything; in either case the Pari heap remains this big.

As well, with cypari, each call to pari.getheap() itself adds one object to the heap; a bug, I guess. (This does not happen with cypari2.)

In [14]: pari.getheap()
Out[14]: [3300004, 163655741]
In [15]: pari.getheap()
Out[15]: [3300005, 163655758]
In [16]: pari.getheap()
Out[16]: [3300006, 163655775]
In [17]: pari.getheap()
Out[17]: [3300007, 163655792]
In [18]: pari.getheap()
Out[18]: [3300008, 163655809]

Looks like a memory management bug in both cypari and cypari2.

Dima

On Thursday, August 29, 2024 at 1:19:05 PM UTC-5 dim...@gmail.com wrote:

It would be good to reproduce this with cypari2 alone. cypari2 is known to have similar kinds (?) of problems:
https://github.com/sagemath/cypari2/issues/112

On Thu, Aug 29, 2024 at 6:47 PM Nils Bruin <nbr...@sfu.ca> wrote:

On Thursday 29 August 2024 at 09:51:04 UTC-7 Georgi Guninski wrote:

> I observe that the following does not leak:
>
> E = EllipticCurve([5*13, 0])  # no leak
> rn = E.root_number()

How do you know that doesn't leak? Do you mean that repeated execution of those commands in the same session does not swell memory use?

> The size of the leak is suspiciously close to a power of two.

I don't think you can draw conclusions from that. Processes generally request memory in large blocks from the operating system, to amortize the high overhead of the operation. It may even be the case that 128 MB is the chunk size involved here! The memory allocated to a process by the operating system isn't a fully accurate measure of memory use within the process either: a heap manager can decide it's cheaper to request some new pages from the operating system than to reorganize its heap and reuse the fragmented space on it. I think for this loop, memory allocation consistently swells with repeated execution, so there probably really is something leaking. But given that it's not in GC-tracked objects on the Python heap, one would probably need valgrind information or a keen look at the code involved to locate where it's coming from.

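For what it's worth, the three quantities in question (GC-tracked Python objects, the PARI heap, and process memory) can be compared from inside the process with nothing but the standard library and cypari. This is a rough sketch; ru_maxrss is a high-water mark and its units differ between Linux (kilobytes) and macOS (bytes), so treat it only as an indicator:

import gc
import resource
from cypari import pari

def rss():
    # peak resident set size so far; a rough, monotone indicator only
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def test(N):
    # the loop used throughout this thread
    for a in range(1, N):
        e = pari.ellinit([a, 0])
        m = pari.ellrootno(e)

for n in (10**4, 10**5):
    test(n)
    gc.collect()
    # the number of GC-tracked Python objects stays roughly constant while
    # both the PARI heap count and the process memory keep growing
    print(n, len(gc.get_objects()), pari.getheap(), rss())
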