Re: [sage-devel] Re: Memory leak in |EllipticCurve([n,0]).root_number()| and problem in algebraic geometry

Marc Culler Tue, 03 Sep 2024 10:56:55 -0700

That is a pretty interesting exchange, it indicates that this memory leak 
has been known to exist for 6 years, and the remedy is also discussed 
there.  Details and answers to questions asked in this thread follow.


To answer Dima's question first, the Pari documentation that I am referring 
to is the User's Guide to the PARI library (libpari.pdf).  Now for the 
details ...

The Pari GEN constructed by pari.ellinit is what Pari calls a "lazy 
vector".  These are discussed in section 12.3, which begins by saying
"The functions in this section are used to implement ell structures and 
analogous objects, which are vectors some of whose components are 
initialized to dummy values, later computed on demand."
The only example known to me of a lazy vector is the GEN constructed by the
ellinit function.

A "Technical Note" in thst section contains more detail:
"In the current implementation, S is a t_VEC with d + 1 entries. The first 
d components are ordinary t_GEN entries, which you can read or assign to in 
the customary way. But the last component gel(S, d + 1), a t_VEC of length 
n initialized to zerovec(n), must be handled in a special way: you should 
never access or modify its components directly, only through the API we are 
about to describe."

You can see an example of this behavior in Sage as follows:

  sage: import cypari2
  sage: pari = cypari2.Pari()
  sage: e = pari.ellinit([11, 0]); e
  [0, 0, 0, 11, 0, 0, 22, 0, -121, -528, 0, -85184, 1728, Vecsmall([1]),
  [Vecsmall([64, -1])], [0, 0, 0, 0, 0, 0, 0, 0]]

The last entry of e is a t_VEC containing 8 references to the PARI Zero 
GEN, created by obj_init.  The ellrootno function fills in this vector, 
replacing the Zero placeholders by pointers to Pari GENS on the heap.  To 
see this in sage:

sage: pari.ellrootno(e)
1
sage: e
[0, 0, 0, 11, 0, 0, 22, 0, -121, -528, 0, -85184, 1728, Vecsmall([1]), 
[Vecsmall([64, -1])], [0, 0, 0, 0, 0, [3872, 4, [2, 5; 11, 2], [[5, 3, 0, 
2], [2, 3, 0, 2]]], [1, Vecsmall([-1, 1])], [[2, 11]~]]]

To answer Giorgi's question one can observe that, while ellrootno *usually* 
fills in the lazy components, sometimes it doesn't need to:
sage: e = pari.ellinit([2**70, 0]); e
[0, 0, 0, 1180591620717411303424, 0, 0, 2361183241434822606848, 0, 
-1393796574908163946345982392040522594123776, -56668397794435742564352, 0, 
-105312291668557186697918027683670432318895095400549111254310977536, 1728, 
Vecsmall([1]), [Vecsmall([64, -1])], [0, 0, 0, 0, 0, 0, 0, 0]]
sage: pari.ellrootno(e)
1
sage: e
[0, 0, 0, 1180591620717411303424, 0, 0, 2361183241434822606848, 0, 
-1393796574908163946345982392040522594123776, -56668397794435742564352, 0, 
-105312291668557186697918027683670432318895095400549111254310977536, 1728, 
Vecsmall([1]), [Vecsmall([64, -1])], [0, 0, 0, 0, 0, [32, 4, Mat([2, 5]), 
[[5, -7, 0, 4]]], [1, Vecsmall([-1])], [[2]~, [131072, 0, 0, 0], [0, 0, 0, 
4, 0, 0, 8, 0, -16, -192, 0, -4096, 1728, Vecsmall([1]), [Vecsmall([64, 
-1])], [0, 0, 0, 0, 0, 0, 0, 0]]]]]

(Note that the vector in the last entry is still all zeroes.)

In any case, the heap GENs added lazily by ellrootno cause the memory leak. 
 If the PARI GEN managed by the Python Gen e is passed to gunclone (as 
happens in the Gen.__dealloc__ method) those added heap GENs are not freed. 
 However, the documented API addresses this issue via the function obj_free:
"void obj_free(GEN S) destroys all clones stored in the n tagged 
components, and replace them by the initial value 0. The regular entries of 
S are unaffected, and S remains a valid object. This is used to avoid 
memory leaks."

Thus the fix for this memory leak (now implemented in CyPari) is to track 
which Python Gen objects are managing "lazy vector" GENs, and to modify the 
Gen.__dealloc__ method to make it call obj_free before calling gunclone for 
those objects.  Unfortunately, if obj_free is passed a GEN which is not a 
lazy vector, i.e. was not initialized by obj_init, then a segfault results. 
 So the tracking is essential.  In CyPari I added a cdef field to the Gen 
class which is used to track whether the GEN is a lazy vector.  I had to 
modify the auto_gen code to allow for setting that flag for all such GENs. 
 (Currently that means only the GENs produced by calling ellinit.  I 
couldn't find any others, although obj_init does seem to be called by 
rnfinit as well.)

- Marc

On Monday, September 2, 2024 at 12:13:43 PM UTC-5 dim...@gmail.com wrote:

> It appears that all these gunclone_deep() etc aren't documented in any 
> proper way (and this is a root cause of this; same applies to memleaks we 
> have in libsingular interface). Can you post a link to docs you are reading?
>
> I can only find an email exchange from 2018 where 
> similar issues are discussed.
> It seems that gunclone_deep should be used sparingly.
> The last message,from Jeroen Demeyer, in that thread says:
>
> > On 2018-08-20 18:14, Karim Belabas wrote: 
>
> In libpari code, it's usually easy to explicitly call a destructor [ e.g. 
> obj_free ] any time a function such as 'ellinit' was used 
>
> > It's not easy if the list of functions "such as 'ellinit'" is not known. 
>
> <https://pari.math.u-bordeaux.fr/archives/pari-dev-1808/msg00012.html> 
>
>
> On 2 September 2024 03:30:14 BST, Marc Culler <marc....@gmail.com> wrote:
>
>> Unsurprisingly, changing gunclone to gunclone_deep in Gen.__dealloc__ is 
>> not a full fix.  It stops this leak but it also causes a crash when running 
>> the doctests.  On macOS the crash error is "freeing a pointer which was not 
>> allocated".  On linux the error is "corrupted size vs. prev_size in 
>> fastbins".
>>
>> I won't force this list to participate in the full debugging experience. 
>> The basic problem has been identified.  The remedy is far from clear.
>>
>> - Marc
>>
>> On Sun, Sep 1, 2024, 5:37 PM Marc Culler <marc....@gmail.com> wrote:
>>
>>> This is now fixed in cypari.  The fix consisted of replacing gunclone by 
>>> gunclone_deep in the Gen.__dealloc__ method.  The same change would 
>>> probably help in cypari2, although I would not expect it to fix the other 
>>> memory leaks reported in issue #112.
>>>
>>> Here is the explanation.  The Pari GEN returned by pari.ellinit is of 
>>> type t_VEC with entries of various types including t_INT and t_VECSMALL.  
>>> The function pari.ellrootno returns a Python int, but it has side effects.  
>>> It also inserts items into the t_VEC GEN representing the elliptic curve.  
>>> That t_VEC GEN is managed by a Python Gen object, and that Python Gen 
>>> object would pass the t_VEC GEN to gunclone in its __dealloc method.  
>>> However, that was not sufficient to cause the new items which were inserted 
>>> into the t_VEC gen by pari.ellrootno to also be uncloned.  Consequently, 
>>> those extra GENs remained on the Pari heap forever.  Once the python object 
>>> was gone, there was no event which would cause them to be destroyed.
>>>
>>> If you read the Pari documentation you will "learn" that gunclone_deep 
>>> is only useful in the context of the peculiar memory model used by gp.  
>>> Well, apparently it is also useful in the context of the peculiar memory 
>>> model used by cypari and cypari2.
>>>
>>> I tested the cypari fix on linux ubuntu using the test function 
>>> described earlier in this thread.  With 10**8 iterations it never used over 
>>> 8.4MB of memory.
>>>
>>> - Marc
>>>
>>> On Sunday, September 1, 2024 at 12:07:26 PM UTC-6 Marc Culler wrote:
>>>
>>>> My experiments with cypari so far indicate the opposite of what I 
>>>> expected.  (This is with one small but significant bug fix that I found 
>>>> along the way.) The main observations are:
>>>>    (1) Without the call to pari.ellrootno there is no memory leak.  The 
>>>> call to pari.getheap returns the same value before and after creating many 
>>>> elliptic curves in a for loop.
>>>>   (2) Calling pari.ellrootno(e) in a loop, with the same elliptic curve 
>>>> e,  also does not increase the Pari heap size.
>>>>   (3) With both the call to pari.ellinit and the call to  
>>>> pari.ellrootno in the loop, every Python Gen object created in the loop 
>>>> gets deallocated and its Pari GEN gets uncloned with reference count 1, 
>>>> meaning it is removed from the heap. Nonetheless, the heap grows 
>>>> considerably during the loop.  
>>>>
>>>> I conclude that Pari is creating and cloning its own GENs (i.e. GENs 
>>>> not managed by any Python object) during the call to ellrootno and is not 
>>>> uncloning all of those GENs.  But it only does that with a newly created 
>>>> curve.
>>>>  
>>>>
>>>> - Marc
>>>> On Sunday, September 1, 2024 at 10:24:33 AM UTC-6 Marc Culler wrote:
>>>>
>>>>> I would say that the getheap behavior is a symptom of the same memory 
>>>>> management bug(s).  Also, it is not as simple as you suggest.  It is not 
>>>>> always true that every call to getheap leaves something on the heap:
>>>>>
>>>>> >>> from cypari import pari
>>>>> >>> pari.getheap()
>>>>> [4, 41]
>>>>> >>> pari.getheap()
>>>>> [5, 58]
>>>>> >>> pari.getheap()
>>>>> [5, 58]
>>>>> >>> pari.getheap()
>>>>> [5, 58]
>>>>>
>>>>> It is also not true that the cypari and cypari2 behaviors are 
>>>>> equivalent, although they are similar.  (Running the test loop with 10**5 
>>>>> iterations does not use anything close to 28 GB with cypari).  I believe 
>>>>> that some, but definitely not all, of the memory management issues -- 
>>>>> namely the most glaring ones described in issue #112 -- were improved in 
>>>>> cypari by removing the code which tries to keep wrapped Pari GENs on the 
>>>>> stack instead of moving them to the heap.
>>>>>
>>>>> The behavior where Pari GENs go on the heap and never get removed is 
>>>>> intertwined with python memory management.  Python and the Pari heap use 
>>>>> independent reference counting schemes.  The __dealloc__ method of a 
>>>>> Python 
>>>>> Gen object calls gunclone to reduce the reference count of the Pari heap 
>>>>> GEN which is referenced by the python Gen object.  However, as demonstred 
>>>>> in issue #112, cypari2 was creating other internal Python objects which 
>>>>> held references to a Python Gen wrapping a vector or matrix entry.  That 
>>>>> prevented those Python Gens from being destroyed and therefore prevented 
>>>>> the Pari GENs wrapped by those Python Gens from being removed from the 
>>>>> Pari 
>>>>> heap.
>>>>>
>>>>> Apart from the initial call to gclone when a Python Gen is created, I 
>>>>> haven't found any code within cypari which could increase the reference 
>>>>> count of a Pari GEN on the Pari heap.  Unless Pari sometimes increases 
>>>>> that 
>>>>> reference count, this would suggest that the core issue lies with Python 
>>>>> reference counts.  Evidently Python Gen objects are not being dealloc'ed 
>>>>> because other Python objects hold references to them, and this is 
>>>>> preventing Pari from removing GENs from its heap.
>>>>>
>>>>> On Saturday, August 31, 2024 at 1:11:45 PM UTC-5 dim...@gmail.com 
>>>>> wrote:
>>>>>
>>>>>> On Sat, Aug 31, 2024 at 4:35 AM Marc Culler <marc....@gmail.com> 
>>>>>> wrote: 
>>>>>> > 
>>>>>> > As Dima says, and as the issue he mentions supports, the current 
>>>>>> cypari2 code which attempts to keep Pari Gens on the Pari stack as much 
>>>>>> as 
>>>>>> possible is badly broken. There are many situations where Python Gen 
>>>>>> objects cannot be garbage-collected after being destroyed. I am sure 
>>>>>> that 
>>>>>> is a big part of this problem. But I don't think it is the whole story. 
>>>>>> > 
>>>>>> > CyPari has returned to the older design which moves the Pari Gen 
>>>>>> wrapped by a Python Gen to the pari heap when the python object is 
>>>>>> created. 
>>>>>> This eliminates the leaks reported in cypari2 issue #112. But in this 
>>>>>> context, I am seeing 12 GB of memory (including several gigabytes of 
>>>>>> swap) 
>>>>>> in use after I do the following in ipython: 
>>>>>> > 
>>>>>> > In [1]: from cypari import * 
>>>>>> > In [2]: def test(N): 
>>>>>> > ...: for a in range(1, N): 
>>>>>> > ...: e = pari.ellinit([a, 0]) 
>>>>>> > ...: m = pari.ellrootno(e) 
>>>>>> > In [3]: %time test(10**5) 
>>>>>> > CPU times: user 699 ms, sys: 38.3 ms, total: 737 ms 
>>>>>> > Wall time: 757 ms 
>>>>>> > In [4]: %time test(10**6) 
>>>>>> > CPU times: user 7.47 s, sys: 392 ms, total: 7.86 s 
>>>>>> > Wall time: 7.93 s 
>>>>>> > In [5]: %time test(10**7) 
>>>>>> > CPU times: user 1min 41s, sys: 6.62 s, total: 1min 47s 
>>>>>> > Wall time: 1min 49s 
>>>>>>
>>>>>> the picture is very similar to cypari2. 
>>>>>> (with cypari2, one has to run 
>>>>>>
>>>>>> pari=Pari() 
>>>>>>
>>>>>> after the 
>>>>>>
>>>>>> from cypari2 import * 
>>>>>>
>>>>>> but otherwise it's the same code) 
>>>>>>
>>>>>> You can inspect the Pari heap, by calling 
>>>>>> pari.getheap() 
>>>>>>
>>>>>> each call to test() gets you about 3 new objects per call on the 
>>>>>> heap, 
>>>>>> so after 10^5 calls you get around 300000 
>>>>>> objects there (and with 10^6 calls, around 3 million are added). The 
>>>>>> following is with cypari 
>>>>>>
>>>>>> In [4]: %time test(10**5) 
>>>>>> CPU times: user 1.46 s, sys: 85.8 ms, total: 1.55 s 
>>>>>> Wall time: 1.55 s 
>>>>>> In [5]: pari.getheap() 
>>>>>> Out[5]: [300001, 14394782] 
>>>>>> In [6]: %time test(10**6) 
>>>>>> CPU times: user 14.9 s, sys: 756 ms, total: 15.7 s 
>>>>>> Wall time: 15.7 s 
>>>>>> In [7]: pari.getheap() 
>>>>>> Out[7]: [3299999, 163655656] 
>>>>>>
>>>>>> With cypari2, similar: 
>>>>>>
>>>>>> In [9]: pari.getheap() # 10^5 
>>>>>> Out[9]: [299969, 14392931] 
>>>>>>
>>>>>> In [12]: pari.getheap() # 10^6 
>>>>>> Out[12]: [3299662, 163635286] 
>>>>>>
>>>>>>
>>>>>> And gc.collect() does not do anything, in either case, Pari heap 
>>>>>> remains this big. 
>>>>>>
>>>>>> As well, with cypari, a call to pari.getheap() adds 1 object there, a 
>>>>>> bug, I guess. 
>>>>>> (this does not happen with cypari2) 
>>>>>> In [14]: pari.getheap() 
>>>>>> Out[14]: [3300004, 163655741] 
>>>>>> In [15]: pari.getheap() 
>>>>>> Out[15]: [3300005, 163655758] 
>>>>>> In [16]: pari.getheap() 
>>>>>> Out[16]: [3300006, 163655775] 
>>>>>> In [17]: pari.getheap() 
>>>>>> Out[17]: [3300007, 163655792] 
>>>>>> In [18]: pari.getheap() 
>>>>>> Out[18]: [3300008, 163655809] 
>>>>>>
>>>>>> Looks like a memory management bug in both, cypari and cypari2. 
>>>>>>
>>>>>> Dima 
>>>>>>
>>>>>> > 
>>>>>> > - Marc 
>>>>>> > 
>>>>>> > On Thursday, August 29, 2024 at 1:19:05 PM UTC-5 dim...@gmail.com 
>>>>>> wrote: 
>>>>>> >> 
>>>>>> >> It would be good to reproduce this with cypari2 alone. 
>>>>>> >> cypari2 is known to have similar kind (?) of problems: 
>>>>>> >> https://github.com/sagemath/cypari2/issues/112 
>>>>>> >> 
>>>>>> >> 
>>>>>> >> On Thu, Aug 29, 2024 at 6:47 PM Nils Bruin <nbr...@sfu.ca> wrote: 
>>>>>> >> > 
>>>>>> >> > On Thursday 29 August 2024 at 09:51:04 UTC-7 Georgi Guninski 
>>>>>> wrote: 
>>>>>> >> > 
>>>>>> >> > I observe that the following does not leak: 
>>>>>> >> > 
>>>>>> >> > E=EllipticCurve([5*13,0]) #no leak 
>>>>>> >> > rn=E.root_number() 
>>>>>> >> > 
>>>>>> >> > 
>>>>>> >> > How do you know that doesn't leak? Do you mean that repeated 
>>>>>> execution of those commands in the same session does not swell memory 
>>>>>> use? 
>>>>>> >> > 
>>>>>> >> > 
>>>>>> >> > The size of the leak is suspiciously close to a power of two. 
>>>>>> >> > 
>>>>>> >> > 
>>>>>> >> > I don't think you can draw conclusions from that. Processes 
>>>>>> generally request memory in large blocks from the operating system, to 
>>>>>> amortize the high overhead in the operation. It may even be the case 
>>>>>> that 
>>>>>> 128 Mb is the chunk size involved here! The memory allocated to a 
>>>>>> process 
>>>>>> by the operating system isn't a fully accurate measure of memory 
>>>>>> allocation 
>>>>>> use in the process either: a heap manager can decide it's cheaper to 
>>>>>> request some new pages from the operating system than to reorganize its 
>>>>>> heap and reuse the fragmented space on it. I think for this loop, memory 
>>>>>> allocation consistently swells with repeated execution, so there 
>>>>>> probably 
>>>>>> really is something leaking. But given that it's not in GC-tracked 
>>>>>> objects 
>>>>>> on the python heap, one would probably need valgrind information or a 
>>>>>> keen 
>>>>>> look at the code involved to locate where it's coming from. 
>>>>>> >> > 
>>>>>> >> > -- 
>>>>>> >> > You received this message because you are subscribed to the 
>>>>>> Google Groups "sage-devel" group. 
>>>>>> >> > To unsubscribe from this group and stop receiving emails from 
>>>>>> it, send an email to sage-devel+...@googlegroups.com. 
>>>>>> >> > To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/sage-devel/e63e2ec9-106a-4ddd-ab16-5c6db4fe83b4n%40googlegroups.com.
>>>>>>  
>>>>>>
>>>>>> > 
>>>>>> > -- 
>>>>>> > You received this message because you are subscribed to the Google 
>>>>>> Groups "sage-devel" group. 
>>>>>> > To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to sage-devel+...@googlegroups.com. 
>>>>>> > To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/sage-devel/8f309f36-5a39-4677-a137-60a724e0d970n%40googlegroups.com.
>>>>>>  
>>>>>>
>>>>>>
>>>>> -- 
>>> You received this message because you are subscribed to a topic in the 
>>> Google Groups "sage-devel" group.
>>> To unsubscribe from this topic, visit 
>>> https://groups.google.com/d/topic/sage-devel/fWBls6YbXmw/unsubscribe.
>>> To unsubscribe from this group and all its topics, send an email to 
>>> sage-devel+...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/sage-devel/5211d1b1-0bfc-4e1b-bd0a-8f499539c4cfn%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/sage-devel/5211d1b1-0bfc-4e1b-bd0a-8f499539c4cfn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"sage-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/sage-devel/d91ea756-6704-4b4e-a173-4b671ccfdf6an%40googlegroups.com.

Re: [sage-devel] Re: Memory leak in |EllipticCurve([n,0]).root_number()| and problem in algebraic geometry

Reply via email to