[issue14596] struct.unpack memory leak

2012-04-16 Thread Robert Elsner

New submission from Robert Elsner:

When unpacking multiple files of _variable_ length, struct.unpack leaks 
massive amounts of memory. The corresponding functions from numpy (fromfile) 
and the standard library array module (fromfile) behave as expected.

I prepared a minimal testcase illustrating the problem on 

Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48) 
[GCC 4.4.5] on linux2

This is a severe limitation when reading big files where performance is 
critical. The struct.Struct class does not display this behavior. Note that the 
variable length of the buffer is necessary to reproduce the problem (as is 
usually the case with real data files).
I suspect this is due to some internal buffer in the struct module not being 
freed after use.
I did not test on later Python versions, but could not find a related bug in 
the tracker.
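
A minimal sketch of the pattern described above, using an explicit 
struct.Struct object instead of struct.unpack. This is not the attached 
unpack_memory_leak.py; the helper name unpack_all and the choice of 
standard-size doubles ("=d") are assumptions made for illustration only.

```python
import struct


def unpack_all(data, code="d"):
    """Unpack a whole buffer of identical items in one call.

    Hypothetical helper: the format string depends on the buffer length,
    which is the variable-length situation described in the report.
    """
    item_size = struct.calcsize("=" + code)
    count = len(data) // item_size
    # An explicit Struct object lives only as long as this local name,
    # unlike format strings passed to the module-level struct.unpack.
    fmt = struct.Struct("=%d%s" % (count, code))
    return fmt.unpack_from(data)
```

Because the Struct object is a local variable, it is released when the 
function returns, whatever the size of the format it was built from.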

--
components: Library (Lib)
files: unpack_memory_leak.py
messages: 158418
nosy: Robert.Elsner
priority: normal
severity: normal
status: open
title: struct.unpack memory leak
versions: Python 2.6
Added file: http://bugs.python.org/file25238/unpack_memory_leak.py

___
Python tracker 
<http://bugs.python.org/issue14596>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



2012-04-16 Thread Robert Elsner

Robert Elsner added the comment:

I would love to test but I am in a production environment atm and can't really 
spare the time to set up a test box. But maybe somebody with access to 2.7 on 
linux could test it with the supplied script (just start it and it should 
happily eat 8GB of memory - I think most users are going to notice ;)




2012-04-16 Thread Robert Elsner

Robert Elsner added the comment:

Well, it seems 3.1 is in the Debian repos as well. Same memory leak. So it is 
very unlikely that it has been fixed in 2.7. I modified the test case to be 
compatible with both 3.1 and 2.6.

--
versions: +Python 3.1
Added file: http://bugs.python.org/file25239/unpack_memory_leak.py




2012-04-16 Thread Robert Elsner

Robert Elsner added the comment:

Well, the problem is that performance is severely degraded when calling unpack 
multiple times. I do not know the size of the files in advance, and they might 
vary in size from 1M to 1G. I could use a fixed-size buffer, which is 
inefficient depending on the file size (too big or too small). And if I change 
the buffer size on the fly, I end up with the memory leak. I think the caching 
should take into account the available memory on the system. The no_leak 
function has comparable performance without the leak. And I think there is no 
point in caching Struct instances when they go out of scope and cannot be 
accessed anymore. If I let it slip from the scope, I do not want to use it 
thereafter, especially considering that struct.Struct behaves as expected, as 
do array.fromfile and numpy.fromfile.




2012-04-16 Thread Robert Elsner

Robert Elsner added the comment:

Well I stumbled across this leak while reading big files. And what is
the point of having a fast C-level unpack when it can not be used with
big files?
I am not averse to the idea of caching the format string but if the
cache grows beyond a reasonable size, it should be freed. And
"reasonable" is not the number of objects contained but the amount of
memory it consumes. And caching an arbitrary amount of data (8GB here)
is a waste of memory.

And reading the Python docs, struct.Struct.unpack, which is _not_
affected by the memory leak, is supposed to be faster. Quote:

> class struct.Struct(format)
> 
> Return a new Struct object which writes and reads binary data according to 
> the format string format. Creating a Struct object once and calling its 
> methods is more efficient than calling the struct functions with the same 
> format since the format string only needs to be compiled once.

Caching in case of struct.Struct is straightforward: As long as the
object exists, the format string is cached and if the object is no
longer accessible, its memory gets freed - including the cached format
string. The problem is with the "magic" creation of struct.Struct
objects by struct.unpack that linger around even after all associated
variables are no longer in scope.
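
The lifetime of an explicit Struct object can be checked with a small 
sketch like the following. It assumes CPython, where reference counting 
frees an object as soon as its last reference is dropped; the function name 
and the format string are made up for illustration.

```python
import struct
import weakref


def struct_is_freed_after_del(fmt="=1000d"):
    """Return True if an explicit Struct is released once unreferenced.

    Assumes CPython reference counting; other interpreters may collect
    the object later.
    """
    s = struct.Struct(fmt)
    ref = weakref.ref(s)  # observe the object without keeping it alive
    del s                 # drop the only strong reference
    return ref() is None
```

Nothing comparable is available for the objects created implicitly by 
struct.unpack, since the caller never holds a reference to them.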

Using, for example, a fixed 1MB buffer to read files (regardless of size)
incurs a huge performance penalty. Reading everything at once into
memory using struct.unpack (or, with the same speed, struct.Struct.unpack)
is the fastest way: approximately 40% faster than array.fromfile and
70% faster than numpy.fromfile.
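
For comparison, the fixed-buffer alternative mentioned above might look 
roughly like this. It is only a sketch: the function name, the default 
chunk size (131072 doubles, i.e. 1 MiB) and the "=d" item format are 
assumptions, and it makes no claim about the timings quoted above.

```python
import io
import struct


def read_doubles_chunked(fileobj, chunk_items=131072):
    """Read standard-size doubles from a binary file in fixed-size chunks."""
    item = struct.Struct("=d")
    chunk = struct.Struct("=%dd" % chunk_items)  # reused for every full chunk
    values = []
    while True:
        buf = fileobj.read(chunk.size)
        if len(buf) == chunk.size:
            values.extend(chunk.unpack(buf))
            continue
        # Short read: handle the trailing partial chunk, then stop.
        count = len(buf) // item.size
        if count:
            values.extend(struct.Struct("=%dd" % count).unpack_from(buf))
        break
    return values
```

Usage with an in-memory file: 
read_doubles_chunked(io.BytesIO(struct.pack("=5d", 1, 2, 3, 4, 5)), 2) 
reads two full chunks and one trailing item.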

I read some unspecified report about a possible memory leak in
struct.unpack but the author did not investigate further. It took me
quite some time to figure out what exactly happens. So there should be
at least a warning about this (ugly) behavior when reading big files for
speed and a pointer to a quick workaround (using struct.Struct.unpack).

cheers

On 16.04.2012 15:59, Antoine Pitrou wrote:
> 
> Antoine Pitrou added the comment:
> 
>> Perhaps the best quick fix would be to only cache small
>> PyStructObjects, for some value of 'small'.  (Total size < a few
>> hundred bytes, perhaps.)
> 
> Or perhaps not care at all? Is there a use case for huge repeat counts?
> (limiting cacheability could decrease performance in existing
> applications)



2012-04-23 Thread Robert Elsner

Robert Elsner added the comment:

Well, then at least the docs need an update. I simply fail to see how a
cache memory leak constitutes "just fine", given that the caching behavior of
struct.unpack is not documented. If somebody wants caching, they ought to
use struct.Struct.unpack, which does cache and does not leak. Something
like a warning, "struct.unpack might display memory leaks when parsing
big files using large format strings", might be sufficient. I do not like
the idea of code failing outside some undocumented "use case",
especially as those problems usually indicate some underlying design
flaw. I did not review the proposed patch but might find time to have a
look in a few months.

cheers

On 20.04.2012 19:56, Mark Dickinson wrote:
> 
> Mark Dickinson added the comment:
> 
> IMO, the struct module does what it's intended to do just fine here.  I don't 
> see a big need for any change.  I'd propose closing this as "won't fix".