On 12/6/11 7:27 PM, John Ladasky wrote:
On a related note, pickling of arrays of float64 objects, as generated
by the numpy package for example, is wildly inefficient in terms of
memory.
A half-million float64's requires about 4 megabytes, but the pickle
file I generated from a numpy.ndarray of this size was 42 megabytes.
I know that numpy has its own pickle protocol, and that it's supposed
to help with this problem. Still, if this is a general problem with
Python and pickling numbers, it might be worth solving in the
language itself.
It is. Use protocol=HIGHEST_PROTOCOL when dumping the array to a pickle.
[~]
|1> big = np.linspace(0.0, 1.0, 500000)
[~]
|2> import cPickle
[~]
|3> len(cPickle.dumps(big))
11102362
[~]
|4> len(cPickle.dumps(big, protocol=cPickle.HIGHEST_PROTOCOL))
4000135
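
The same thing works when dumping to a file; just be sure to open it in
binary mode. A minimal sketch (assuming numpy and cPickle are available;
the filename 'big.pkl' is arbitrary):

import numpy as np
import cPickle

big = np.linspace(0.0, 1.0, 500000)

# Dump with the binary protocol; open the file in 'wb' mode since the
# output is raw bytes, not text.
with open('big.pkl', 'wb') as f:
    cPickle.dump(big, f, protocol=cPickle.HIGHEST_PROTOCOL)

# Load it back and check that the array round-trips intact.
with open('big.pkl', 'rb') as f:
    restored = cPickle.load(f)

assert (restored == big).all()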
The original conception for pickle was that it would have an ASCII
representation for optimal cross-platform compatibility. These were the days
when people still used FTP regularly, and you could easily (and silently!) screw
up binary data if you sent it in ASCII mode by accident. This necessarily
creates large files for numpy arrays. Further iterations on the pickling
protocol let numpy use raw binary data in the pickle. However, for backwards
compatibility, the default protocol is the one Python started out with. If you
explicitly use the most recent protocol, then you will get the efficiency benefits.
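
If you want to see the difference for yourself, here is a quick sketch
that compares the pickle sizes for each protocol on the same array as
above (again assuming numpy and cPickle):

import numpy as np
import cPickle

big = np.linspace(0.0, 1.0, 500000)

# Protocol 0 is the original ASCII protocol, so the array's raw bytes get
# escaped into printable characters; the later binary protocols embed them
# as-is, so the pickle is roughly 8 bytes per float64 plus a small header.
for proto in range(cPickle.HIGHEST_PROTOCOL + 1):
    size = len(cPickle.dumps(big, protocol=proto))
    print 'protocol %d: %d bytes' % (proto, size)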
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco