On 12/6/11 7:27 PM, John Ladasky wrote:
On a related note, pickling of arrays of float64 objects, as generated
by the numpy package for example, is wildly inefficient in terms of
memory.
A half-million float64's requires about 4 megabytes, but the pickle
file I generated from a numpy.ndarray of this size was 42 megabytes.
I know that numpy has its own pickle protocol, and that it's supposed
to help with this problem. Still, if this is a general problem with
Python and pickling numbers, it might be worth solving in the
language itself.
It is. Use protocol=HIGHEST_PROTOCOL when dumping the array to a pickle.
[~]
|1> big = np.linspace(0.0, 1.0, 500000)
[~]
|2> import cPickle
[~]
|3> len(cPickle.dumps(big))
11102362
[~]
|4> len(cPickle.dumps(big, protocol=cPickle.HIGHEST_PROTOCOL))
4000135
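
The same thing works when dumping to a file; just be sure to open it in
binary mode. A minimal sketch (assuming numpy and cPickle are available;
the filename 'big.pkl' is arbitrary):

import numpy as np
import cPickle

big = np.linspace(0.0, 1.0, 500000)

# Dump with the binary protocol; open the file in 'wb' mode since the
# output is raw bytes, not text.
with open('big.pkl', 'wb') as f:
    cPickle.dump(big, f, protocol=cPickle.HIGHEST_PROTOCOL)

# Load it back and check that the array round-trips intact.
with open('big.pkl', 'rb') as f:
    restored = cPickle.load(f)

assert (restored == big).all()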
The original conception for pickle was that it would have an ASCII
representation for optimal cross-platform compatibility. These were the days
when people still used FTP regularly, and you could easily (and silently!) screw
up binary data if you sent it in ASCII mode by accident. This necessarily
creates large files for numpy arrays. Further iterations on the pickling
protocol let numpy use raw binary data in the pickle. However, for backwards
compatibility, the default protocol is the one Python started out with. If you
explicitly use the most recent protocol, then you will get the efficiency benefits.
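
If you want to see the difference for yourself, here is a quick sketch
that compares the pickle sizes for each protocol on the same array as
above (again assuming numpy and cPickle):

import numpy as np
import cPickle

big = np.linspace(0.0, 1.0, 500000)

# Protocol 0 is the original ASCII protocol, so the array's raw bytes get
# escaped into printable characters; the later binary protocols embed them
# as-is, so the pickle is roughly 8 bytes per float64 plus a small header.
for proto in range(cPickle.HIGHEST_PROTOCOL + 1):
    size = len(cPickle.dumps(big, protocol=proto))
    print 'protocol %d: %d bytes' % (proto, size)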
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco