On Mon, Sep 14, 2015, at 18:09, Tim Peters wrote:
> Sorry, I'm not arguing about this any more.  Pickle doesn't work at
> all at the level of "count of bytes followed by a string".

The SHORT_BINBYTES opcode consists of the byte b'C', followed by *yes
indeed* "count of bytes followed by a string".

> If you
> want to make a pickle argument that makes sense, I'm afraid you'll
> need to become familiar with how pickle works first.  This is not the
> place for a pickle tutorial.
>
> Start by learning what a datetime pickle actually is.
> pickletools.dis() will be very helpful.

    0: \x80 PROTO      3
    2: c    GLOBAL     'datetime datetime'
   21: q    BINPUT     0
   23: C    SHORT_BINBYTES b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00'
   35: q    BINPUT     1
   37: \x85 TUPLE1
   38: q    BINPUT     2
   40: R    REDUCE
   41: q    BINPUT     3
   43: .    STOP

The payload is ten bytes, and the byte immediately before it is in fact
0x0a. If I pickle any byte string under 256 bytes long by itself, the
byte immediately before the data is the length. This is how I initially
came to the conclusion that "count of bytes followed by a string" was
valid.

I did, before writing my earlier post, look into the high-level aspects
of how datetime pickle works - it uses __reduce__ to create up to two
arguments, one of which is a 10-byte string, and the other is the
tzinfo. Those arguments are passed into the datetime constructor and
detected by that constructor - for example, I can call it directly with
datetime(b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00') and get the same
result as unpickling.

At the low level, the part that represents that first argument does
indeed appear to be "count of bytes followed by a string". I can add to
the count, add more bytes, and it will call the constructor with the
longer string. If I use pickletools.dis on my modified value, the output
looks the same except for, as expected, the offsets and the value of the
argument to the SHORT_BINBYTES opcode.

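The byte-level claims above can be checked with a short sketch like the following. It is only an illustration: find_op is a hypothetical helper, not part of pickle or pickletools, and the two appended bytes are arbitrary padding chosen for the example.

```python
import pickle
import pickletools
from datetime import datetime


def find_op(data, name):
    """Return (arg, pos) of the first opcode called `name` (hypothetical helper)."""
    for opcode, arg, pos in pickletools.genops(data):
        if opcode.name == name:
            return arg, pos
    raise ValueError(name + " not found")


dt = datetime(2015, 9, 14, 21, 6, 42)
data = pickle.dumps(dt, protocol=3)

payload, pos = find_op(data, "SHORT_BINBYTES")
# Layout at `pos`: the opcode byte b'C', one length byte, then the payload.
opcode_byte = data[pos:pos + 1]
count = data[pos + 1]
raw = data[pos + 2:pos + 2 + count]

# Bump the count and append two bytes by hand, leaving everything else alone.
extra = b"\x00\x00"  # arbitrary padding, an assumption made for this example
longer = (data[:pos + 1] + bytes([count + len(extra)])
          + raw + extra + data[pos + 2 + count:])
longer_payload, _ = find_op(longer, "SHORT_BINBYTES")

# The modified pickle is only disassembled here, never loaded.
names = [op.name for op, _, _ in pickletools.genops(data)]
longer_names = [op.name for op, _, _ in pickletools.genops(longer)]

print(opcode_byte, count, raw)
print(longer_payload)
```

The modified pickle is deliberately not passed to pickle.loads: the disassembly shows that the opcode stream stays the same apart from the SHORT_BINBYTES argument, but whether the constructor accepts a longer string depends on the Python version.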
So, it appears that, as I was saying, "wasted space" would not have been
an obstacle to having the "payload" accepted by the constructor (and
produced by __reduce__, ultimately by _getstate) consist of "a byte
string of >= 10 bytes, the first 10 of which are used and the rest of
which are ignored by Python <= 3.5" instead of "a byte string of exactly
10 bytes", since it would have accepted and produced exactly the same
pickle values, but been prepared to accept larger arguments pickled from
future versions.

For completeness: Protocol versions 2 and 1 use BINUNICODE on a
latin1-to-utf8 version of the byte string, with a similar "count of
bytes followed by a string" (though the count of bytes is of UTF-8
bytes). Protocol version 0 uses UNICODE, terminated by \n, and a literal
\n is represented by \\u000a. In all cases some extra data around the
value sets it up to call "codecs.encode(..., 'latin1')" upon unpickling.

So have I shown you that I know enough about the pickle format to know
that permitting a longer string (and ignoring the extra bytes) would
have had zero impact on the pickle representation of values that did not
contain a longer string? I'd already figured out half of this before
writing my earlier post; I just assumed *you* knew enough that I
wouldn't have to show my work.

Extra credit:

    0: \x80 PROTO      3
    2: c    GLOBAL     'datetime datetime'
   21: q    BINPUT     0
   23: (    MARK
   24: M        BININT2    2014
   27: K        BININT1    9
   29: K        BININT1    14
   31: K        BININT1    21
   33: K        BININT1    6
   35: K        BININT1    42
   37: t        TUPLE      (MARK at 23)
   38: q    BINPUT     1
   40: R    REDUCE
   41: q    BINPUT     2
   43: .    STOP
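A minimal sketch for checking the older-protocol behaviour and the "extra credit" pickle on a current Python 3. The helper ops is hypothetical, and the by_hand bytes are transcribed from the listing's opcodes and arguments, so treat them as an assumption about what that listing encodes.

```python
import pickle
import pickletools
import struct
from datetime import datetime

dt = datetime(2015, 9, 14, 21, 6, 42)
payload = b"\x07\xdf\t\x0e\x15\x06*\x00\x00\x00"


def ops(data):
    """List (name, arg, pos) for every opcode in a pickle (hypothetical helper)."""
    return [(op.name, arg, pos) for op, arg, pos in pickletools.genops(data)]


# Protocol 2: the payload travels as BINUNICODE, preceded by a 4-byte
# little-endian count of UTF-8 bytes. 0xdf takes two bytes in UTF-8, so the
# count is one more than the 10 payload bytes.
p2 = pickle.dumps(dt, protocol=2)
text, pos = next((arg, pos) for name, arg, pos in ops(p2)
                 if name == "BINUNICODE")
(utf8_count,) = struct.unpack("<I", p2[pos + 1:pos + 5])

# Protocol 0: UNICODE is newline-terminated, so a literal newline is escaped.
p0_newline = pickle.dumps(b"\n", protocol=0)

# The "extra credit" pickle, written out by hand: six small integers in a
# tuple, passed to the datetime constructor by REDUCE.
by_hand = (b"\x80\x03" b"cdatetime\ndatetime\n" b"q\x00"
           b"(" b"M\xde\x07" b"K\x09K\x0eK\x15K\x06K\x2a"
           b"t" b"q\x01" b"R" b"q\x02" b".")

print(utf8_count, text.encode("latin1") == payload)
print(p0_newline)
print(pickle.loads(by_hand))
```

Note that the hand-written pickle goes through the ordinary integer arguments of the constructor rather than the 10-byte string, which is why it loads without any reference to the payload format.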