New submission from James Hennessy:

The tempfile.SpooledTemporaryFile class doesn't correctly preserve data for 
text (non-binary) SpooledTemporaryFile objects when Unicode characters are 
written.  The attached program demonstrates the failure.  It creates a 
SpooledTemporaryFile object, writes 20 string characters to it, and then tries 
to read them back.  If the SpooledTemporaryFile has rolled over to disk, as it 
does in the demonstration program, then the data is not read back correctly.  
Instead, an exception is raised because the data in the SpooledTemporaryFile 
has been corrupted.

The problem is this statement in tempfile.py, in the rollover() method:
        newfile.seek(file.tell(), 0)
The "file" variable references a StringIO object, whose tell() and seek() 
methods count in characters, not bytes, yet this value is applied to a 
TemporaryFile object, whose tell() and seek() methods deal in bytes, not 
characters.  The demonstration program first writes 10 characters to the 
SpooledTemporaryFile.  Since 10 exceeds the rollover size of 5, the 
implementation writes the 10 characters to the TemporaryFile and then seeks to 
position 10 in the TemporaryFile, which it thinks is the end of the stream.  
But those 10 characters got encoded to 30 bytes, and seek position 10 is in the 
middle of the UTF-8 sequence for the fourth character.  The next write to the 
SpooledTemporaryFile starts overlaying bytes from there.  The attempt to read 
back the data fails because the byte stream no longer represents a valid UTF-8 
stream of data.
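
The mechanism can be sketched without tempfile at all.  This is only an 
illustration, not the attached showbug.py: io.BytesIO stands in for the 
TemporaryFile, and the euro sign is an assumed example of a character that 
takes 3 bytes in UTF-8.

```python
import io

CHAR = "\u20ac"  # assumption: any character that encodes to 3 bytes in UTF-8

# In-memory text buffer; its tell() counts characters.
text_buf = io.StringIO()
text_buf.write(CHAR * 10)
char_pos = text_buf.tell()            # 10 characters

# Stand-in for the byte-oriented TemporaryFile used after rollover.
byte_file = io.BytesIO()
byte_file.write(text_buf.getvalue().encode("utf-8"))
byte_len = byte_file.tell()           # 30 bytes

# Mimic rollover(): apply the character position as a byte offset.
byte_file.seek(char_pos, 0)

# The next write lands in the middle of the fourth character.
byte_file.write((CHAR * 10).encode("utf-8"))

raw = byte_file.getvalue()
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(char_pos, byte_len, len(raw), decoded_ok)
```

The resulting byte string is 40 bytes long rather than 60, and it no longer 
decodes as UTF-8.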

A related problem is that tell() and seek() behave inconsistently for text 
(non-binary) SpooledTemporaryFile objects.  If the data hasn't yet rolled over 
to a TemporaryFile, they count in string characters.  If the data has rolled 
over, they count in bytes.

A quick fix for this is to remove the seek() in the rollover() method.  I 
presume it's there to preserve the stream position if an explicit call to 
rollover() is made, since for an implicit call, the position would be at the 
end of the stream already.  This quick fix, therefore, would introduce an 
external incompatibility in the behavior of rollover().

Another possibility is to never use a StringIO object, but to always buffer 
data in a BytesIO object, as is done for binary SpooledTemporaryFile objects.  
This has the advantage of "fixing" the tell() and seek() inconsistency, making 
them count bytes all the time.  The downside, of course, is that data that 
doesn't end up being rolled over to a TemporaryFile gets encoded and decoded, a 
round trip that could otherwise be avoided.
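
A rough sketch of that second idea, under stated assumptions: the text layer is 
an io.TextIOWrapper over an io.BytesIO, a second io.BytesIO stands in for the 
TemporaryFile, and the euro sign is again an assumed 3-byte character.  It is 
meant to show why byte positions carry over cleanly, not how tempfile.py 
should be written.

```python
import io

CHAR = "\u20ac"  # assumption: a character that occupies 3 bytes in UTF-8

# Text layer over an in-memory byte buffer instead of io.StringIO.
mem = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", newline="\n")
mem.write(CHAR * 10)
mem.flush()                           # push encoded bytes into the BytesIO
byte_pos = mem.buffer.tell()          # position is already in bytes

# Mimic rollover: copy the raw bytes and restore the same byte position.
disk = io.BytesIO()                   # stand-in for the TemporaryFile
disk.write(mem.buffer.getvalue())
disk.seek(byte_pos, 0)

# A later write continues at a valid character boundary.
disk.write((CHAR * 10).encode("utf-8"))
round_trip = disk.getvalue().decode("utf-8")

print(byte_pos, len(round_trip))
```

Because both sides of the rollover count in bytes, the position needs no 
conversion, at the cost of encoding data that may never reach disk.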

This problem can be circumvented by a user of SpooledTemporaryFile by 
explicitly seeking to the end of the stream after every write to the 
SpooledTemporaryFile object:  spool.seek(0, io.SEEK_END)
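
A minimal sketch of that workaround follows.  The rollover size of 5 matches 
the demonstration described above; the euro sign and the two writes of 10 
characters each are assumptions made for illustration.

```python
import io
import tempfile

CHAR = "\u20ac"  # assumption: a character that encodes to 3 bytes in UTF-8

with tempfile.SpooledTemporaryFile(max_size=5, mode="w+",
                                   encoding="utf-8") as spool:
    for _ in range(2):
        spool.write(CHAR * 10)
        # Workaround: reposition at the true end of the stream after each
        # write, so a bad position left by rollover() is never used.
        spool.seek(0, io.SEEK_END)

    spool.seek(0)
    data = spool.read()

print(len(data))
```

The explicit seek is harmless when no rollover has happened, so it can be 
applied unconditionally after each write.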

----------
components: Library (Lib)
files: showbug.py
messages: 263147
nosy: James Hennessy
priority: normal
severity: normal
status: open
title: SpooledTemporaryFile doesn't correctly preserve data for text 
(non-binary) SpooledTemporaryFile objects when Unicode characters are written
type: behavior
versions: Python 3.4
Added file: http://bugs.python.org/file42423/showbug.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26730>
_______________________________________
_______________________________________________
Python-bugs-list mailing list