Re: Python3.1: gzip encoding with UTF-8 fails

Mark Tolonen Sun, 20 Dec 2009 10:04:45 -0800

"Diez B. Roggisch" <de...@nospam.web.de> wrote in message news:7p7328f3r1r2...@mid.uni-berlin.de...
Johannes Bauer wrote:
> Hello group,
>
> with this following program:
>
> #!/usr/bin/python3
> import gzip
> x = gzip.open("testdatei", "wb")
> x.write("ä")
> x.close()
>
> I get a broken .gzip file when decompressing:
>
> $ cat testdatei |gunzip
> ä
> gzip: stdin: invalid compressed data--length error
>
> As it only happens with UTF-8 characters, I suppose the gzip module
> writes a length of 1 in the gzip file header (one character "ä"), but
> then actually writes 2 characters (0xc3 0xa4).
UTF-8 is not Unicode. Even if the source encoding above is UTF-8, I'm not sure what is used to encode the Unicode string when it's written.
>
> Is there a solution?
What about writing a bytestring by explicitly encoding the string to UTF-8 first?
x.write("ä".encode("utf-8"))
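A minimal sketch of that workaround, round-tripping through a scratch file to check that the archive decompresses cleanly. The temporary directory and the variable names are assumptions made for illustration and are not part of the original script:

```python
import gzip
import os
import tempfile

text = "ä"
payload = text.encode("utf-8")  # two bytes: 0xc3 0xa4

# Hypothetical scratch location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "testdatei")

# Binary mode: hand gzip a bytes object, not a str.
with gzip.open(path, "wb") as f:
    f.write(payload)

# Read the raw bytes back and decode them explicitly.
with gzip.open(path, "rb") as f:
    restored = f.read().decode("utf-8")
```

Encoding before the write means the number of bytes gzip records and the number of bytes it compresses come from the same object.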

While that works, it still seems like a bug in gzip. If gzip.open is replaced with a simple open:


# coding: utf-8
import gzip
x = open("testdatei", "wb")
x.write("ä")
x.close()

The result is:

Traceback (most recent call last):
  File "C:\dev\python3\Lib\site-packages\Pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
    exec(codeObj, __main__.__dict__)
  File "<auto import>", line 1, in <module>
  File "y.py", line 4, in <module>
    x.write("ä")
TypeError: must be bytes or buffer, not str

A file opened in binary mode should require a bytes or buffer object, and gzip.open in "wb" mode should enforce the same rule.
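
One way to compare the two behaviours on a given interpreter is sketched below. The helper name rejects_str and the scratch file names are hypothetical. Whether gzip.open rejects the str depends on the Python version in use, so the sketch reports the result rather than assuming it:

```python
import gzip
import os
import tempfile


def rejects_str(opener, path):
    """Return True if writing a str to a binary-mode file raises TypeError."""
    try:
        with opener(path, "wb") as f:
            f.write("ä")  # a str, deliberately not bytes
    except TypeError:
        return True
    return False


# Hypothetical scratch directory, used only for this sketch.
tmp = tempfile.mkdtemp()
plain_rejects = rejects_str(open, os.path.join(tmp, "plain"))
gzip_rejects = rejects_str(gzip.open, os.path.join(tmp, "packed.gz"))
print("open:", plain_rejects, "gzip.open:", gzip_rejects)
```

If the second value is False, gzip.open accepted the str without complaint, which is the inconsistency described above.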

-Mark

