Re: How to read gzipped utf8 file in Python?

Martin v. Löwis Thu, 22 Nov 2007 12:38:09 -0800

>   I have a large (gigabytes) file which is encoded in UTF-8 and then
> compressed with gzip.  I'd like to read it with the "gzip" module
> and "utf8" decoding.


You didn't specify the processing you want to perform. For example,
this should work just fine:

import gzip

fd = gzip.open(fname, 'rb')
for line in fd:  # iterate over the file line by line
    pass

For that processing, it is not even necessary to know what the encoding
of the file is, except that it is an ASCII superset (which UTF-8 is).
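
To make that concrete, here is a minimal sketch of byte-level line
processing. It relies only on the fact that the newline byte (0x0A)
never occurs inside a multi-byte UTF-8 sequence, so splitting on it is
safe without decoding. The function name count_lines is just an
illustration, not something from the gzip module.

```python
import gzip

def count_lines(fname):
    """Count the lines of a gzip-compressed file without decoding it.

    count_lines is a hypothetical helper name used for illustration.
    """
    count = 0
    fd = gzip.open(fname, 'rb')
    try:
        # Each item is a byte string; no knowledge of the text
        # encoding is needed to find the line boundaries.
        for line in fd:
            count += 1
    finally:
        fd.close()
    return count
```

The file is read incrementally, so this should also be fine for files
of several gigabytes.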

> The obvious approach is
> 
>     fd = gzip.open(fname, 'rb',encoding='utf8')
> 
> But "gzip.open" doesn't support an "encoding" parameter.  (It
> probably should, for consistency.)

I think I disagree. The builtin open function does not support an
encoding argument, either (in Python 2.x). Conceptually, gzip operates
on byte streams, not character streams.
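
The separation of the two layers can be sketched like this: gzip hands
back bytes, and decoding is an explicit second step. The helper name
first_line_as_text and its encoding parameter are my own, purely for
illustration.

```python
import gzip

def first_line_as_text(fname, encoding='utf-8'):
    """Return the first line of a gzip-compressed file as decoded text.

    first_line_as_text is a hypothetical helper name for illustration.
    """
    fd = gzip.open(fname, 'rb')
    try:
        raw = fd.readline()  # bytes, exactly as stored after decompression
    finally:
        fd.close()
    # Decoding is a separate step, applied to the decompressed bytes.
    return raw.decode(encoding)
```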

> Is it possible to express "unzip, then decode utf8" via
> "codecs.open"?

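If that's the processing you want to do, sure:

import codecs
import gzip

fd0 = gzip.open(fname, 'rb')          # byte stream (decompressed)
fd = codecs.getreader("utf-8")(fd0)   # character stream on top of it
data = fd.readline()                  # one decoded line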

You can combine that into

fd = codecs.getreader("utf-8")(gzip.open(fname))
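
For a large file you probably want to go line by line rather than read
everything at once. A minimal sketch, assuming the file really is valid
UTF-8 (the generator name iter_utf8_lines is made up for this example):

```python
import codecs
import gzip

def iter_utf8_lines(fname):
    """Yield decoded lines from a gzip-compressed UTF-8 file.

    iter_utf8_lines is a hypothetical helper name for illustration.
    """
    # Wrap the decompressed byte stream in a UTF-8 stream reader.
    fd = codecs.getreader("utf-8")(gzip.open(fname, 'rb'))
    try:
        for line in fd:  # each line is a decoded (unicode) string
            yield line
    finally:
        fd.close()  # closes the underlying gzip file as well
```

Usage would be along the lines of

    for line in iter_utf8_lines(fname):
        pass

Invalid input raises UnicodeDecodeError, since the reader uses strict
error handling by default.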

HTH,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list
