> I have a large (gigabytes) file which is encoded in UTF-8 and then
> compressed with gzip. I'd like to read it with the "gzip" module
> and "utf8" decoding.

You didn't specify the processing you want to perform. For example,
this should work just fine:

    fd = gzip.open(fname, 'rb')
    for line in fd:
        pass

For that processing, it is not even necessary to know what the
encoding of the file is, except that it is an ASCII superset (which
UTF-8 is).

> The obvious approach is
>
>     fd = gzip.open(fname, 'rb',encoding='utf8')
>
> But "gzip.open" doesn't support an "encoding" parameter. (It
> probably should, for consistency.)

I think I disagree. The builtin open function does not support an
encoding argument, either (in Python 2.x). Conceptually, gzip
operates on byte streams, not character streams.

> Is it possible to express "unzip, then decode utf8" via
> "codecs.open"?

If that's the processing you want to do - sure:

    fd0 = gzip.open(fname, 'rb')
    fd = codecs.getreader("utf-8")(fd0)
    data = fd.readline()

You can combine that to

    fd = codecs.getreader("utf-8")(gzip.open(fname))

HTH,
Martin
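
For reference, here is a minimal self-contained sketch of the "unzip, then decode" chain, streaming one line at a time so that a multi-gigabyte file never has to fit in memory. The helper name iter_utf8_lines and the throwaway sample file are assumptions made up for the demonstration; only gzip.open and codecs.getreader come from the discussion above.

```python
import codecs
import gzip
import os
import shutil
import tempfile


def iter_utf8_lines(fname):
    """Yield decoded text lines from a gzip-compressed UTF-8 file.

    Hypothetical helper: gzip yields bytes, and the codecs stream
    reader wrapped around it turns those bytes into text lazily.
    """
    raw = gzip.open(fname, "rb")
    try:
        reader = codecs.getreader("utf-8")(raw)
        for line in reader:
            yield line
    finally:
        raw.close()


# Build a small sample file (assumed data, only for this demo).
tmpdir = tempfile.mkdtemp()
sample_path = os.path.join(tmpdir, "sample.txt.gz")
out = gzip.open(sample_path, "wb")
out.write(u"caf\u00e9\nna\u00efve\n".encode("utf-8"))
out.close()

# Read it back as decoded text, line by line.
lines = list(iter_utf8_lines(sample_path))

# Remove the temporary directory again.
shutil.rmtree(tmpdir)
```

Wrapping the byte stream in a reader, rather than decoding each line by hand, keeps the decoding in one place and also copes with multi-byte characters that straddle an internal read boundary.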