On 29Jul2015 07:52, dieter <die...@handshake.de> wrote:
"=?GBK?B?wO68zsX0?=" <lijpba...@126.com> writes:
Hi, I tried using seek to reverse a text file after reading about the
subject in the documentation:
https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
https://docs.python.org/3/library/io.html#io.TextIOBase.seek
...
However, an exception is raised if a file with the same content encoded in
GBK is provided:
    $ ./reverse_text_by_seek3.py Moon-gbk.txt
    [0, 7, 8, 19, 21, 32, 42, 53, 64]
    低头思故乡
    举头望明月
    Traceback (most recent call last):
      File "./reverse_text_by_seek3.py", line 21, in <module>
        print(f.readline(), end="")
    UnicodeDecodeError: 'gbk' codec can't decode byte 0xaa in position 8: illegal multibyte sequence

"seek" works at the byte level, while decoding works at the character level,
and some characters are made up of several bytes.

The error you observe indicates that you have "seeked" somewhere
inside a character, not at a legal character beginning.

That you get an error for "gbk" and not for "utf-8" is a bit of
an "accident". The same problem can happen for "utf-8" but the probability
might be slightly lower.
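
As a rough illustration of the point above, the following sketch decodes the same line from byte offset 0 and from byte offset 1 in both encodings. The helper name decodes_cleanly and the sample line are assumptions made for this example, and slicing a bytes object stands in for seeking in a file.

```python
def decodes_cleanly(data, offset, encoding):
    """Return True if data[offset:] decodes without error in the given encoding."""
    try:
        data[offset:].decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


line = "举头望明月\n"  # sample text, assumed for this sketch

for encoding in ("gbk", "utf-8"):
    data = line.encode(encoding)
    # Offset 0 is a character beginning; offset 1 lies inside the first
    # character, which takes 2 bytes in GBK and 3 bytes in UTF-8.
    print(encoding, decodes_cleanly(data, 0, encoding), decodes_cleanly(data, 1, encoding))
```

Both encodings reject the decode that starts at offset 1, so UTF-8 is not immune to a seek that lands mid-character.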

Seek only to byte positions that you know are also
character beginnings -- e.g. line beginnings.
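
One way to follow that advice is to record f.tell() before each readline() and later seek only to those recorded positions. This is a minimal sketch, not the original poster's script: the function name reversed_lines, the file name and the sample text are all assumptions.

```python
import os
import tempfile


def reversed_lines(path, encoding):
    """Return the lines of a text file last-to-first by seeking to line beginnings."""
    starts = []
    lines = []
    with open(path, encoding=encoding) as f:
        # readline() is used instead of "for line in f" because iterating
        # a text file with next() disables tell().
        while True:
            pos = f.tell()          # position of the line about to be read
            if not f.readline():
                break               # end of file
            starts.append(pos)
        for pos in reversed(starts):
            f.seek(pos)             # always a value that tell() returned
            lines.append(f.readline())
    return lines


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "moon-gbk.txt")  # hypothetical sample file
    with open(path, "w", encoding="gbk") as f:
        f.write("举头望明月\n低头思故乡\n")
    print("".join(reversed_lines(path, "gbk")), end="")
```

Because every seek target came from tell() at a line beginning, no seek can land inside a character.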

You may also keep in mind that while you can't do arithmetic on these positions without knowing the length of the _encoded_ text, what you can do is note the value returned by f.tell() whenever you like. If you are reading a text file (that is, an encoding of the text in a specific character set, be it GBK or UTF-8) then after any read you will be on a character boundary, and can return there.
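
A small sketch of that idea: read a few characters, note f.tell() in the middle of a line, and come back to it later. The Python documentation describes the value tell() returns for a text file as an opaque number, so the code only stores it and passes it back to seek(). The function name read_twice_from_bookmark and the sample text are assumptions for this example.

```python
import os
import tempfile


def read_twice_from_bookmark(path, encoding, chars_before):
    """Read chars_before characters, bookmark the spot, and read the rest twice."""
    with open(path, encoding=encoding) as f:
        f.read(chars_before)      # stop somewhere in the middle of a line
        bookmark = f.tell()       # opaque value; do no arithmetic on it
        first = f.read()          # the rest of the file
        f.seek(bookmark)          # return to the noted character boundary
        second = f.read()         # the same text again
    return first, second


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "moon-gbk.txt")  # hypothetical sample file
    with open(path, "w", encoding="gbk") as f:
        f.write("举头望明月\n低头思故乡\n")
    first, second = read_twice_from_bookmark(path, "gbk", 2)
    print(first == second)
```

The bookmark here sits after the second character of the first line, not at a line beginning, and the seek still lands on a character boundary.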

Actually, on reflection, there may be some character encodings where this is not true; I think some encodings of Japanese employ some kind of mode shift sequence, so you might need knowledge of those; a plain seek() might not be enough. But for any encoding where the bytes at a given spot are all that is needed to decode the character there, a seek() to any position obtained from tell() should be reliable.

In short: line beginnings are not the only places where you can safely seek, though they may be the most conveniently available.

Cheers,
Cameron Simpson <c...@zip.com.au>
--
https://mail.python.org/mailman/listinfo/python-list
