On 29Jul2015 07:52, dieter <die...@handshake.de> wrote:
"=?GBK?B?wO68zsX0?=" <lijpba...@126.com> writes:
Hi, I tried using seek to reverse a text file after reading about the
subject in the documentation:
https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
https://docs.python.org/3/library/io.html#io.TextIOBase.seek
...
However, an exception is raised if a file with the same content encoded in
GBK is provided:
$ ./reverse_text_by_seek3.py Moon-gbk.txt
[0, 7, 8, 19, 21, 32, 42, 53, 64]
µÍͷ˼¹ÊÏç
¾ÙÍ·ÍûÃ÷ÔÂ
Traceback (most recent call last):
  File "./reverse_text_by_seek3.py", line 21, in <module>
    print(f.readline(), end="")
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaa in position 8: illegal multibyte sequence
"seek" works at the byte level, while decoding works at the character level,
where a single character may be encoded as several bytes.
The error you observe indicates that you have "seeked" somewhere
inside a character, not at a legal character beginning.
That you get an error for "gbk" and not for "utf-8" is a bit of
an "accident". The same problem can happen for "utf-8", but the probability
might be slightly lower.
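To illustrate that the failure is not specific to GBK, here is a minimal sketch that writes one line of sample text in each encoding and then seeks to byte offset 1, which falls inside the first character in both cases. The helper names and the sample text are assumptions made up for illustration; they are not taken from the original script.

```python
import os
import tempfile

SAMPLE = "低头思故乡\n"  # assumed sample text; every character is multibyte

def read_from(path, encoding, offset):
    """Seek to a raw byte offset in a text-mode file, then read one line."""
    with open(path, encoding=encoding) as f:
        f.seek(offset)
        return f.readline()

def try_mid_character_seek(encoding):
    """Write SAMPLE in the given encoding and seek into its first character.

    Returns the decoded line, or the UnicodeDecodeError that was raised.
    """
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(SAMPLE.encode(encoding))
    try:
        return read_from(tmp.name, encoding, 1)
    except UnicodeDecodeError as exc:
        return exc
    finally:
        os.unlink(tmp.name)

for enc in ("gbk", "utf-8"):
    print(enc, "->", repr(try_mid_character_seek(enc)))
```

With this particular sample both encodings raise UnicodeDecodeError. With other data a mid-character seek can also decode without complaint into the wrong characters, which is harder to notice.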
Seek only to byte positions that you know are also
character beginnings -- e.g. line beginnings.
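One way to collect line beginnings, sketched under some assumptions: open the file in binary mode, note the byte offset before each line, then seek to those offsets and decode each line yourself. The function names are hypothetical. Splitting on b"\n" in binary mode is safe for GBK and UTF-8, because the byte 0x0A never occurs inside a multibyte character in either encoding; it would not be safe for something like UTF-16.

```python
import os
import tempfile

def line_start_offsets(path):
    """Return the byte offset at which each line of the file begins."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():  # empty bytes means end of file
                break
            offsets.append(pos)
    return offsets

def lines_in_reverse(path, encoding):
    """Read the file's lines last-to-first by seeking to line beginnings."""
    lines = []
    with open(path, "rb") as f:
        for pos in reversed(line_start_offsets(path)):
            f.seek(pos)
            lines.append(f.readline().decode(encoding))
    return lines

# Small demonstration with made-up sample text written as GBK.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("举头望明月\n低头思故乡\n".encode("gbk"))
try:
    print(line_start_offsets(tmp.name))
    print(len(lines_in_reverse(tmp.name, "gbk")), "lines read in reverse")
finally:
    os.unlink(tmp.name)
```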
You may also keep in mind that while you can't do arithmetic on these things
without knowing the length of the _encoded_ text, what you can do is note the
value returned by f.tell() whenever you like. If you are reading a text file
(== an encoding of the text in a specific character set, be it GBK or UTF8)
then after any read you will be on a character boundary, and can return there.
Actually, on reflection, there may be some character encodings where this is
not true; I think some encodings of Japanese employ some kind of mode shift
sequence, so you might need knowledge of those - a plain seek() might not be
enough. But for any encoding where the bytes at a given spot are all that is
needed to decode the character there, a seek() to any position obtained by
tell() should be reliable.
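A minimal sketch of the tell() approach, assuming an encoding without mode shifts such as GBK or UTF-8. The function name and the sample text are made up for illustration. The positions are treated as opaque values that are only ever handed back to seek(), which is what the io documentation asks for on text files.

```python
import os
import tempfile

def lines_in_reverse_via_tell(path, encoding):
    """Read lines last-to-first using positions recorded with f.tell()."""
    with open(path, encoding=encoding) as f:
        positions = []
        while True:
            pos = f.tell()        # opaque value; only pass it back to seek()
            if not f.readline():  # use readline(): "for line in f" disables
                break             # tell() on text files
            positions.append(pos)
        lines = []
        for pos in reversed(positions):
            f.seek(pos)
            lines.append(f.readline())
    return lines

# Small demonstration with made-up sample text written as GBK.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("举头望明月\n低头思故乡\n".encode("gbk"))
try:
    print(len(lines_in_reverse_via_tell(tmp.name, "gbk")), "lines read in reverse")
finally:
    os.unlink(tmp.name)
```

The same recording trick works after any read(), not just readline(), so the saved positions need not be line beginnings.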
In short: line beginnings are not the only places where you can safely seek.
Though they may be conveniently available.
Cheers,
Cameron Simpson <c...@zip.com.au>