On 03/28/2013 08:34 PM, Neil Hodgson wrote:
Steven D'Aprano:
Any string method that takes a starting offset requires the method to
walk the string byte-by-byte. I've even seen languages put responsibility
for dealing with that onto the programmer: the "start offset" is given in
*bytes*, not characters. I don't remember what language this was... it
might have been Haskell? Whatever it was, it horrified me.
It doesn't horrify me - I've been working this way for over 10 years and it
seems completely natural.
Horrifying or not, I am willing to give up a small amount of speed for correctness. Heck, I'm willing to give up a lot
of speed for correctness. Once I have my slow but correct prototype going I can recode in a faster language (if needed)
and compare its blazingly fast output with my slowly-generated but known-good output.
You can wrap access in iterators that hide the byte offsets if you like.
This then ensures that all operations on those iterators are safe, only
allowing the iterator to point at the start/end of valid characters.
Sure. Or I can let Python handle it for me.
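To make the offset difference concrete, here is a minimal sketch in Python 3. The sample string and variable names are made up for illustration; the point is only that str offsets count characters (code points), while offsets into the UTF-8 encoded bytes count bytes and can land in the middle of a character.

```python
# Illustrative sample text: "naïve café", written with escapes so the
# accented letters are definitely single precomposed code points.
text = "na\u00efve caf\u00e9"
data = text.encode("utf-8")

# str.find() reports an offset in characters.
char_pos = text.find("caf\u00e9")

# bytes.find() reports an offset in bytes; the two-byte 'ï' earlier in the
# string shifts everything after it by one.
byte_pos = data.find("caf\u00e9".encode("utf-8"))

# A byte offset chosen without care can split a multi-byte character,
# which is exactly what the iterator wrappers are meant to prevent.
try:
    data[:3].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(char_pos, byte_pos, split_ok)
```

With str, slicing at char_pos always yields whole characters; with raw UTF-8 bytes the programmer has to guarantee that themselves.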
The counter-problem is that a French document that needs to include one
mathematical symbol (or emoji) outside Latin-1 will double in size as a
Python string.
True. But how often do you have the entire document as a single string? Use readlines() instead of read(). Besides,
memory is cheap.
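A rough way to see both effects, assuming CPython 3.3 or later (where the flexible string representation stores 1, 2, or 4 bytes per character depending on the widest character in the string). The "document" below is a made-up stand-in, and sys.getsizeof figures are implementation details that vary between versions, so treat the numbers as indicative only. Note that a BMP symbol such as U+2200 widens a string to 2 bytes per character; an emoji outside the BMP would widen it to 4.

```python
import sys

# Hypothetical stand-in for a mostly Latin-1 French document:
# 100 lines of 1000 characters each.
line = "\u00e9" * 1000                     # 'é' fits in 1 byte per character
lines = [line] * 99 + [line + "\u2200"]    # one line also contains U+2200

# Whole document held as a single string, without and with the symbol.
latin_doc = "".join([line] * 100)
mixed_doc = "".join(lines)
latin_size = sys.getsizeof(latin_doc)
mixed_size = sys.getsizeof(mixed_doc)

# Same text held as one string per line: only the line containing the
# symbol is widened. (The same line object is counted 99 times here to
# model 99 distinct lines; list overhead is ignored.)
per_line_total = sum(sys.getsizeof(s) for s in lines)

print("single string, Latin-1 only:", latin_size)
print("single string, one symbol:  ", mixed_size)
print("one string per line:        ", per_line_total)
```

Each separate string carries its own fixed overhead, so splitting into lines only pays off when the lines are reasonably long.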
--
~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list