unicode and the FSR [was: Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

Ethan Furman Thu, 28 Mar 2013 22:05:54 -0700

On 03/28/2013 08:34 PM, Neil Hodgson wrote:

Steven D'Aprano:

Any string method that takes a starting offset requires the method to
walk the string byte-by-byte. I've even seen languages put responsibility
for dealing with that onto the programmer: the "start offset" is given in
*bytes*, not characters. I don't remember what language this was... it
might have been Haskell? Whatever it was, it horrified me.


    It doesn't horrify me - I've been working this way for over 10 years and it seems completely natural.

Horrifying or not, I am willing to give up a small amount of speed for correctness. Heck, I'm willing to give up a lot of speed for correctness. Once I have my slow but correct prototype going I can recode in a faster language (if needed) and compare its blazingly fast output with my slowly-generated but known-good output.

    You can wrap access in iterators that hide the byte offsets if you like. This then ensures that all operations on those iterators are safe, only allowing the iterator to point at the start/end of valid characters.


Sure.  Or I can let Python handle it for me.
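
To make the trade-off concrete, here is a minimal sketch of the two approaches: finding a character's position in UTF-8 bytes by walking them, versus letting Python's str index by code point. The helper byte_offset_of_char is hypothetical, written only for this illustration, and it assumes its input is valid UTF-8.

```python
def byte_offset_of_char(data: bytes, char_index: int) -> int:
    """Return the byte offset at which code point number char_index starts.

    Hypothetical helper for illustration; assumes data is valid UTF-8.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 0b10xxxxxx continue a character;
        # any other byte starts a new one.
        if byte & 0xC0 != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:  # one past the last character
        return len(data)
    raise IndexError(char_index)


text = "naïve café"
data = text.encode("utf-8")

start = byte_offset_of_char(data, 3)       # linear scan over the bytes
from_bytes = data[start:].decode("utf-8")  # slice by byte offset
from_str = text[3:]                        # str indexes by code point

print(start, from_bytes == from_str)
```

The scan is what "walk the string byte-by-byte" amounts to; with str the same slice needs no helper at all.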

    The counter-problem is that a French document that needs to include one mathematical symbol (or emoji) outside Latin-1 will double in size as a Python string.
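
The widening can be observed directly. The following is a rough sketch that assumes CPython 3.3 or later (where the flexible string representation applies); measured_width is a hypothetical helper that estimates bytes per code point from how much sys.getsizeof grows when a string is doubled, which cancels out the fixed object header.

```python
import sys


def measured_width(s: str) -> int:
    """Bytes per code point, estimated as the growth from s to s + s.

    Hypothetical helper; the result is specific to CPython 3.3+.
    """
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) // len(s)


latin1 = "é" * 1000                  # every code point fits in Latin-1
with_symbol = latin1 + "\u2211"      # one N-ARY SUMMATION sign (in the BMP)
with_emoji = latin1 + "\U0001F600"   # one emoji outside the BMP

for name, s in [("latin1", latin1),
                ("with_symbol", with_symbol),
                ("with_emoji", with_emoji)]:
    print(name, measured_width(s), sys.getsizeof(s))
```

On CPython 3.3+ the widths should come out as 1, 2 and 4: a single symbol from the Basic Multilingual Plane doubles the per-character storage, and a single character beyond it quadruples it.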

True. But how often do you have the entire document as a single string? Use readlines() instead of read(). Besides, memory is cheap.
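
As a sketch of why working line by line limits the damage: each line is its own string, so only the lines that actually contain a wide character get the wider storage. The helper names below are hypothetical, the width rule is the one described in PEP 393, and io.StringIO stands in for an open text file. Note that readlines() still loads every line into a list, whereas iterating over the file object yields one line at a time, and that each separate string carries its own fixed overhead, so this is not automatically a net saving for very short lines.

```python
import io
from collections import Counter


def pep393_width(s: str) -> int:
    """Bytes per code point that PEP 393 storage would use for s."""
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


def width_histogram(lines) -> Counter:
    """Count how many lines fall into each storage width."""
    return Counter(pep393_width(line) for line in lines)


doc = "Le théorème est démontré.\n" * 1000 + "Donc ∑ converge.\n"

whole_width = pep393_width(doc)     # the single big string is widened
stream = io.StringIO(doc)           # stand-in for open(path, encoding=...)
per_line = width_histogram(stream)  # iterate lazily, one line at a time

print(whole_width, dict(per_line))
```

Here the whole document as one string needs 2 bytes per code point, while of the 1001 individual lines only the one containing the summation sign does.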


--
~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list
