Re: PEP 393 vs UTF-8 Everywhere

2017-01-22 Thread Steve D'Aprano
On Mon, 23 Jan 2017 02:19 am, Marko Rauhamaa wrote: > Steve D'Aprano : > >> On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote: >> >>> Steve D'Aprano : >>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote: > Also, [surrogates] don't exist as Unicode code points. Python > shouldn't

Re: PEP 393 vs UTF-8 Everywhere

2017-01-22 Thread Marko Rauhamaa
Steve D'Aprano : > On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote: > >> Steve D'Aprano : >> >>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote: Also, [surrogates] don't exist as Unicode code points. Python shouldn't allow surrogate characters in strings. >>> >>> Not quite. This

Re: PEP 393 vs UTF-8 Everywhere

2017-01-22 Thread Steve D'Aprano
On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote: > Steve D'Aprano : > >> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote: >>> Also, [surrogates] don't exist as Unicode code points. Python >>> shouldn't allow surrogate characters in strings. >> >> Not quite. This is where it gets a bit messy

Re: PEP 393 vs UTF-8 Everywhere

2017-01-22 Thread Marko Rauhamaa
Steve D'Aprano : > On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote: >> Also, [surrogates] don't exist as Unicode code points. Python >> shouldn't allow surrogate characters in strings. > > Not quite. This is where it gets a bit messy and confusing. The bottom > line is: surrogates *are* code po

Re: PEP 393 vs UTF-8 Everywhere

2017-01-22 Thread Marko Rauhamaa
eryk sun : > On Sat, Jan 21, 2017 at 8:21 PM, Pete Forman wrote: >> Marko Rauhamaa writes: >> py> low = '\uDC37' >>> >>> That should raise a SyntaxError exception. >> >> Quite. [...] > > CPython allows surrogate codes for use with the "surrogateescape" and > "surrogatepass" error handlers,
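
The two error handlers named in this message can be tried out directly. What follows is a minimal sketch, not code from the thread: the byte string `raw` is a made-up input, chosen only because it contains one byte that is invalid as UTF-8.

```python
# Hypothetical input: 0xB7 is a stray continuation byte, so this is not valid UTF-8.
raw = b"abc\xb7def"

# "surrogateescape" smuggles each undecodable byte into the string as a
# lone low surrogate in the range U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))  # 'abc\udcb7def'

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw

# "surrogatepass" instead encodes the surrogate code point itself,
# using the three-byte form that strict UTF-8 forbids.
low = "\udc37"
encoded = low.encode("utf-8", errors="surrogatepass")
print(encoded)  # b'\xed\xb0\xb7'
assert encoded.decode("utf-8", errors="surrogatepass") == low
```

Both handlers rely on `str` being able to hold a lone surrogate, which is presumably why CPython accepts the `'\uDC37'` literal instead of rejecting it.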

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steven D'Aprano
On Sunday 22 January 2017 06:58, Tim Chase wrote: > Right. It gets even weirder (edge-case'ier) when dealing with > combining characters: > > s = "man\N{COMBINING TILDE}ana" for i, c in enumerate(s): print("%i: %s" % (i, c)) > ... > 0: m > 1: a > 2: n > 3:˜ > 4: a > 5: n > 6: a '
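
The combining-character example quoted above can be extended with normalization, as a small sketch using only the standard `unicodedata` module. It illustrates that indexing by code point is still not indexing by user-perceived character; the variable name `nfc` is just for this illustration.

```python
import unicodedata

s = "man\N{COMBINING TILDE}ana"

# Seven code points: the tilde is a separate code point following "n".
print(len(s))  # 7

# NFC normalization composes "n" + COMBINING TILDE into the single
# precomposed code point U+00F1 (LATIN SMALL LETTER N WITH TILDE).
nfc = unicodedata.normalize("NFC", s)
print(len(nfc))  # 6
print(nfc == "ma\N{LATIN SMALL LETTER N WITH TILDE}ana")  # True
```

Normalization only helps where a precomposed form exists; sequences without one still occupy several code points per visible character.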

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steve D'Aprano
On Sun, 22 Jan 2017 07:21 am, Pete Forman wrote: > Marko Rauhamaa writes: > >>> py> low = '\uDC37' >> >> That should raise a SyntaxError exception. > > Quite. My point was that with older Python on a narrow build (Windows > and Mac) you need to understand that you are using UTF-16 rather than >

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steve D'Aprano
On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote: > Pete Forman : > >> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8 >> and UTF-32. > > Also, they don't exist as Unicode code points. Python shouldn't allow > surrogate characters in strings. Not quite. This is where it

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Tim Chase
On 2017-01-22 01:44, Steve D'Aprano wrote: > On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote: > > > but I'm hard-pressed to come up with any use case where direct > > indexing into a (non-byte)string makes sense unless you've already > > processed/searched up to that point and can use a recorded ind

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Matt Ruffalo
On 2017-01-21 10:50, Pete Forman wrote: > Thanks for a very thorough reply, most useful. I'm going to pick you up > on the above, though. > > Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8 > and UTF-32. The rules for UTF-8 were tightened up in Unicode 4 and RFC > 3629 (2003)

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread eryk sun
On Sat, Jan 21, 2017 at 8:21 PM, Pete Forman wrote: > Marko Rauhamaa writes: > >>> py> low = '\uDC37' >> >> That should raise a SyntaxError exception. > > Quite. My point was that with older Python on a narrow build (Windows > and Mac) you need to understand that you are using UTF-16 rather than

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Pete Forman
Marko Rauhamaa writes: >> py> low = '\uDC37' > > That should raise a SyntaxError exception. Quite. My point was that with older Python on a narrow build (Windows and Mac) you need to understand that you are using UTF-16 rather than Unicode. On a wide build or Python 3.3+ then all is rosy. (At th

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Marko Rauhamaa
Pete Forman : > Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8 > and UTF-32. Also, they don't exist as Unicode code points. Python shouldn't allow surrogate characters in strings. Thus the range of code points that are available for use as characters is U+0000–U+D7F

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Jussi Piitulainen
Chris Angelico writes: > On Sun, Jan 22, 2017 at 2:56 AM, Jussi Piitulainen wrote: >> Steve D'Aprano writes: >> >> [snip] >> >>> You could avoid that error by increasing the offset by the right >>> amount: >>> >>> stuff = text[offset + len("ф".encode('utf-8')):] >>> >>> which is awful. I believe th

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Chris Angelico
On Sun, Jan 22, 2017 at 2:56 AM, Jussi Piitulainen wrote: > Steve D'Aprano writes: > > [snip] > >> You could avoid that error by increasing the offset by the right >> amount: >> >> stuff = text[offset + len("ф".encode('utf-8')):] >> >> which is awful. I believe that's what Go and Julia expect you t

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Jussi Piitulainen
Steve D'Aprano writes: [snip] > You could avoid that error by increasing the offset by the right > amount: > > stuff = text[offset + len("ф".encode('utf-8')):] > > which is awful. I believe that's what Go and Julia expect you to do. Julia provides a method to get the next index. let text = "ἐπὶ
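
The offset arithmetic in the quoted snippet can be compared against plain code-point indexing. This is a sketch under assumptions: the sample string is invented for illustration (the text used in the thread is not shown here), and a `bytes` object stands in for a string type that indexes by UTF-8 byte offset.

```python
# Hypothetical sample text; "ф" is the marker we want to skip past.
text = "αβγф rest of the text"

# Code-point indexing (Python's str): the next position is simply idx + 1.
idx = text.index("ф")
stuff = text[idx + 1:]

# Byte indexing (a UTF-8 backed string): the step depends on the character.
data = text.encode("utf-8")
marker = "ф".encode("utf-8")
boffset = data.index(marker)
step = len(marker)  # "ф" takes 2 bytes in UTF-8
stuff_bytes = data[boffset + step:]
assert stuff_bytes.decode("utf-8") == stuff

# Stepping by 1 instead lands in the middle of the two-byte sequence.
try:
    data[boffset + 1:].decode("utf-8")
    misaligned = False
except UnicodeDecodeError:
    misaligned = True
print(misaligned)  # True
```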

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Pete Forman
Steve D'Aprano writes: > [...] > Another factor which I didn't see discussed anywhere is that Python > strings treat surrogates as normal code points. I believe that would > be troublesome for a UTF-8 implementation: > > py> '\uDC37'.encode('utf-8') > Traceback (most recent call last): > File "
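
The behaviour quoted above is easy to reproduce. In this short sketch the names `low`, `category` and `raised` are only for illustration: the lone surrogate is accepted as a one-element `str`, is classified as a surrogate code point, and is rejected by the strict UTF-8 codec.

```python
import unicodedata

low = "\udc37"

# The string holds exactly one code point, U+DC37.
assert len(low) == 1 and ord(low) == 0xDC37

# Unicode general category "Cs" means "Surrogate".
category = unicodedata.category(low)
print(category)  # Cs

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    low.encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True
print(raised)  # True
```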

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steve D'Aprano
On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote: > but I'm hard-pressed to come up with any use case where direct > indexing into a (non-byte)string makes sense unless you've already > processed/searched up to that point and can use a recorded index > from that processing/search. Let's take a simp

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steve D'Aprano
On Sat, 21 Jan 2017 09:35 am, Pete Forman wrote: > Can anyone point me at a rationale for PEP 393 being incorporated in > Python 3.3 over using UTF-8 as an internal string representation? I've read over the PEP, and the email discussion, and there is very little mention of UTF-8, and as far as I

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Tim Chase
On 2017-01-21 11:58, Chris Angelico wrote: > So, how could you implement this function? The current > implementation maintains an index - an integer position through the > string. It repeatedly requests the next character as string[idx], > and can also slice the string (to check for keywords like "

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Paul Rubin
Chris Angelico writes: > You can't do a look-ahead with a vanilla string iterator. That's > necessary for a lot of parsers. For JSON? For other parsers you usually have a tokenizer that reads characters with maybe 1 char of lookahead. > Yes, which gives a two-level indexing (first find the stra
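
A tokenizer with one character of lookahead, as described above, can be built on an ordinary string iterator without indexing. The sketch below is illustrative only: `Peekable` and `tokenize_digits` are hypothetical names, and the toy grammar (runs of digits versus single characters) is far simpler than JSON.

```python
class Peekable:
    """Wrap an iterator and buffer one item so it can be inspected early."""

    _END = object()  # sentinel marking exhaustion

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._next = next(self._it, self._END)

    def peek(self):
        """Return the next item without consuming it, or None at the end."""
        return None if self._next is self._END else self._next

    def advance(self):
        """Consume and return the next item, or None at the end."""
        ch = self.peek()
        if ch is not None:
            self._next = next(self._it, self._END)
        return ch


def tokenize_digits(text):
    """Split text into runs of digits and single non-digit characters."""
    src = Peekable(text)
    tokens = []
    while src.peek() is not None:
        ch = src.advance()
        if ch.isdigit():
            # Lookahead decides whether the number continues.
            while src.peek() is not None and src.peek().isdigit():
                ch += src.advance()
        tokens.append(ch)
    return tokens


print(tokenize_digits("[12,345]"))  # ['[', '12', ',', '345', ']']
```

This covers single-character lookahead only; it does not address slicing ahead to match keywords, which is the harder case raised elsewhere in the thread.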

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Jussi Piitulainen
Chris Angelico writes: > On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman wrote: >> I was asserting that most useful operations on strings start from >> index 0. The r* operations would not be slowed down that much as >> UTF-8 has the useful property that attempting to interpret from a >> byte that

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Angelico
On Sat, Jan 21, 2017 at 5:01 PM, Paul Rubin wrote: > Chris Angelico writes: >> decoding JSON... the scanner, which steps through the string and >> does the actual parsing. ... >> The only way for it to be fast enough would be to have some sort of >> retainable string iterator, which means exposin

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Paul Rubin
Chris Angelico writes: > decoding JSON... the scanner, which steps through the string and > does the actual parsing. ... > The only way for it to be fast enough would be to have some sort of > retainable string iterator, which means exposing an opaque "position > marker" that serves no purpose oth

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread MRAB
On 2017-01-21 00:51, Pete Forman wrote: MRAB writes: As someone who has written an extension, I can tell you that I much prefer dealing with a fixed number of bytes per codepoint than a variable number of bytes per codepoint, especially as I'm also supporting earlier versions of Python where t

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Angelico
On Sat, Jan 21, 2017 at 11:51 AM, Pete Forman wrote: > MRAB writes: > >> As someone who has written an extension, I can tell you that I much >> prefer dealing with a fixed number of bytes per codepoint than a >> variable number of bytes per codepoint, especially as I'm also >> supporting earlier

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Angelico
On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman wrote: > I was asserting that most useful operations on strings start from index > 0. The r* operations would not be slowed down that much as UTF-8 has the > useful property that attempting to interpret from a byte that is not at > the start of a seque
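
The self-synchronizing property of UTF-8 mentioned in the quoted text comes from the bit pattern of continuation bytes, which always look like `0b10xxxxxx`. A minimal sketch follows; `sequence_start` is a hypothetical helper and assumes its input is well-formed UTF-8.

```python
def sequence_start(data: bytes, pos: int) -> int:
    """Return the index of the lead byte of the sequence containing data[pos].

    Assumes data is well-formed UTF-8.
    """
    # Continuation bytes match 0b10xxxxxx; step back until we leave them.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


# Sequences of 1, 2, 3 and 4 bytes respectively.
data = "aф€😀".encode("utf-8")
starts = [sequence_start(data, i) for i in range(len(data))]
print(starts)  # [0, 1, 1, 3, 3, 3, 6, 6, 6, 6]
```

Finding a sequence boundary from an arbitrary byte therefore costs at most three backward steps, but mapping a code-point index to a byte offset still requires scanning from a known position.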

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Pete Forman
MRAB writes: > As someone who has written an extension, I can tell you that I much > prefer dealing with a fixed number of bytes per codepoint than a > variable number of bytes per codepoint, especially as I'm also > supporting earlier versions of Python where that was the case. At the risk of s

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Pete Forman
Chris Kaynor writes: > On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman wrote: >> Can anyone point me at a rationale for PEP 393 being incorporated in >> Python 3.3 over using UTF-8 as an internal string representation? >> I've found good articles by Nick Coghlan, Armin Ronacher and others >> on the

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread MRAB
On 2017-01-20 23:06, Chris Kaynor wrote: On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman wrote: Can anyone point me at a rationale for PEP 393 being incorporated in Python 3.3 over using UTF-8 as an internal string representation? I've found good articles by Nick Coghlan, Armin Ronacher and others

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Kaynor
On Fri, Jan 20, 2017 at 3:15 PM, Thomas Nyberg wrote: > On 01/20/2017 03:06 PM, Chris Kaynor wrote: >> >> >> [...snip...] >> >> -- >> Chris Kaynor >> > > I was able to delete my response which was a wholly contained subset of this > one. :) > > > But I have one extra question. Is string

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Angelico
On Sat, Jan 21, 2017 at 10:15 AM, Thomas Nyberg wrote: > But I have one extra question. Is string indexing guaranteed to be > constant-time for python? I thought so, but I couldn't find it documented > anywhere. (Not that I think it practically matters, since it couldn't really > change if it were
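
On the constant-time question: under PEP 393, CPython stores each string with a fixed width of 1, 2 or 4 bytes per code point, chosen from the widest character in the string, which is what makes indexing a simple offset calculation. The sketch below observes that width indirectly through `sys.getsizeof`. It measures a CPython 3.3+ implementation detail, not a language guarantee, and `bytes_per_code_point` is a hypothetical helper.

```python
import sys


def bytes_per_code_point(ch: str, n: int = 1000) -> float:
    """Estimate storage per code point by growing a string of one repeated character."""
    short = ch
    long = ch * (n + 1)
    # Header overhead cancels out because both strings use the same layout.
    return (sys.getsizeof(long) - sys.getsizeof(short)) / n


# ASCII, a BMP character, and an astral (non-BMP) character.
widths = [bytes_per_code_point(ch) for ch in ("a", "ф", "😀")]
print(widths)  # expected on CPython 3.3+: [1.0, 2.0, 4.0]
```

Other implementations are free to choose a different representation, so these numbers should not be relied on outside CPython.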

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Thomas Nyberg
On 01/20/2017 03:06 PM, Chris Kaynor wrote: [...snip...] -- Chris Kaynor I was able to delete my response which was a wholly contained subset of this one. :) But I have one extra question. Is string indexing guaranteed to be constant-time for python? I thought so, but I couldn't

Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Kaynor
On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman wrote: > Can anyone point me at a rationale for PEP 393 being incorporated in > Python 3.3 over using UTF-8 as an internal string representation? I've > found good articles by Nick Coghlan, Armin Ronacher and others on the > matter. What I have not foun