On Mon, 23 Jan 2017 02:19 am, Marko Rauhamaa wrote:
> Steve D'Aprano :
>
>> On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
>>
>>> Steve D'Aprano :
>>>
On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
> Also, [surrogates] don't exist as Unicode code points. Python
> shouldn't
Steve D'Aprano :
> On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
>
>> Steve D'Aprano :
>>
>>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
Also, [surrogates] don't exist as Unicode code points. Python
shouldn't allow surrogate characters in strings.
>>>
>>> Not quite. This
On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
> Steve D'Aprano :
>
>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>> Also, [surrogates] don't exist as Unicode code points. Python
>>> shouldn't allow surrogate characters in strings.
>>
>> Not quite. This is where it gets a bit messy
Steve D'Aprano :
> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>> Also, [surrogates] don't exist as Unicode code points. Python
>> shouldn't allow surrogate characters in strings.
>
> Not quite. This is where it gets a bit messy and confusing. The bottom
> line is: surrogates *are* code po
eryk sun :
> On Sat, Jan 21, 2017 at 8:21 PM, Pete Forman wrote:
>> Marko Rauhamaa writes:
>>
py> low = '\uDC37'
>>>
>>> That should raise a SyntaxError exception.
>>
>> Quite. [...]
>
> CPython allows surrogate codes for use with the "surrogateescape" and
> "surrogatepass" error handlers,
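Editorial illustration, not part of the quoted message: a minimal sketch of how the two error handlers named above behave in CPython 3. The variable names are made up for this example.

```python
# A lone low surrogate, the same value used in the thread's example.
lone = "\udc37"

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    strict_error = exc
else:
    strict_error = None

# "surrogatepass" encodes the surrogate as if it were an ordinary code point.
passed = lone.encode("utf-8", "surrogatepass")

# "surrogateescape" smuggles undecodable bytes through str as lone
# surrogates in the range U+DC80..U+DCFF, so they can be restored later.
raw = b"caf\xff"
text = raw.decode("utf-8", "surrogateescape")
round_tripped = text.encode("utf-8", "surrogateescape")
```

The round trip is why these code points are allowed inside str objects at all: they carry bytes that could not be decoded, for example in file names.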
On Sunday 22 January 2017 06:58, Tim Chase wrote:
> Right. It gets even weirder (edge-case'ier) when dealing with
> combining characters:
>
>
> >>> s = "man\N{COMBINING TILDE}ana"
> >>> for i, c in enumerate(s): print("%i: %s" % (i, c))
> ...
> 0: m
> 1: a
> 2: n
> 3:˜
> 4: a
> 5: n
> 6: a
'
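Editorial illustration, not part of the quoted message: a small sketch of the combining-character case above, using the standard unicodedata module. It shows that the code point count depends on the normalization form, so an index into the string does not necessarily correspond to a user-visible character.

```python
import unicodedata

# Seven code points: "n" is followed by a separate combining tilde.
s = "man\N{COMBINING TILDE}ana"

# NFC composes "n" + U+0303 into the single code point U+00F1.
composed = unicodedata.normalize("NFC", s)

# NFD splits it apart again.
decomposed = unicodedata.normalize("NFD", composed)

lengths = (len(s), len(composed), len(decomposed))
```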
On Sun, 22 Jan 2017 07:21 am, Pete Forman wrote:
> Marko Rauhamaa writes:
>
>>> py> low = '\uDC37'
>>
>> That should raise a SyntaxError exception.
>
> Quite. My point was that with older Python on a narrow build (Windows
> and Mac) you need to understand that you are using UTF-16 rather than
>
On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
> Pete Forman :
>
>> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
>> and UTF-32.
>
> Also, they don't exist as Unicode code points. Python shouldn't allow
> surrogate characters in strings.
Not quite. This is where it
On 2017-01-22 01:44, Steve D'Aprano wrote:
> On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote:
>
> > but I'm hard-pressed to come up with any use case where direct
> > indexing into a (non-byte)string makes sense unless you've already
> > processed/searched up to that point and can use a recorded ind
On 2017-01-21 10:50, Pete Forman wrote:
> Thanks for a very thorough reply, most useful. I'm going to pick you up
> on the above, though.
>
> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
> and UTF-32. The rules for UTF-8 were tightened up in Unicode 4 and RFC
> 3629 (2003)
On Sat, Jan 21, 2017 at 8:21 PM, Pete Forman wrote:
> Marko Rauhamaa writes:
>
>>> py> low = '\uDC37'
>>
>> That should raise a SyntaxError exception.
>
> Quite. My point was that with older Python on a narrow build (Windows
> and Mac) you need to understand that you are using UTF-16 rather than
Marko Rauhamaa writes:
>> py> low = '\uDC37'
>
> That should raise a SyntaxError exception.
Quite. My point was that with older Python on a narrow build (Windows
and Mac) you need to understand that you are using UTF-16 rather than
Unicode. On a wide build or Python 3.3+ then all is rosy. (At th
Pete Forman :
> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
> and UTF-32.
Also, they don't exist as Unicode code points. Python shouldn't allow
surrogate characters in strings.
Thus the range of code points that are available for use as
characters is U+0000–U+D7F
Chris Angelico writes:
> On Sun, Jan 22, 2017 at 2:56 AM, Jussi Piitulainen wrote:
>> Steve D'Aprano writes:
>>
>> [snip]
>>
>>> You could avoid that error by increasing the offset by the right
>>> amount:
>>>
>>> stuff = text[offset + len("ф".encode('utf-8')):]
>>>
>>> which is awful. I believe th
On Sun, Jan 22, 2017 at 2:56 AM, Jussi Piitulainen
wrote:
> Steve D'Aprano writes:
>
> [snip]
>
>> You could avoid that error by increasing the offset by the right
>> amount:
>>
>> stuff = text[offset + len("ф".encode('utf-8')):]
>>
>> which is awful. I believe that's what Go and Julia expect you t
Steve D'Aprano writes:
[snip]
> You could avoid that error by increasing the offset by the right
> amount:
>
> stuff = text[offset + len("ф".encode('utf-8')):]
>
> which is awful. I believe that's what Go and Julia expect you to do.
Julia provides a method to get the next index.
let text = "ἐπὶ
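Editorial illustration, not part of the quoted message: a sketch in Python of the offset arithmetic Steve describes, emulating byte-indexed strings with a bytes object. The sample text and variable names are assumptions made for this example.

```python
# Emulate a UTF-8 string with byte offsets.
data = "abc ф def".encode("utf-8")
needle = "ф".encode("utf-8")   # two bytes in UTF-8

offset = data.find(needle)     # byte offset of the match

# Skipping the match means advancing by its encoded length, not by 1.
stuff = data[offset + len(needle):]

# Advancing by one byte lands in the middle of the sequence, and the
# remainder is no longer valid UTF-8.
try:
    data[offset + 1:].decode("utf-8")
except UnicodeDecodeError as exc:
    mid_sequence_error = exc
else:
    mid_sequence_error = None
```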
Steve D'Aprano writes:
> [...]
> Another factor which I didn't see discussed anywhere is that Python
> strings treat surrogates as normal code points. I believe that would
> be troublesome for a UTF-8 implementation:
>
> py> '\uDC37'.encode('utf-8')
> Traceback (most recent call last):
> File "
On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote:
> but I'm hard-pressed to come up with any use case where direct
> indexing into a (non-byte)string makes sense unless you've already
> processed/searched up to that point and can use a recorded index
> from that processing/search.
Let's take a simp
On Sat, 21 Jan 2017 09:35 am, Pete Forman wrote:
> Can anyone point me at a rationale for PEP 393 being incorporated in
> Python 3.3 over using UTF-8 as an internal string representation?
I've read over the PEP, and the email discussion, and there is very little
mention of UTF-8, and as far as I
On 2017-01-21 11:58, Chris Angelico wrote:
> So, how could you implement this function? The current
> implementation maintains an index - an integer position through the
> string. It repeatedly requests the next character as string[idx],
> and can also slice the string (to check for keywords like "
Chris Angelico writes:
> You can't do a look-ahead with a vanilla string iterator. That's
> necessary for a lot of parsers.
For JSON? For other parsers you usually have a tokenizer that reads
characters with maybe 1 char of lookahead.
> Yes, which gives a two-level indexing (first find the stra
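Editorial illustration, not part of the quoted message: a minimal sketch of a tokenizer that needs only one character of lookahead, built on a plain string iterator instead of integer indexing. The Peekable class and tokenize function are hypothetical helpers written for this example; this is not how the json module is implemented.

```python
class Peekable:
    """Wrap an iterator and allow looking at the next item without consuming it."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._buf = []

    def peek(self):
        # Return the next item, or None when the input is exhausted.
        if not self._buf:
            try:
                self._buf.append(next(self._it))
            except StopIteration:
                return None
        return self._buf[0]

    def next(self):
        # Consume and return the next item, or None at the end.
        item = self.peek()
        if item is not None:
            self._buf.pop()
        return item


def tokenize(text):
    """Split text into runs of digits and single punctuation characters."""
    chars = Peekable(text)
    tokens = []
    while chars.peek() is not None:
        ch = chars.next()
        if ch.isspace():
            continue
        if ch.isdigit():
            digits = [ch]
            # One character of lookahead decides where the number ends.
            while chars.peek() is not None and chars.peek().isdigit():
                digits.append(chars.next())
            tokens.append("".join(digits))
        else:
            tokens.append(ch)
    return tokens
```

For example, tokenize("[12, 345]") yields the tokens '[', '12', ',', '345' and ']' without ever indexing into the string.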
Chris Angelico writes:
> On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman wrote:
>> I was asserting that most useful operations on strings start from
>> index 0. The r* operations would not be slowed down that much as
>> UTF-8 has the useful property that attempting to interpret from a
>> byte that
On Sat, Jan 21, 2017 at 5:01 PM, Paul Rubin wrote:
> Chris Angelico writes:
>> decoding JSON... the scanner, which steps through the string and
>> does the actual parsing. ...
>> The only way for it to be fast enough would be to have some sort of
>> retainable string iterator, which means exposin
Chris Angelico writes:
> decoding JSON... the scanner, which steps through the string and
> does the actual parsing. ...
> The only way for it to be fast enough would be to have some sort of
> retainable string iterator, which means exposing an opaque "position
> marker" that serves no purpose oth
On 2017-01-21 00:51, Pete Forman wrote:
MRAB writes:
As someone who has written an extension, I can tell you that I much
prefer dealing with a fixed number of bytes per codepoint than a
variable number of bytes per codepoint, especially as I'm also
supporting earlier versions of Python where t
On Sat, Jan 21, 2017 at 11:51 AM, Pete Forman wrote:
> MRAB writes:
>
>> As someone who has written an extension, I can tell you that I much
>> prefer dealing with a fixed number of bytes per codepoint than a
>> variable number of bytes per codepoint, especially as I'm also
>> supporting earlier
On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman wrote:
> I was asserting that most useful operations on strings start from index
> 0. The r* operations would not be slowed down that much as UTF-8 has the
> useful property that attempting to interpret from a byte that is not at
> the start of a seque
MRAB writes:
> As someone who has written an extension, I can tell you that I much
> prefer dealing with a fixed number of bytes per codepoint than a
> variable number of bytes per codepoint, especially as I'm also
> supporting earlier versions of Python where that was the case.
At the risk of s
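Editorial illustration, not part of the quoted message: a sketch of the fixed-width versus variable-width trade-off MRAB mentions, using encoded bytes to stand in for an internal representation. The two helper functions are hypothetical and written only for this example.

```python
# One code point each of 1, 2, 3 and 4 UTF-8 bytes.
text = "a\u00e9\u20ac\U0001F600"

utf8_lengths = [len(ch.encode("utf-8")) for ch in text]
utf32_lengths = [len(ch.encode("utf-32-le")) for ch in text]


def codepoint_at_utf8(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-8 data by scanning from the start."""
    # Every byte that is not a continuation byte (10xxxxxx) starts a code point.
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    start = starts[n]
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[start:end].decode("utf-8")


def codepoint_at_utf32(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-32-LE data by direct arithmetic."""
    return data[4 * n:4 * n + 4].decode("utf-32-le")


third_via_utf8 = codepoint_at_utf8(text.encode("utf-8"), 2)
third_via_utf32 = codepoint_at_utf32(text.encode("utf-32-le"), 2)
```

With a fixed width the byte position is a multiplication; with UTF-8 it requires a scan, unless an index has been recorded by earlier processing.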
Chris Kaynor writes:
> On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman wrote:
>> Can anyone point me at a rationale for PEP 393 being incorporated in
>> Python 3.3 over using UTF-8 as an internal string representation?
>> I've found good articles by Nick Coghlan, Armin Ronacher and others
>> on the
On 2017-01-20 23:06, Chris Kaynor wrote:
On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman wrote:
Can anyone point me at a rationale for PEP 393 being incorporated in
Python 3.3 over using UTF-8 as an internal string representation? I've
found good articles by Nick Coghlan, Armin Ronacher and others
.
On Fri, Jan 20, 2017 at 3:15 PM, Thomas Nyberg wrote:
> On 01/20/2017 03:06 PM, Chris Kaynor wrote:
>>
>>
>> [...snip...]
>>
>> --
>> Chris Kaynor
>>
>
> I was able to delete my response which was a wholly contained subset of this
> one. :)
>
>
> But I have one extra question. Is string
On Sat, Jan 21, 2017 at 10:15 AM, Thomas Nyberg wrote:
> But I have one extra question. Is string indexing guaranteed to be
> constant-time for python? I thought so, but I couldn't find it documented
> anywhere. (Not that I think it practically matters, since it couldn't really
> change if it were
On 01/20/2017 03:06 PM, Chris Kaynor wrote:
[...snip...]
--
Chris Kaynor
I was able to delete my response which was a wholly contained subset of
this one. :)
But I have one extra question. Is string indexing guaranteed to be
constant-time for python? I thought so, but I couldn't
On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman wrote:
> Can anyone point me at a rationale for PEP 393 being incorporated in
> Python 3.3 over using UTF-8 as an internal string representation? I've
> found good articles by Nick Coghlan, Armin Ronacher and others on the
> matter. What I have not foun