On Oct 27, 2019, at 18:00, Steven D'Aprano <[email protected]> wrote:
>
> On Sun, Oct 27, 2019 at 10:07:41AM -0700, Andrew Barnert via Python-ideas
> wrote:
>
>>> File "/home/rosuav/tmp/demo.py", line 1
>>> print("Hello, world!')
>>> ^
>>> SyntaxError: EOL while scanning string literal
>>
>> So if those 12 glyphs take 14 code units
>
> I'm not really sure how glyphs (the graphical representation of a
> character) comes into this discussion
Because, assuming you're using a monospace font, the number of glyphs before
the error is how many spaces you need to indent.
This example happens to be pure ASCII, so the counts of glyphs, extended
grapheme clusters, code units, and code points all happen to be the same. But
just change that e to an è made of two code points, a base letter plus a
combining accent (as the ç in your previous example might have been), and now
there is still the same number of glyphs and clusters, but one more code point
and at least one more code unit.
Extended grapheme clusters are intended to be the best approximation of
“characters” in the Unicode standard. Code units are not.
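
To make the difference concrete, here is a minimal sketch that counts one string several ways. The helper name counts and the "Hèllo" sample are mine, purely for illustration, and the cluster count is only an approximation (it treats every non-combining code point as the start of a cluster rather than implementing the full Unicode segmentation rules):

```python
import unicodedata


def counts(text: str) -> dict:
    """Count one string in several different units (illustrative helper)."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
        # Rough stand-in for extended grapheme clusters: every code point
        # that is not a combining mark is assumed to start a new cluster.
        "approx_clusters": sum(
            1 for ch in text if unicodedata.combining(ch) == 0
        ),
    }


composed = unicodedata.normalize("NFC", "H\u00e8llo")    # è as one code point
decomposed = unicodedata.normalize("NFD", composed)      # e + combining grave

print(counts(composed))
print(counts(decomposed))
```

The two strings render identically and have the same approximate cluster count, but the decomposed one has one more code point and more code units.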
> but for what it's worth, I
> count 22, not 12 (excluding the leading spaces).
Sorry; that was a typo. Plus, I miscounted on top of the typo; I meant to count
the spaces.
>> because you’re using Stephen’s string and it’s in NFKD, getting 14 and
>> then indenting two spaces too many (as Python does today)
>
> You mean something like this?
>
>
> py> value = äë +* 42
> File "<stdin>", line 1
> value = äë +* 42
> ^
> SyntaxError: invalid syntax
>
> (the identifier is 'a\N{COMBINING DIAERESIS}e\N{COMBINING DIAERESIS}')
>
> Yes, that looks like a bug to me, but a super low priority one to fix.
Yes. This is a general bug: it shows up everywhere that you count code units
intending to use that as a count of glyphs or characters, in Python itself, in
third-party libraries, and in applications. This is one of the most trivial
examples, and you obviously wouldn’t break backward compatibility with
everything solely to fix this example.
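
For illustration only, here is a rough sketch of what a "count columns, not code points" caret would look like. The names display_columns and caret_line are hypothetical, not anything in CPython, and the width rule (skip combining marks, count East Asian wide and fullwidth characters as two columns) is a common approximation that still ignores emoji sequences, tabs, and terminal quirks:

```python
import unicodedata


def display_columns(text: str) -> int:
    """Estimate how many monospace columns `text` occupies."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks share the column of their base letter
        # Wide and fullwidth East Asian characters usually take two columns.
        columns += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return columns


def caret_line(source_line: str, offset: int) -> str:
    """Build a caret line for an error at 0-based code-point `offset`."""
    return " " * display_columns(source_line[:offset]) + "^"


line = "value = a\u0308e\u0308 +* 42"
offset = line.index("*")          # 14 when counted in code points
print(line)
print(caret_line(line, offset))   # indented by 12 columns, not 14
```

Counting code points gives 14 for this line, while the column estimate gives 12, which is the two-space difference discussed above.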
And I don’t know why I have to keep repeating this, but one more time: I’m not
proposing to change Python, I’m arguing to _not_ change Python, because it’s
already good enough, and the suggested improvement wouldn’t make it right
despite breaking lots of code, and making it right is a big thing that would
break even more code. If I were designing a new language, I would do it right
from the start, and it would not have this bug, or any of the other
manifestations of the same issue, but Python 4000 (or even 5000) is not an
opportunity to design a new language.
(And to be clear: Python’s design made perfect sense when it was chosen;
Unicode has just gotten more complicated since then. In fact, most other
languages that adopted Unicode as early as Python got permanently stuck with
the UCS-2 assumption, forcing all user code to deal with UTF-16 code units
forever.)
> (This is assuming that the Python interpreter promises to line the caret
> up with the offending symbol "always", rather than just making a best
> effort to do so.)
Well, the reason I called it a good-enough best effort is because I assume that
it’s only meant to be a best effort, and I think it’s good enough for that.
I’m not the one who said people would be up in arms if that were broken, I’m
the one arguing that people are fine with it being broken as long as it’s
usually good enough.
> And probably tough to fix too: I think you need to count in grapheme
> clusters, not code points,
Yes, that’s the whole point of the message you were responding to: extended
grapheme clusters are the Unicode approximation of characters; code units are
not. And a randomly-accessible sequence of grapheme clusters is impossible to
do efficiently, but a sized iterable container, or a sequence-like thing that’s
indexable by special indexes but not by integers, is. So tying the string type
even more closely to code units would not fix it; changing the way it works as
a Sequence would not fix it.
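
As a sketch of the "sized iterable but not integer-indexable" shape I mean (the class name Clusters is made up, and the segmentation is a crude approximation that only attaches combining marks to the preceding base character, not the real extended grapheme cluster rules):

```python
import unicodedata
from typing import Iterator


class Clusters:
    """Sized, iterable view of a string's (approximate) grapheme clusters.

    There is deliberately no __getitem__: finding the n-th cluster needs a
    linear scan, so integer indexing is not offered.
    """

    def __init__(self, text: str) -> None:
        self._text = text

    def __iter__(self) -> Iterator[str]:
        cluster = ""
        for ch in self._text:
            # A non-combining code point starts a new cluster.
            if cluster and unicodedata.combining(ch) == 0:
                yield cluster
                cluster = ""
            cluster += ch
        if cluster:
            yield cluster

    def __len__(self) -> int:
        # Linear time: the count is not stored anywhere.
        return sum(1 for _ in self)


name = Clusters("a\u0308e\u0308")
print(len(name), list(name))
```

Iterating and taking the length both work, while name[0] raises TypeError because the class defines no integer indexing.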
> but even that won't fix the issue since it
> leaves you open to the *opposite* problem of undercounting if the
> terminal or IDE fails to display combining characters properly:
>
> value = a¨e¨ +* 42
> ^
> SyntaxError: invalid syntax
>
> I had to fake the above, because I couldn't find a terminal on my system
> which would misdisplay COMBINING DIAERESIS, but I've seen editors do it.
That’s a matter of working around broken editors, terminals, and IDEs, which do
exist, but are uncommon, and getting less common. Not having a workaround for a
broken editor that most people don’t use is not a bug in the same way as being
broken in a properly-working environment is.
(Not having a workaround for something broken that half the users in the world
have to deal with, like Windows cmd.exe, would be a different story, of course.
You can claim that it’s Windows’ bug, not yours, but that won’t make users
happy. But I’m pretty sure that’s not an issue here.)
> Handling text in its full generality, including combining characters,
> emojis, flags, East Asian wide character, etc, is really tough to do
> right. For the Python interpreter, it would require a huge amount of
> extra work for barely any payoff since 99.9% of Python syntax errors are
> not going to include any of the funny cases.
Obviously you wouldn’t redesign the whole text API just to make syntax error
carets line up. You would do that to make thousands of different things easier
to write correctly, and lining up those carets is just one of those things, and
nowhere near the most important one.
>>> Well, either that, or we need to make it so that " "*<AbstractIndex
>>> object at 0xb7ce1bf0> results in the correct number of spaces to
>>> indent it to that position. That ought to bring in plenty of
>>> pitchforks...
>>
>> Would you still bring pitchforks for " " * StrIndex(chars=12, points=14,
>> bytes=22)?
>
> Hell yes. If I need 12 spaces, why should I be forced to work out how
> many bytes the interpreter uses for that?
If you know you need 12 spaces, you just multiply by 12; why do you think you
need to work anything out? Adding str * StrIndex doesn’t require taking away
str * int.
Your example implied that you would be working out that count in some way—say,
by calling str.find—and that you and many others would be horrified if that
return value were not an integer, but you could multiply it by a string anyway.
I don’t know why you see anything wrong with that, but I guessed that maybe it
was because you couldn’t see, at the REPL, how many spaces you were
multiplying. Having the thing that’s returned by str.find have the repr
StrIndex(chars=12, points=14, bytes=22) instead of the generic repr would
solve that. If that isn’t your problem with being able to multiply a str by a
StrIndex, then I have no other guesses for what you think people would be
raising pitchforks over.
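
A minimal sketch of what I have in mind, assuming a hypothetical StrIndex type (nothing like it exists in Python today, and the field meanings in the comments are my assumptions). It only shows that str * StrIndex can coexist with str * int and that the repr can be readable:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrIndex:
    """Hypothetical position in a string, measured in three units."""

    chars: int   # user-perceived characters (grapheme clusters)
    points: int  # code points
    bytes: int   # bytes in some encoding, assumed here to be UTF-8

    def __rmul__(self, other):
        # " " * index repeats the string once per character position.
        if isinstance(other, str):
            return other * self.chars
        return NotImplemented


index = StrIndex(chars=12, points=14, bytes=22)
print(repr(index))
print(repr(" " * index))   # same as " " * 12
print(repr(" " * 12))      # plain int multiplication is untouched
```

Because the type defines __rmul__ rather than __index__, it can be multiplied by a string without silently becoming usable as an ordinary integer index.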
>> This is all simple stuff; I don’t get the incredulity
>> that it could possibly be done. (Especially given that there are other
>> languages that do exactly the same thing, like Swift, which ought to
>> be proof that it’s not impossible.)
>
> Can you link to an explanation of what Swift *actually* does, in detail?
The reference documentation for String starts at
https://developer.apple.com/documentation/swift/string. (It should be the first
thing that comes up in any search engine for Swift string.) You can follow the
links from there to String.Index and String.Iterator, and from either of those
to BidirectionalCollection, and from there to Collection, which explains how
indexing works in general.
There’s probably an easier-to-understand description at
https://docs.swift.org/swift-book but it may not explain *exactly* what it
does, because it’s meant as a user guide.
Two things that may be confusing: Swift uses the exact same words as Python for
its iteration/etc. protocols but all with different meanings (e.g., a Swift
Sequence is a Python Iterable; a Python Sequence is a Swift
IndexableCollection; etc.), and Swift makes heavy use of static typing (e.g.,
just as there are no separate display literals for Array, Set, etc., there are
no separate display literals for Character and String; the literal "x" is a
Character if you store it in a Character lvalue, a len-1 String if you store it
in a String, and a type error if you store it in a Double).