Re: Bug report: U+3000 IDEOGRAPHIC SPACE isn't treated as whitespace

Marnen Laibow-Koser Thu, 15 Feb 2018 12:10:13 -0800

David Kastrup wrote:

> \lyricmode does not mean "Paste arbitrary text here".
How is this relevant to anything I wrote?


> LilyPond intentionally uses exclusively the ASCII character range for
> syntactic purposes.


...except it doesn't, as stated.  Lilypond source files aren't encoded in
ASCII, and anything in a source file is (potentially) syntactic, at least
as I understand that word.  Lilypond source files can contain (AFAIK) any
Unicode character, encoded as UTF-8.  And that fact, I believe, means that
Lilypond has to have at least a modicum of awareness of the properties of
those characters as guaranteed by the Unicode standard.

Consider:

1. Lilypond already recognizes multiple word-break characters: space
(U+0020), newline, tab, and so on.
2. U+3000 IDEOGRAPHIC SPACE has essentially the same semantics as U+0020
SPACE (the differences are presentational, and the two characters are
separate in Unicode largely due to historical accident).
3. Given 1. and 2., I think that it's silly to treat U+3000 semantically
differently from U+0020 just because it happens not to match a certain
7-bit legacy encoding. :)
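
To make points 1-3 concrete, here is a minimal sketch in Python (chosen
only for illustration; it says nothing about how Lilypond's own lexer is
written).  It uses the standard unicodedata module to compare the two
characters, and str.split() to show a whitespace-aware split.  The helper
name describe() and the sample string are my own inventions.

```python
import unicodedata

def describe(ch):
    """Return (Unicode name, general category, is-whitespace?) for one character."""
    return unicodedata.name(ch), unicodedata.category(ch), ch.isspace()

# Both characters are in general category Zs (space separator).
ascii_space = describe("\u0020")
ideographic_space = describe("\u3000")

# str.split() with no arguments splits on any Unicode whitespace,
# so U+3000 and U+0020 both act as word breaks here.
words = "ni\u3000hao ma".split()

print(ascii_space)
print(ideographic_space)
print(words)
```

A library that already knows the Unicode whitespace properties treats the
two characters alike without any special-casing by the caller.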



> Everything else can be part of identifiers or
> words.


Any character can be part of a word, including {, }, \, space, and all the
rest.  That's why we have quoting constructs: "this is a syllable with { }
in it".  If the user wants a syllable with a space in it -- ideographic or
otherwise -- I think that he *should* be forced to quote it.

As for identifiers...are you saying that U+3000 IDEOGRAPHIC SPACE can be
part of an identifier?  If so...just...wow.  I think this is a bad state of
affairs.  I don't think *any* (breaking) space character should be legal in
an identifier (at least with Lilypond's syntax generally allowing spaces as
delimiters).



> That makes LilyPond documents robust against changes in Unicode.


No, *Unicode's own stability policies* make Lilypond documents (and
everything else) robust against changes in Unicode.

For background, please see the Unicode Standard, especially v. 10 §3.5 (
http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf#page=28 ), as well
as UAX #44 ( http://unicode.org/reports/tr44 ) and the various stability
policies on http://www.unicode.org/policies/stability_policy.htm
(especially the Property Stability Policy and the Property Value Stability
Policies).  For issues of word segmentation and identifier syntax
specifically, please see UAX #29 ( http://www.unicode.org/reports/tr29 )
and UAX #31 ( http://www.unicode.org/reports/tr31 ) respectively.

Basically, Unicode defines properties Pattern_White_Space and
Pattern_Syntax (and some others) for identifier syntax, and White_Space for
general purposes as well.  In particular, the Pattern_* properties are
*immutable*; that is, once defined for a character in a given version of
the Unicode Standard, they are guaranteed to be the same for that character
in every future version.  There is a larger guarantee too, namely that a
string legal as an identifier under one version of the Standard will stay
legal as an identifier under every future version.

In reality, you probably don't need to worry about the exact minutiae of
these properties in most cases: every decent programming language these
days has at least one Unicode string library that has already implemented
logic based on them.

But the conclusion here, I think, is that changes in Unicode are not
something that we really need to worry about in this respect.  Having
established that, we can move on to what behavior will surprise the
experienced Lilypond user least.  For myself, I was *extremely* surprised
that U+3000 doesn't behave like every other space, and so I don't think
this is desirable behavior at all.

Best,
-- 
Marnen Laibow-Koser
mar...@marnen.org
http://www.marnen.org
_______________________________________________
bug-lilypond mailing list
bug-lilypond@gnu.org
https://lists.gnu.org/mailman/listinfo/bug-lilypond
