On 3/22/2013 4:08 AM, Philippe Verdy wrote:
> 2013/3/22 Asmus Freytag <[email protected]>:
>> If you need to annotate text with the results of semantic analysis as
>> performed by a human reader, then you either need XML, or some other format
>> that can express that particular intent.
> Absolutely NO. If this encodes semantics, this is part of plain text,
I think we are on a different page here. In some ways the Unicode term
"semantics" is very misleading in this context. What Unicode means by
this fancy term is the character's identity - not its use.
If you use a colon to mark an abbreviation (as in Swedish), you are using a
colon - the use may be very different from how a colon is used
elsewhere, but it does not create a new character.
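A quick illustration in Python, using the standard unicodedata module: no
matter which convention the colon serves, it is the same code point with
the same name and the same general category.

```python
import unicodedata

# The colon marking a Swedish abbreviation and the colon introducing an
# English list are one and the same character:
colon = ":"
print(hex(ord(colon)))              # 0x3a
print(unicodedata.name(colon))      # COLON
print(unicodedata.category(colon))  # Po (Punctuation, other)
```

The differing uses live in the text around the character, not in its encoding.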
Unicode does not encode the semantics of a sentence or word, but
provides a string of characters of known identity that lets a human
reader determine the semantics of that sentence or word as unambiguously
as if that sentence had been reproduced by analog means - that's, in a
nutshell, what Unicode attempts to do.
> and not part of an upper layer protocol. Notably these characters
> should be used to alter the default (ambiguous) character properties of
> the characters they modify, and notably to give them the semantics
> needed for existing Unicode algorithms (general categories:
> punctuation, diacritic; word-breaking properties...)
Character properties define the *default* behavior of a given
character. There are many examples, especially in the context of
punctuation where a character can have different uses. Each use may need
a different treatment by readers (or algorithms).
To handle some behaviors, you may need complex processing (natural
language processing) that mimics what a human reader can do.
There are a few exceptions where characters are disunified based on
properties - the most principled of these involve properties that can't
be modified, such as the bidi property. There are about a dozen
characters that look entirely alike (by design and derivation) yet have
been disunified based on bidi properties - because bidi properties
cannot be overridden.
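For anyone who wants to see the property in question, the Bidi_Class of
any character can be inspected directly. A small Python sketch (the
characters below are merely illustrations of distinct bidi classes, not
the disunified look-alike pairs themselves):

```python
import unicodedata

# Bidi_Class is normative and cannot be overridden by markup or
# higher-level protocols (only the explicit bidi controls affect it).
print(unicodedata.bidirectional("A"))       # L  (Left-to-Right)
print(unicodedata.bidirectional("\u05D0"))  # R  (HEBREW LETTER ALEF, Right-to-Left)
print(unicodedata.bidirectional("0"))       # EN (European Number)
print(unicodedata.bidirectional("\u0660"))  # AN (ARABIC-INDIC DIGIT ZERO, Arabic Number)
```

Because this property cannot be switched after the fact, two look-alikes
needing different bidi behavior can only be separate code points.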
There are a few other cases, usually where a character can be both a
letter and punctuation, where such disunifications were made based on
overridable properties. Here the reason was that this distinction has
such a wide reach (and had to be applied by many basic algorithms) that
breaking the principle of single character identity can be justified.
If a problem is sufficiently severe, then you'd possibly have
justification to disunify. If not, then the answer would be outside the
scope of character encoding.
> adding new variants of existing characters like what was done
> specifically for maths is not a stable long term solution; solutions
> similar to variant selectors however are much more meaningful, and
> will allow for example to make the distinction between a MIDDLE DOT
> punctuation and an ANO TELEIA, and will also allow them to be rendered
> differently (even if there's no requirement to do so).
> This is absolutely not "pseudo-coding".
"Pseudo-coding" refers to making distinctions between characters not in
their basic encoding, but by means of "attributes" such as the selectors
you are suggesting.
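As an aside on the MIDDLE DOT / ANO TELEIA example: those two are already
canonically equivalent (U+0387 has a singleton canonical decomposition to
U+00B7), so any distinction carried by the choice of code point is erased
by normalization - easy to verify in Python:

```python
import unicodedata

ano_teleia = "\u0387"  # GREEK ANO TELEIA
middle_dot = "\u00B7"  # MIDDLE DOT

print(unicodedata.name(ano_teleia))           # GREEK ANO TELEIA
print(unicodedata.decomposition(ano_teleia))  # 00B7 (canonical singleton)

# All normalization forms collapse ANO TELEIA to MIDDLE DOT:
print(unicodedata.normalize("NFC", ano_teleia) == middle_dot)  # True
```

Any mechanism that tried to keep them apart would therefore have to
survive normalization, which is exactly the kind of constraint a
variation-sequence-style proposal would need to address.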