Aleksey Tulinov wrote (on 15/10/2014):
> Rowan,
> On 15/10/14 10:04, Rowan Collins wrote:
>> As I said at the top of my first post, the important thing is to capture
>> what those requirements actually are. Just as you'd choose what array
>> functions were needed if you were adding "array support" to a language.
> I'm sorry for not making myself clear. What I'm essentially saying is
> that I think the "noël" test is synthetic and impractical.
I remain unconvinced on that, and it's just one example. There are
plenty of forms which don't have a combined form, otherwise there would
be no need for combining diacritics to exist in the first place.
> It's also solvable with a requirement of NFC strings at input, and this
> is not an implementation defect. I also believe that Hangul is most
> likely to be precomposed and will work alright.
Requiring a particular normal form on input is not something a
programming language can do. The only way you can guarantee NFC form is
by performing the normalisation.
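For illustration, that normalisation is a one-liner in the Java used later in this thread (a sketch; java.text.Normalizer is the standard JDK API for this):

```java
import java.text.Normalizer;

public class NfcDemo {
    public static void main(String[] args) {
        // "noe\u0308l": 'e' followed by U+0308 COMBINING DIAERESIS (decomposed form)
        String decomposed = "noe\u0308l";
        // NFC composes e + U+0308 into the single code point U+00EB (ë)
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(decomposed.length()); // 5 UTF-16 code units
        System.out.println(nfc.length());        // 4 UTF-16 code units
        System.out.println(nfc.equals("no\u00EBl")); // true
    }
}
```

The point being: the language can only *guarantee* NFC by running exactly this kind of step itself; it cannot assume callers did.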
> And I have another opinion on UTF-8 shortest form.
There's no need for opinion there, we can consult the standard.
http://www.unicode.org/versions/Unicode6.0.0/
> D76 Unicode scalar value: Any Unicode code point except high-surrogate
> and low-surrogate code points.
> D79 A Unicode encoding form assigns each Unicode scalar value to a
> unique code unit sequence.
> D77 Code unit: The minimal bit combination that can represent a unit
> of encoded text for processing or interchange. [...] The Unicode
> Standard uses 8-bit code units in the UTF-8 encoding form [...]
> D85a Minimal well-formed code unit subsequence: A well-formed Unicode
> code unit sequence that maps to a single Unicode scalar value.
> D92 UTF-8 encoding form: The Unicode encoding form that assigns each
> Unicode scalar value to an unsigned byte sequence of one to four bytes
> in length, as specified in Table 3-6 and Table 3-7.
> Before the Unicode Standard, Version 3.1, the problematic
> “non-shortest form” byte sequences in UTF-8 were those where BMP
> characters could be represented in more than one way. These sequences
> are ill-formed, because they are not allowed by Table 3-7.
In short: UTF-8 defines a mapping from sequences of 8-bit "code
units" to abstract "Unicode scalar values". Every Unicode scalar value
maps to a single unique sequence of code units, but all Unicode scalar
values can be represented. Since "U+0308 COMBINING DIAERESIS" is a valid
Unicode scalar value, a UTF-8 string representing that value can be
well-formed. It is only alternative representations of the same Unicode
scalar value which must be in shortest form.
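To make that concrete: U+0308 encodes to the two-byte sequence 0xCC 0x88, which is already the shortest form for that scalar value and therefore well-formed (a quick check, again in Java):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // U+0308 COMBINING DIAERESIS, on its own, encodes to 0xCC 0x88.
        // That IS the shortest form for this scalar value, so the
        // sequence is well-formed UTF-8.
        byte[] bytes = "\u0308".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.printf("0x%02X ", b);
        }
        // prints: 0xCC 0x88
    }
}
```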
There may be standards for interchange in particular situations which
enforce additional constraints, such as that all strings should be in
NFC, but the applicability or correct implementation of such standards
is not something that you can use to define handling in an entire
programming language.
> That aside.
> I think requirements are what I was asking about. I'm assuming that
> your standpoint is that string modification routines are at least
> required to take into account entire characters, not only code points.
> Am I correct?
Yes, I think that at least some functions should be available which work
on "characters" as users would define them, such as length and perhaps
safe truncation.
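As a sketch of what a character-based length could look like (using the JDK's java.text.BreakIterator, which iterates user-perceived character boundaries; the helper name graphemeCount is mine):

```java
import java.text.BreakIterator;

public class GraphemeLength {
    // Count user-perceived characters (grapheme clusters)
    // rather than UTF-16 code units.
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String decomposed = "noe\u0308l"; // 'e' + COMBINING DIAERESIS
        System.out.println(decomposed.length());       // 5 code units
        System.out.println(graphemeCount(decomposed)); // 4 "characters" as a user sees them
    }
}
```

Note the combining mark never starts a new cluster, so the decomposed and precomposed spellings both report a length of 4.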
> What is confusing me is that I think you're seeing it as a major
> implementation defect. To avoid arguable implementations, I've made a
> short example in Java:
> System.out.println(new StringBuffer("noël").reverse().toString());
> It does produce the string "l̈eon", as I would expect.
Why do you expect that? Is this a result which would ever be useful?
To be clear, I am suggesting that we aim to be the language which gets
this right, where other languages get it wrong.
> Precomposed "noël" also works as I would expect, producing the string
> "lëon". What do you think: is this an implementation issue or solely a
> requirements issue?
Well, you can only define an implementation defect with respect to the
original requirement. If the requirement was to reverse "characters", as
most users would understand that term, then moving the diacritic to a
different letter fails that requirement, because a user would not
consider a diacritic a separate character.
If the requirement was to reverse code points, regardless of their
meaning, then the implementation is fine, but I would argue that the
requirement failed to capture what most users would actually want.
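A reversal that meets the "characters" requirement is perfectly implementable; here is a sketch that splits on grapheme-cluster boundaries with the JDK's java.text.BreakIterator before reversing (the helper name reverseGraphemes is mine):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class GraphemeReverse {
    // Reverse by grapheme clusters so combining marks stay
    // attached to their base letters.
    static String reverseGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        List<String> clusters = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            clusters.add(s.substring(start, end));
        }
        Collections.reverse(clusters);
        return String.join("", clusters);
    }

    public static void main(String[] args) {
        // Decomposed input: the diaeresis stays on the 'e'
        System.out.println(reverseGraphemes("noe\u0308l")); // lëon
    }
}
```

With this, both the decomposed and precomposed spellings of "noël" reverse to "lëon", which is what most users would call reversing the string.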
Regards,
--
Rowan Collins
[IMSoP]
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php