Aleksey Tulinov wrote (on 15/10/2014):
> On 15/10/14 10:04, Rowan Collins wrote:

> Rowan,

>> As I said at the top of my first post, the important thing is to capture
>> what those requirements actually are. Just as you'd choose what array
>> functions were needed if you were adding "array support" to a language.


> I'm sorry for not making myself clear. What I'm essentially saying is that I think the "noël" test is synthetic and impractical.

I remain unconvinced of that, and it's just one example. There are plenty of combinations which have no precomposed form; otherwise there would be no need for combining diacritics to exist in the first place.
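To make that concrete, here is a quick sketch (assuming the intl and mbstring extensions are loaded): "q" plus U+0308 COMBINING DIAERESIS has no precomposed code point, so even NFC normalisation leaves it as two code points.

    <?php
    // "q" followed by U+0308 COMBINING DIAERESIS (0xCC 0x88 in UTF-8).
    // Unicode has no precomposed "q with diaeresis", so NFC cannot collapse it.
    $q_diaeresis = "q\xCC\x88";
    $nfc = Normalizer::normalize($q_diaeresis, Normalizer::FORM_C);
    var_dump(mb_strlen($nfc, 'UTF-8')); // int(2) - still a base letter plus a combining mark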

> It's also solvable with a requirement of NFC strings at input, and this is not an implementation defect. I also believe that Hangul is most likely to be precomposed and will work alright.

Requiring a particular normal form on input is not something a programming language can do. The only way you can guarantee NFC form is by performing the normalisation.
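In other words, the guarantee has to be made by normalising at the boundary, something like this sketch (again assuming intl's Normalizer class is available):

    <?php
    // Decomposed "noël": "e" followed by U+0308 COMBINING DIAERESIS.
    $input = "noe\xCC\x88l";

    // You cannot merely demand NFC from callers; the only way to guarantee it is to do it.
    var_dump(Normalizer::isNormalized($input, Normalizer::FORM_C)); // bool(false)
    $nfc = Normalizer::normalize($input, Normalizer::FORM_C);
    var_dump(Normalizer::isNormalized($nfc, Normalizer::FORM_C));   // bool(true)
    var_dump(strlen($input), strlen($nfc));                         // int(6), int(5) - bytes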

> And I have another opinion on the UTF-8 shortest form.

There's no need for opinion there; we can consult the standard: http://www.unicode.org/versions/Unicode6.0.0/

> D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
> D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange. [...] The Unicode Standard uses 8-bit code units in the UTF-8 encoding form [...]
> D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.
> D85a Minimal well-formed code unit subsequence: A well-formed Unicode code unit sequence that maps to a single Unicode scalar value.
> D92 UTF-8 encoding form: The Unicode encoding form that assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length, as specified in Table 3-6 and Table 3-7.
> Before the Unicode Standard, Version 3.1, the problematic “non-shortest form” byte sequences in UTF-8 were those where BMP characters could be represented in more than one way. These sequences are ill-formed, because they are not allowed by Table 3-7.

In short: UTF-8 defines a mapping between sequences of 8-bit "code units" and abstract "Unicode scalar values". Every Unicode scalar value maps to a single, unique sequence of code units, and every Unicode scalar value can be represented. Since U+0308 COMBINING DIAERESIS is a valid Unicode scalar value, a UTF-8 string representing that value can be well-formed. The "shortest form" rule only outlaws alternative, longer encodings of the same scalar value; it does not exclude any scalar value from being encoded.
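To put bytes on that, a sketch; mb_check_encoding comes from the mbstring extension, and I'm assuming its UTF-8 validation, which rejects non-shortest-form sequences:

    <?php
    // U+0308 COMBINING DIAERESIS encodes as 0xCC 0x88 - that already *is* its shortest
    // form, so a string consisting of just this sequence is well-formed UTF-8.
    var_dump(mb_check_encoding("\xCC\x88", 'UTF-8')); // bool(true)

    // 0xC0 0xAF is a two-byte "non-shortest form" encoding of U+002F (/), whose shortest
    // form is the single byte 0x2F. Table 3-7 forbids it, so it is ill-formed.
    var_dump(mb_check_encoding("\xC0\xAF", 'UTF-8')); // bool(false)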

There may be standards for interchange in particular situations which enforce additional constraints, such as requiring all strings to be in NFC, but neither the applicability nor the correct implementation of such standards can define how strings are handled across an entire programming language.


> That aside.
>
> I think requirements are what I was asking about. I'm assuming that your standpoint is that string modification routines are at least required to take entire characters into account, not only code points. Am I correct?

Yes, I think that at least some functions should be available which work on "characters" as users would define them, such as length and perhaps safe truncation.
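The optional intl extension already exposes grapheme-aware versions of some of these, which is the sort of behaviour I mean; a sketch, assuming intl and mbstring are loaded:

    <?php
    // Decomposed "noël": "e" followed by U+0308 COMBINING DIAERESIS.
    $str = "noe\xCC\x88l";

    var_dump(strlen($str));             // int(6) - bytes
    var_dump(mb_strlen($str, 'UTF-8')); // int(5) - code points
    var_dump(grapheme_strlen($str));    // int(4) - "characters" as a user would count them

    // Safe truncation: cutting by grapheme never strands a combining mark.
    var_dump(grapheme_substr($str, 0, 3)); // string(5) "noë" - the diaeresis stays attached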


> What is confusing me is that I think you're seeing it as a major implementation defect. To avoid arguable implementations, I've made a short example in Java:
>
> System.out.println(new StringBuffer("noël").reverse().toString());
>
> It does produce the string "l̈eon", as I would expect.

Why do you expect that? Is this a result which would ever be useful?

To be clear, I am suggesting that we aim to be the language which gets this right, where other languages get it wrong.

> Precomposed "noël" also works as I would expect, producing the string "lëon". What do you think: is this an implementation issue or solely a requirements issue?

Well, you can only define an implementation defect with respect to the original requirement. If the requirement was to reverse "characters", as most users would understand that term, then moving the diacritic to a different letter fails that requirement, because a user would not consider a diacritic a separate character.

If the requirement was to reverse code points, regardless of their meaning, then the implementation is fine, but I would argue that the requirement failed to capture what most users would actually want.
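For what it's worth, a grapheme-aware reverse is not hard to sketch in userland; the grapheme_reverse() name below is just something I've made up for illustration, using PCRE's \X (extended grapheme cluster) so the diacritic stays on its base letter:

    <?php
    // Reverse by extended grapheme clusters rather than by bytes or code points.
    function grapheme_reverse($str) {
        // \X matches one grapheme cluster: a base character plus any combining marks.
        preg_match_all('/\X/u', $str, $matches);
        return implode('', array_reverse($matches[0]));
    }

    $decomposed = "noe\xCC\x88l";            // "noël" with a combining diaeresis
    var_dump(strrev($decomposed));           // mangled: byte reversal breaks the UTF-8 sequences
    var_dump(grapheme_reverse($decomposed)); // string(6) "lëon" - the diaeresis stays on the e

That output is, I'd suggest, what most users would describe as "the string reversed".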

Regards,
--
Rowan Collins
[IMSoP]

