Aleksey Tulinov wrote (on 15/10/2014):
> Rowan,
> On 15/10/14 10:04, Rowan Collins wrote:
>> As I said at the top of my first post, the important thing is to capture
>> what those requirements actually are. Just as you'd choose what array
>> functions were needed if you were adding "array support" to a language.
> I'm sorry for not making myself clear. What I'm essentially saying is
> that I think the "noël" test is synthetic and impractical.
I remain unconvinced on that, and it's just one example. There are
plenty of forms which don't have a combined form, otherwise there would
be no need for combining diacritics to exist in the first place.
> It's also solvable with a requirement of NFC strings at input, and this
> is not an implementation defect. I also believe that Hangul is most
> likely to be precomposed and will work alright.
Requiring a particular normal form on input is not something a
programming language can do. The only way you can guarantee NFC form is
by performing the normalisation.
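For illustration, that normalisation is a one-liner in the Java used later in this thread (a sketch; java.text.Normalizer is the standard JDK API for this):

```java
import java.text.Normalizer;

public class NfcDemo {
    public static void main(String[] args) {
        // "noe\u0308l": 'e' followed by U+0308 COMBINING DIAERESIS (decomposed form)
        String decomposed = "noe\u0308l";
        // NFC composes e + U+0308 into the single code point U+00EB (ë)
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(decomposed.length()); // 5 UTF-16 code units
        System.out.println(nfc.length());        // 4 UTF-16 code units
        System.out.println(nfc.equals("no\u00EBl")); // true
    }
}
```

The point being: the language can only *guarantee* NFC by running exactly this kind of step itself; it cannot assume callers did.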
> And I have another opinion on UTF-8 shortest form.
There's no need for opinion there, we can consult the standard.
http://www.unicode.org/versions/Unicode6.0.0/
> D76 Unicode scalar value: Any Unicode code point except high-surrogate
> and low-surrogate code points.
> D79 A Unicode encoding form assigns each Unicode scalar value to a
> unique code unit sequence.
> D77 Code unit: The minimal bit combination that can represent a unit
> of encoded text for processing or interchange. [...] The Unicode
> Standard uses 8-bit code units in the UTF-8 encoding form [...]
> D85a Minimal well-formed code unit subsequence: A well-formed Unicode
> code unit sequence that maps to a single Unicode scalar value.
> D92 UTF-8 encoding form: The Unicode encoding form that assigns each
> Unicode scalar value to an unsigned byte sequence of one to four bytes
> in length, as specified in Table 3-6 and Table 3-7.
> Before the Unicode Standard, Version 3.1, the problematic
> “non-shortest form” byte sequences in UTF-8 were those where BMP
> characters could be represented in more than one way. These sequences
> are ill-formed, because they are not allowed by Table 3-7.
In short: UTF-8 defines a mapping from sequences of 8-bit "code
units" to abstract "Unicode scalar values". Every Unicode scalar value
maps to a single unique sequence of code units, but all Unicode scalar
values can be represented. Since "U+0308 COMBINING DIAERESIS" is a valid
Unicode scalar value, a UTF-8 string representing that value can be
well-formed. It is only alternative representations of the same Unicode
scalar value which must be in shortest form.
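To make that concrete: U+0308 encodes to the two-byte sequence 0xCC 0x88, which is already the shortest form for that scalar value and therefore well-formed (a quick check, again in Java):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // U+0308 COMBINING DIAERESIS, on its own, encodes to 0xCC 0x88.
        // That IS the shortest form for this scalar value, so the
        // sequence is well-formed UTF-8.
        byte[] bytes = "\u0308".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.printf("0x%02X ", b);
        }
        // prints: 0xCC 0x88
    }
}
```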
There may be standards for interchange in particular situations which
enforce additional constraints, such as that all strings should be in
NFC, but the applicability or correct implementation of such standards
is not something that you can use to define handling in an entire
programming language.
> That aside.
> I think requirements are what I was asking about. I'm assuming that
> your standpoint is that string modification routines are at least
> required to take into account entire characters, not only code points.
> Am I correct?
Yes, I think that at least some functions should be available which work
on "characters" as users would define them, such as length and perhaps
safe truncation.
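As a sketch of what a character-based length could look like (using the JDK's java.text.BreakIterator, which iterates user-perceived character boundaries; the helper name graphemeCount is mine):

```java
import java.text.BreakIterator;

public class GraphemeLength {
    // Count user-perceived characters (grapheme clusters)
    // rather than UTF-16 code units.
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String decomposed = "noe\u0308l"; // 'e' + COMBINING DIAERESIS
        System.out.println(decomposed.length());       // 5 code units
        System.out.println(graphemeCount(decomposed)); // 4 "characters" as a user sees them
    }
}
```

Note the combining mark never starts a new cluster, so the decomposed and precomposed spellings both report a length of 4.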
> What is confusing me is that I think you're seeing it as a major
> implementation defect. To avoid arguable implementations, I've made a
> short example in Java:
> System.out.println(new StringBuffer("noël").reverse().toString());
> It does produce the string "l̈eon", as I would expect.
Why do you expect that? Is this a result which would ever be useful?
To be clear, I am suggesting that we aim to be the language which gets
this right, where other languages get it wrong.
> Precomposed "noël" also works as I would expect, producing the string
> "lëon". What do you think: is this an implementation issue or solely a
> requirements issue?
Well, you can only define an implementation defect with respect to the
original requirement. If the requirement was to reverse "characters", as
most users would understand that term, then moving the diacritic to a
different letter fails that requirement, because a user would not
consider a diacritic a separate character.
If the requirement was to reverse code points, regardless of their
meaning, then the implementation is fine, but I would argue that the
requirement failed to capture what most users would actually want.
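A reversal that meets the "characters" requirement is perfectly implementable; here is a sketch that splits on grapheme-cluster boundaries with the JDK's java.text.BreakIterator before reversing (the helper name reverseGraphemes is mine):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class GraphemeReverse {
    // Reverse by grapheme clusters so combining marks stay
    // attached to their base letters.
    static String reverseGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        List<String> clusters = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            clusters.add(s.substring(start, end));
        }
        Collections.reverse(clusters);
        return String.join("", clusters);
    }

    public static void main(String[] args) {
        // Decomposed input: the diaeresis stays on the 'e'
        System.out.println(reverseGraphemes("noe\u0308l")); // lëon
    }
}
```

With this, both the decomposed and precomposed spellings of "noël" reverse to "lëon", which is what most users would call reversing the string.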
Regards,
--
Rowan Collins
[IMSoP]
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php