Re: [go-nuts] Re: Is []rune(invalidUTF8str) underspecified?

'David Anderson' via golang-nuts Tue, 09 Jul 2024 17:24:05 -0700

On Tue, Jul 9, 2024, at 15:22, peterGo wrote:
> 
> On Tuesday, July 9, 2024 at 3:42:51 PM UTC-4 David Anderson wrote:
>> I've been going over the spec to clarify finer points of how string and 
>> []byte behave, and I think there may be an unnecessary degree of freedom that 
>> could be removed. Either that, or I missed a load-bearing statement that 
>> constrains implementations.
>> 
>> In https://go.dev/ref/spec#Conversions, `[]rune(str)` is specified as: 
>> "Converting a value of a string type to a slice of runes type yields a slice 
>> containing the individual Unicode code points of the string."
>> 
>> This does not specify the behavior if the string contains invalid UTF-8 byte 
>> sequences. If my reading is correct, a compliant implementation would be 
>> free to panic() on such a conversion, or implement the conversion in an 
>> arbitrary way of its choosing.
>> 
>> - Dave
> 
> A run-time panic requires explicit mention.
> 
> UTF-8 is defined by the Unicode Standard: 
> https://www.unicode.org/versions/latest. Does the Unicode Standard allow 
> arbitrary conversion behavior?


Not arbitrary, but behaviors other than what Go currently implements, yes. I've 
been reading the standard to try to get a precise answer. Everything below is 
my current understanding, based on a small amount of time with the standard 
document, so I may be missing other relevant parts of the standard or its 
annexes.

Section 3.9[1] defines UTF-8 in terms of non-overlapping well-formed and 
ill-formed byte sequences. A conformant implementation must not mistakenly 
decode ill-formed sequences as valid runes, and must decode all well-formed 
sequences to the correct runes.

All correct implementations will identify the same well-formed and ill-formed 
byte ranges, and produce the same rune sequences for the well-formed ranges. 
Ill-formed sequences are less nailed down, aside from the rule that you 
must not unintentionally map them to valid Unicode characters. You can 
abort with an error, silently ignore the ill-formed sequence (in practice 
nobody does this, because it causes security problems), or produce 
context-appropriate replacement characters, conventionally (but not 
necessarily) U+FFFD.
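
As a rough illustration of the first two options (abort, or silently drop), 
here is a minimal sketch using only the Go standard library. The names 
decodeStrict, decodeDropping and errIllFormed are made up for this example and 
are not part of any spec or library.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// errIllFormed is a hypothetical error value used only in this sketch.
var errIllFormed = errors.New("ill-formed UTF-8")

// decodeStrict aborts with an error if s contains any ill-formed sequence,
// rather than substituting anything.
func decodeStrict(s string) ([]rune, error) {
	if !utf8.ValidString(s) {
		return nil, errIllFormed
	}
	return []rune(s), nil
}

// decodeDropping silently discards ill-formed bytes before converting.
// strings.ToValidUTF8 replaces each run of invalid bytes with its second
// argument, which is the empty string here.
func decodeDropping(s string) []rune {
	return []rune(strings.ToValidUTF8(s, ""))
}

func main() {
	s := "a\xffb" // 0xFF can never appear in well-formed UTF-8

	_, err := decodeStrict(s)
	fmt.Println("strict:", err)

	fmt.Printf("dropping: %q\n", string(decodeDropping(s)))
}
```

Both treat well-formed input identically. They differ only in what happens to 
the ill-formed bytes, which is the part the standard leaves open.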

Even if you assume replacement with U+FFFD, a run of 3 invalid bytes can be 
decoded as 1, 2, or 3 U+FFFD characters. All are valid replacements according to 
the spec[2] as long as you don't break decoding of valid characters on either 
side of the invalid bytes.
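
The Go standard library itself shows two of these choices for a run of 3 
invalid bytes. The sketch below assumes the current gc toolchain behavior for 
[]rune(s); strings.ToValidUTF8 is documented to replace each run of invalid 
bytes with one copy of the replacement. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// countReplacements reports how many U+FFFD runes appear in rs.
func countReplacements(rs []rune) int {
	n := 0
	for _, r := range rs {
		if r == '\uFFFD' {
			n++
		}
	}
	return n
}

// perByteCount converts directly with []rune(s) and counts U+FFFD.
func perByteCount(s string) int {
	return countReplacements([]rune(s))
}

// perRunCount first collapses each run of invalid bytes into a single
// U+FFFD via strings.ToValidUTF8, then counts U+FFFD.
func perRunCount(s string) int {
	return countReplacements([]rune(strings.ToValidUTF8(s, "\uFFFD")))
}

func main() {
	// 0xFF, 0xFE and 0xFD never appear in well-formed UTF-8.
	s := "a\xff\xfe\xfdb"
	fmt.Println("[]rune(s):          ", perByteCount(s)) // 3 with gc
	fmt.Println("strings.ToValidUTF8:", perRunCount(s))  // 1
}
```

In both cases the 'a' and 'b' on either side decode correctly, so neither 
result breaks the decoding of the surrounding valid characters.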

The standard does point at a specific unambiguous algorithm for handling 
ill-formed sequences[3], on page 127 of v15.1 under "U+FFFD Substitution of 
Maximal Subparts". This references a W3C spec[4] which defines a single mapping 
of an ill-formed sequence to one or more U+FFFD characters. However, the 
standard explicitly says that following W3C's algorithm is not required for 
conformance, and doesn't use the word "recommends" either, although I feel its 
prominent placement in the main standard document invites you to draw your own 
conclusions.

I do not believe that for...range as specified implements the algorithm offered 
by W3C. The W3C algorithm can emit a single U+FFFD for a run of several 
ill-formed bytes (not always exactly one per run, but the result is 
deterministic). Range iteration is specified to always advance the input by one 
byte per U+FFFD produced. That's fine: it's a conformant behavior, and the spec 
describes it sufficiently to implement. It just means the Go spec has to 
describe its behavior explicitly, rather than by reference to W3C or Unicode 
documents.
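
To make the difference concrete, here is a small sketch that records the byte 
offsets at which range iteration yields U+FFFD. The input is a truncated 
three-byte sequence; replacementOffsets is a hypothetical helper written for 
this example.

```go
package main

import "fmt"

// replacementOffsets returns the byte offsets at which range iteration over s
// yields U+FFFD. Note that a correctly encoded U+FFFD in s would be reported
// too; the inputs used here contain none.
func replacementOffsets(s string) []int {
	var offs []int
	for i, r := range s {
		if r == '\uFFFD' {
			offs = append(offs, i)
		}
	}
	return offs
}

func main() {
	// "\xe2\x82" is the first two bytes of the three-byte encoding of
	// U+20AC, so it is a valid prefix that is cut short by 'b'.
	s := "a\xe2\x82b"

	// Range iteration advances one byte per U+FFFD, so both bytes of the
	// truncated sequence are reported separately.
	fmt.Println(replacementOffsets(s)) // [1 2]
}
```

Under the maximal-subparts practice, the two bytes E2 82 form one maximal 
subpart and would be replaced by a single U+FFFD, so the two approaches 
disagree on the number of replacement characters for this input.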


To bring it back to the Go spec: as currently specified, if panics are off the 
table, I believe a conformant implementation could implement "[]rune(notUTF8)" 
by silently discarding the ill-formed bytes, or by producing U+FFFD 
replacements in the same way range iteration does, or by producing U+FFFD or 
any other replacement characters in any other amount.

In practice, as you point out, the original Go implementation does the obvious 
thing and reuses the range iteration behavior. Would it be reasonable to nail 
down that `[]rune(foo)` must behave the same as range iteration in all 
implementations?
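
For what it's worth, the equivalence is easy to spot-check against a given 
implementation. The sketch below is only a spot check over a handful of 
hand-picked inputs, not a proof; runesViaRange and sameAsRange are hypothetical 
helpers.

```go
package main

import "fmt"

// runesViaRange builds a rune slice using only range iteration, whose
// handling of ill-formed input the spec does describe.
func runesViaRange(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

// sameAsRange reports whether []rune(s) yields exactly the runes that range
// iteration over s yields.
func sameAsRange(s string) bool {
	a, b := []rune(s), runesViaRange(s)
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	inputs := []string{
		"",                 // empty
		"h\u00e9llo",       // well-formed
		"a\xffb",           // byte that never appears in UTF-8
		"\xe2\x82",         // truncated three-byte sequence
		"\xed\xa0\x80",     // encoded surrogate
		"\xc0\x80",         // overlong encoding
		"\xf4\x90\x80\x80", // beyond U+10FFFF
	}
	for _, s := range inputs {
		fmt.Printf("%q: %v\n", s, sameAsRange(s))
	}
}
```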

- Dave

[1]: https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 124
[2]: https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 126 
"Constraints on Conversion Processes"
[3]: https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 127 "U+FFFD 
Substitution of Maximal Subparts"
[4]: https://encoding.spec.whatwg.org/#utf-8-decoder


> 
> peter
