On Tue, Jul 9, 2024, at 15:22, peterGo wrote:
> On Tuesday, July 9, 2024 at 3:42:51 PM UTC-4 David Anderson wrote:
>> I've been going over the spec to clarify finer points of how string vs.
>> []byte behave, I think there may be an unnecessary degree of freedom that
>> could be removed. Either that, or I missed a load-bearing statement that
>> constrains implementations.
>>
>> In https://go.dev/ref/spec#Conversions, `[]rune(str)` is specified as:
>> "Converting a value of a string type to a slice of runes type yields a slice
>> containing the individual Unicode code points of the string."
>>
>> This does not specify the behavior if the string contains invalid UTF-8 byte
>> sequences. If my reading is correct, a compliant implementation would be
>> free to panic() on such a conversion, or implement the conversion in an
>> arbitrary way of its choosing.
>>
>> - Dave
>
> A run-time panic requires explicit mention.
>
> UTF-8 is defined by the Unicode Standard:
> https://www.unicode.org/versions/latest. Does the Unicode Standard allow
> arbitrary conversion behavior?

Not arbitrary, but behaviors other than what Go currently implements, yes. I've been reading the standard to try and get a precise answer. Everything below is my current understanding based on a small amount of time with the standard document, and I may be missing other relevant parts of the standard or its annexes.

Section 3.9[1] defines UTF-8 in terms of non-overlapping well-formed and ill-formed byte sequences. A conformant implementation must not mistakenly decode ill-formed sequences as valid runes, and must decode all well-formed sequences to the correct runes. All correct implementations will identify the same well-formed and ill-formed byte ranges, and produce the same rune sequences for the well-formed ranges.

Ill-formed sequences are less nailed down, aside from the rule that you must not unintentionally map them to valid Unicode characters. You can abort with an error, silently ignore the ill-formed sequence (in practice nobody does, because of the security problems it causes), or produce context-appropriate replacement characters, conventionally (but not necessarily) U+FFFD.

Even if you assume replacement with U+FFFD, a run of 3 invalid bytes can be decoded as 1, 2 or 3 U+FFFD characters. All are valid replacements according to the spec[2], as long as you don't break decoding of valid characters on either side of the invalid bytes.

The standard does point at a specific unambiguous algorithm for handling ill-formed sequences[3], on page 127 of v15.1, under "U+FFFD Substitution of Maximal Subparts". This references a W3C spec[4] which defines a single mapping of an ill-formed sequence to one or more U+FFFD characters. However, the standard explicitly says that following W3C's algorithm is not required for conformance, and doesn't use the word "recommends" either - although I feel it invites you to come to your own conclusions from its prominent placement in the main standard document.

I do not believe that for...range as specified implements the algorithm offered by W3C. The W3C algorithm can emit a single U+FFFD for a run of several ill-formed bytes (not always exactly one per run, but the result is deterministic). Range iteration is specified to always advance the input by one byte per U+FFFD produced. That's fine: it's a conformant behavior, and the spec describes it sufficiently to implement. It just means the Go spec has to describe its behavior explicitly, rather than by reference to W3C or Unicode documents.

To bring it back to the Go spec: as currently specified, if panics are off the table, I believe a conformant implementation could implement `[]rune(notUTF8)` by silently discarding the ill-formed bytes, or by producing U+FFFD replacements in the same way range iteration does, or by producing U+FFFD or any other replacement characters in any other amount. In practice, as you point out, the original Go implementation does the obvious thing and reuses the range iteration behavior.

Would it be reasonable to nail down that `[]rune(foo)` must behave the same as range iteration in all implementations?

- Dave

[1]: https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 124
[2]: https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 126 "Constraints on Conversion Processes"
[3]: https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 127 "U+FFFD Substitution of Maximal Subparts"
[4]: https://encoding.spec.whatwg.org/#utf-8-decoder

>
> peter

-- 
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/a4a9fe53-505a-475f-9d86-8d22302f0428%40app.fastmail.com.