A little clarification on my point: it's not that a single codepoint gets encoded with more than four bytes. Rather, a grapheme cluster (what a human would perceive as one 'character') may consist of multiple codepoints, so reversing the individual codepoints may produce an unexpected result. For instance, a flag emoji is actually two codepoints (two special 'letter' codepoints that represent the country code), so reversing a US flag naively will give you an odd '[SU]' instead.
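To make the flag case concrete, here is a small Python sketch (an illustration only, not the kernel code; `reverse_codepoints` is a made-up helper name). It reverses a string codepoint by codepoint and shows that the two regional-indicator codepoints swap places:

```python
import unicodedata


def reverse_codepoints(s: str) -> str:
    """Reverse a string one codepoint at a time (no grapheme awareness)."""
    return s[::-1]


# The US flag is two codepoints: regional indicators "U" and "S".
us_flag = "\U0001F1FA\U0001F1F8"
reversed_flag = reverse_codepoints(us_flag)

# After reversal the pair reads "S", "U", which is no longer the US flag.
names = [unicodedata.name(c) for c in reversed_flag]
print(names)

# The same thing happens with combining marks: "e" + COMBINING ACUTE ACCENT
# becomes the accent followed by "e", so the accent detaches from its base.
accented = "e\u0301"
reversed_accented = reverse_codepoints(accented)
print([unicodedata.name(c) for c in reversed_accented])
```

Handling this properly would require grapheme cluster segmentation, which the Python standard library does not provide, so the sketch only demonstrates the problem.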
Not that this needs to be handled per se right now - but we should perhaps point it out in the kernel documentation so people know what to expect.

-David

On 2021/05/17 14:48:52, Antoine Pitrou <anto...@python.org> wrote:
>
> On 17/05/2021 at 16:28, Niranda Perera wrote:
> > Hi all,
> >
> > This is RE: [1] & [2] String reverse kernel. Even though it is a seemingly
> > trivial exercise, I would like to clarify a few things.
> >
> > In the current PR [1], there are 2 reverse kernels, ASCII and UTF8. I'd
> > like to get some feedback for the following points.
> >
> > 1. For ASCII reverse, I am throwing an error if a non-ascii char is
> > encountered. Should we throw this error? or return a garbage output (ex:
> > a\xD1b --> b\x1D\a)
>
> Since this is taking valid UTF8 input, it should not produce invalid
> output, so an error should be emitted (IMHO).
>
> > 2. For UTF8 reverse, I am returning some garbage output when malformed utf8
> > buffers are present but the algorithm guarantees that it would return the
> > same buffer sizes as the input. IMO, the current algorithm works
> > efficiently for valid UTF8 chars.
>
> Since this is taking invalid UTF8 input, we don't care that the output
> is invalid as well.
>
> > 3. As @DavidLi pointed out in the PR, UTF8 chars can go beyond 4 bytes (ex:
> > emojis, utf-8 pairs, etc). and currently these are not handled.
>
> I'm not aware of that. Encodings beyond 4 bytes are invalid.
> See for example the IETF RFC for UTF-8:
> https://datatracker.ietf.org/doc/html/rfc3629#section-4
> or the Unicode standard (chapter 3, p. 124, Table 3-7. Well-Formed UTF-8
> Byte Sequences):
> https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf
>
> Regards
>
> Antoine.
>
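
For reference, a rough sketch of what a codepoint-level UTF-8 reverse looks like at the byte level, as discussed in points 2 and 3 above. This is an illustration in Python under my own assumptions, not the code from the PR; `utf8_reverse` is a hypothetical name. It reads each sequence length from the lead byte (at most 4 bytes, per RFC 3629), does no validation, and always returns a buffer of the same size as the input:

```python
def utf8_reverse(data: bytes) -> bytes:
    """Reverse a UTF-8 buffer codepoint by codepoint (hypothetical sketch).

    Malformed input is not rejected; it just yields malformed output
    of the same length.
    """
    n = len(data)
    out = bytearray(n)
    i = 0
    while i < n:
        lead = data[i]
        if lead < 0x80:
            size = 1            # 0xxxxxxx: ASCII
        elif lead >> 5 == 0b110:
            size = 2            # 110xxxxx
        elif lead >> 4 == 0b1110:
            size = 3            # 1110xxxx
        elif lead >> 3 == 0b11110:
            size = 4            # 11110xxx: longest valid form
        else:
            size = 1            # stray continuation or invalid lead byte
        size = min(size, n - i)  # do not run past a truncated sequence
        # Copy the whole sequence to its mirrored position, bytes in order.
        out[n - i - size : n - i] = data[i : i + size]
        i += size
    return bytes(out)


print(utf8_reverse("aéb".encode("utf-8")).decode("utf-8"))
# An emoji outside the BMP still takes exactly four bytes.
print(len("\U0001F600".encode("utf-8")))
```

An ASCII-only kernel would instead check every byte for `>= 0x80` and raise an error, as suggested in the reply to point 1.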