A little clarification on my point: it's not that a single codepoint
gets encoded with more than four bytes, it's that a grapheme cluster
(what a human perceives as one 'character') might span multiple
codepoints, so reversing the individual codepoints may produce an
unexpected result. For instance, a flag emoji is actually two
codepoints (two regional indicator 'letter' codepoints that spell out
the country code), so naively reversing a US flag will give you an odd
'[SU]' instead.
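
To make that concrete, here is a minimal Python sketch of the effect.
It is not the kernel's implementation, and `reverse_codepoints` is
just an illustrative name; it relies only on Python strings being
sequences of codepoints:

```python
# Regional indicator symbols U+1F1FA ("U") and U+1F1F8 ("S") together
# render as the US flag.
us_flag = "\U0001F1FA\U0001F1F8"


def reverse_codepoints(s: str) -> str:
    """Reverse by codepoint, ignoring grapheme cluster boundaries."""
    return s[::-1]


reversed_flag = reverse_codepoints(us_flag)

# Each codepoint still encodes to exactly 4 bytes of UTF-8, so nothing
# here goes beyond the 4-byte limit ...
assert all(len(cp.encode("utf-8")) == 4 for cp in us_flag)

# ... and the output is valid UTF-8 of the same size, but the pair is
# now "S" followed by "U", which is no longer the US flag.
assert reversed_flag == "\U0001F1F8\U0001F1FA"
assert len(reversed_flag.encode("utf-8")) == len(us_flag.encode("utf-8"))
```

So the result is well-formed, just not what a user looking at the
rendered text would expect.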

Not that this needs to be handled per se right now - but we should
perhaps point it out in the kernel documentation so people know what
to expect.

-David

On 2021/05/17 14:48:52, Antoine Pitrou <anto...@python.org> wrote: 
> 
> On 17/05/2021 at 16:28, Niranda Perera wrote:
> > Hi all,
> > 
> > This is RE: [1] & [2] String reverse kernel. Even though it is a seemingly
> > trivial exercise, I would like to clarify a few things.
> > 
> > In the current PR [1], there are 2 reverse kernels, ASCII and UTF8. I'd
> > like to get some feedback for the following points.
> > 
> > 1. For ASCII reverse, I am throwing an error if a non-ASCII char is
> > encountered. Should we throw this error, or return garbage output (e.g.
> > a\xD1b --> b\xD1a)?
> 
> Since this is taking valid UTF8 input, it should not produce invalid 
> output, so an error should be emitted (IMHO).
> 
> > 2. For UTF8 reverse, I am returning some garbage output when malformed utf8
> > buffers are present but the algorithm guarantees that it would return the
> > same buffer sizes as the input. IMO, the current algorithm works
> > efficiently for valid UTF8 chars.
> 
> Since this is taking invalid UTF8 input, we don't care that the output 
> is invalid as well.
> 
> > 3. As @DavidLi pointed out in the PR, UTF8 chars can go beyond 4 bytes (e.g.
> > emojis, UTF-8 pairs, etc.), and currently these are not handled.
> 
> I'm not aware of that.  Encodings beyond 4 bytes are invalid.
> See for example the IETF RFC for UTF-8:
>    https://datatracker.ietf.org/doc/html/rfc3629#section-4
> or the Unicode standard (chapter 3, p. 124, Table 3-7. Well-Formed UTF-8 
> Byte Sequences):
>    https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf
> 
> Regards
> 
> Antoine.
> 
