On 17/05/2021 at 16:28, Niranda Perera wrote:
Hi all,

This is RE: [1] & [2] String reverse kernel. Even though it is a seemingly
trivial exercise, I would like to clarify a few things.

In the current PR [1], there are 2 reverse kernels, ASCII and UTF8. I'd
like to get some feedback for the following points.

1. For ASCII reverse, I am throwing an error if a non-ASCII char is
encountered. Should we throw this error, or return garbage output (e.g.
a\xD1b --> b\xD1a)?

Since this is taking valid UTF8 input, it should not produce invalid output, so an error should be emitted (IMHO).
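
To make the "error rather than garbage" option concrete, here is a minimal sketch in Python rather than the C++ of the PR. The name `ascii_reverse` is hypothetical and is not the kernel's actual API; it only illustrates rejecting any byte outside the ASCII range before reversing.

```python
def ascii_reverse(data: bytes) -> bytes:
    """Reverse a byte string, rejecting any non-ASCII byte.

    Hypothetical helper for illustration only, not the Arrow kernel.
    """
    for b in data:
        if b >= 0x80:
            # Reversing byte-by-byte would split a multi-byte UTF-8
            # sequence, so refuse instead of emitting invalid output.
            raise ValueError(f"non-ASCII byte 0x{b:02X} in input")
    return data[::-1]
```

With this sketch, `ascii_reverse(b"abc")` yields `b"cba"`, while `ascii_reverse(b"a\xd1b")` raises instead of returning a truncated sequence.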

2. For UTF8 reverse, I am returning some garbage output when malformed utf8
buffers are present but the algorithm guarantees that it would return the
same buffer sizes as the input. IMO, the current algorithm works
efficiently for valid UTF8 chars.

Since this is taking invalid UTF8 input, we don't care that the output is invalid as well.
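
For illustration, a rough Python sketch of a codepoint-wise reverse that always returns a buffer of the same size, even on malformed input. This is an assumption about how such an algorithm could look, not the code in the PR; `utf8_char_length` and `utf8_reverse` are hypothetical names.

```python
def utf8_char_length(lead: int) -> int:
    """Guess the encoded length of a character from its lead byte."""
    if lead < 0x80:
        return 1
    if lead & 0xE0 == 0xC0:
        return 2
    if lead & 0xF0 == 0xE0:
        return 3
    if lead & 0xF8 == 0xF0:
        return 4
    # Stray continuation byte or invalid lead byte: treat as one unit.
    return 1


def utf8_reverse(data: bytes) -> bytes:
    """Reverse by code point, copying each chunk to its mirrored position."""
    n = len(data)
    out = bytearray(n)
    i = 0
    while i < n:
        # Clamp so a truncated trailing sequence cannot overrun the buffer.
        length = min(utf8_char_length(data[i]), n - i)
        out[n - i - length : n - i] = data[i : i + length]
        i += length
    return bytes(out)
```

On valid input the result is valid UTF-8 with the code points in reverse order. On malformed input the bytes may be rearranged into something equally malformed, but the output length always equals the input length.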

3. As @DavidLi pointed out in the PR, UTF8 chars can go beyond 4 bytes (e.g.
emojis, utf-8 pairs, etc.), and currently these are not handled.

I'm not aware of that.  Encodings beyond 4 bytes are invalid.
See for example the IETF RFC for UTF-8:
  https://datatracker.ietf.org/doc/html/rfc3629#section-4
or the Unicode standard (chapter 3, p. 124, Table 3-7. Well-Formed UTF-8 Byte Sequences):
  https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf
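
A quick way to check this, sketched in Python: the highest code point encodes to 4 bytes, and a 5-byte form is rejected by a strict decoder. The emoji case shown is my guess at what was meant in the PR: some emojis are sequences of several code points, so the whole sequence is longer than 4 bytes while each code point is still at most 4.

```python
# U+10FFFF is the highest Unicode code point; it needs exactly 4 bytes.
max_cp = chr(0x10FFFF).encode("utf-8")
max_cp_len = len(max_cp)

# A 5-byte form (lead byte 0xF8) is not well-formed UTF-8.
five_byte = b"\xf8\x88\x80\x80\x80"
try:
    five_byte.decode("utf-8")
    five_byte_accepted = True
except UnicodeDecodeError:
    five_byte_accepted = False

# Thumbs-up plus a skin-tone modifier: two code points, 4 bytes each.
seq = "\U0001F44D\U0001F3FD"
seq_total_len = len(seq.encode("utf-8"))
seq_per_codepoint = [len(c.encode("utf-8")) for c in seq]
```

Reversing such a sequence code point by code point still produces valid UTF-8; whether the reversed text renders as the same emoji is a separate question from encoding validity.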

Regards

Antoine.
