Hi all,

This is regarding the string reverse kernel, [1] & [2]. Even though it is a
seemingly trivial exercise, I would like to clarify a few things.

In the current PR [1], there are 2 reverse kernels, ASCII and UTF8. I'd
like to get some feedback for the following points.

1. For ASCII reverse, I am throwing an error if a non-ASCII char is
encountered. Should we throw this error, or return garbage output (ex:
a\xD1b --> b\xD1a)?
2. For UTF8 reverse, I am returning some garbage output when malformed UTF8
buffers are present, but the algorithm guarantees that the output buffer
has the same size as the input. IMO, the current algorithm works
efficiently for valid UTF8 chars.
#1 and #2 are inconsistent, and I'd like to know what the best way is to
handle malformed/invalid chars.
3. As @DavidLi pointed out in the PR, a single user-perceived char can span
more than one code point, and so go beyond 4 bytes (ex: some emojis,
combining pairs, etc.). Currently these are not handled.
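
To make the three points concrete, here is a rough Python sketch of the
behaviours in question. This is not the C++ code in the PR; the names
reverse_ascii, utf8_char_len and reverse_utf8, and the `strict` flag, are
made up for illustration. It assumes the UTF8 reverse only inspects the lead
byte of each sequence and does not validate continuation bytes.

```python
def reverse_ascii(data: bytes, strict: bool = True) -> bytes:
    """Reverse a buffer byte by byte.

    strict=True  -> raise on any non-ASCII byte (the "throw an error" option).
    strict=False -> reverse blindly (the "garbage output" option).
    """
    if strict and any(b >= 0x80 for b in data):
        raise ValueError("non-ASCII byte in input")
    return data[::-1]


def utf8_char_len(lead: int) -> int:
    """Sequence length implied by a lead byte; 1 for anything else."""
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    # ASCII, stray continuation bytes and invalid lead bytes are
    # treated as single-byte units.
    return 1


def reverse_utf8(data: bytes) -> bytes:
    """Reverse code point by code point, without validating the input.

    The output always has the same length as the input, even when the
    input is malformed.
    """
    n = len(data)
    out = bytearray(n)
    i = 0
    while i < n:
        # Clamp so a truncated trailing sequence cannot overrun the buffer.
        size = min(utf8_char_len(data[i]), n - i)
        # Copy the whole sequence to its mirrored position.
        out[n - i - size : n - i] = data[i : i + size]
        i += size
    return bytes(out)


if __name__ == "__main__":
    print(reverse_ascii(b"a\xd1b", strict=False))     # point 1, no error
    print(reverse_utf8("añb".encode("utf-8")))        # valid UTF8
    print(reverse_utf8(b"a\xd1b"))                    # point 2, malformed
    print(reverse_utf8("e\u0301x".encode("utf-8")))   # point 3
```

With this sketch, the malformed input a\xD1b keeps its size but comes back
as \xD1ba, because \xD1 looks like a 2-byte lead and drags the following
byte along with it. For point 3, "e" followed by a combining acute accent
(U+0301) and then "x" is valid UTF8 and is reversed without error, but the
accent ends up attached to the "x", since the reversal works on code points
rather than on user-perceived chars.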

Look forward to hearing from you.

Best

[1] https://github.com/apache/arrow/pull/10317
[2] https://issues.apache.org/jira/browse/ARROW-12713

-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
