I think the current compute functions for casting from other String types
can be improved in several areas. I elaborate on this in
<https://github.com/apache/arrow/issues/46128>.

Here is a summary:

1- The current cast compute functions size the view buffers for
offset+length elements. However, the allocation can be reduced to length
elements. This requires copying the bitmap buffer, but I think that is an
acceptable overhead to avoid the larger allocation.

2- Overflow is checked after the buffers are allocated; however, it can be
checked before the allocation.

Aside from that, it is currently possible to create a utf8() string array
with more than max int32_t of data when casting from StringView types to
the utf8() and binary() types.

3- The current algorithms treat large strings with more than max int32_t of
data in total (with the length of each element less than max int32_t) as
invalid when casting to the utf8_view() type. I think a simple
implementation is possible (at least for correctness).

4- The current implementation may cause memory bloat. Is it worthwhile to
implement a garbage collector to address this?

5- One of the key features of StringView types is that they prevent
duplication. Casting from FixedSizeBinary to BinaryView types is a good
case for avoiding duplication.
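
To illustrate point 2, here is a minimal sketch of checking for overflow
before any output buffer is allocated. It is not the existing Arrow code:
`ViewLike` and `TotalSizeIfFits` are hypothetical names, only the element
lengths are modelled, and null slots are ignored for simplicity. It also
walks only the `length` elements starting at `offset`, in the spirit of
point 1.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Hypothetical stand-in for one string-view element; only its length is
// needed to decide whether the cast can succeed.
struct ViewLike {
  int32_t length;
};

// Sums the lengths of the `length` elements starting at `offset`.
// Returns the total if it fits in int32_t offsets, or std::nullopt if the
// cast to a 32-bit-offset type would overflow. Nothing is allocated here,
// so the caller can reject the cast before reserving any buffer.
std::optional<int32_t> TotalSizeIfFits(const std::vector<ViewLike>& views,
                                       int64_t offset, int64_t length) {
  int64_t total = 0;  // 64-bit accumulator so the sum itself cannot wrap
  for (int64_t i = offset; i < offset + length; ++i) {
    total += views[static_cast<std::size_t>(i)].length;
    if (total > std::numeric_limits<int32_t>::max()) {
      return std::nullopt;
    }
  }
  return static_cast<int32_t>(total);
}
```

A caller would run this check first and allocate the data buffer only if a
value is returned; otherwise it can return an error without having
allocated anything.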
