Re: upper() / lower() for utf8 strings

Maarten Breddels Wed, 23 Dec 2020 11:48:15 -0800

Hi Sagnik,

it might be worth taking a look at https://github.com/apache/arrow/pull/7449
(that kernel code of mine is a bit cumbersome).
TL;DR version:
unilib is faster than utf8proc, but there are licensing issues with unilib.
We instead use a LUT to accelerate, at the cost of some memory (would be
great if we at least shared the LUTs).
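
To illustrate the LUT idea, here is a minimal sketch, not the actual Arrow kernel code. It assumes a table covering only the Basic Multilingual Plane (code points below 0x10000) that is filled once from a slower per-code-point function. The names SlowToUpper, UpperLut, FastToUpper and kLutSize are hypothetical; SlowToUpper is a stand-in for a library call such as utf8proc_toupper and only handles ASCII and Latin-1 letters so that the example stays self-contained.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a slow, library-backed lookup
// (in practice something like utf8proc_toupper would go here).
// Only ASCII and Latin-1 lowercase letters are handled in this sketch.
uint32_t SlowToUpper(uint32_t cp) {
  if (cp >= 'a' && cp <= 'z') return cp - 0x20;
  // 0xE0..0xFE are lowercase Latin-1 letters, except 0xF7 (division sign).
  if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7) return cp - 0x20;
  return cp;
}

// Assumption: the table covers the BMP only (65536 entries).
constexpr uint32_t kLutSize = 0x10000;

// Builds the table on first use; later calls return the same shared table.
const std::vector<uint32_t>& UpperLut() {
  static const std::vector<uint32_t> lut = [] {
    std::vector<uint32_t> table(kLutSize);
    for (uint32_t cp = 0; cp < kLutSize; ++cp) {
      table[cp] = SlowToUpper(cp);
    }
    return table;
  }();
  return lut;
}

// Fast path: one indexed load for BMP code points, slow path otherwise.
uint32_t FastToUpper(uint32_t cp) {
  return cp < kLutSize ? UpperLut()[cp] : SlowToUpper(cp);
}
```

With 32-bit entries this table takes 256 KiB, which is the memory cost mentioned above; a separate table for lower() would double that unless the tables are shared between users.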


cpp/src/arrow/util/utf8.h
and cpp/src/arrow/compute/kernels/scalar_string.cc might be useful to take
a look at.
At https://issues.apache.org/jira/browse/ARROW-555 there was a bit of
discussion on using the same codebase for the arrow kernel and Gandiva, but
that never got off the ground.
So yes, if you can do what Wes suggests, that would be great.

cheers,

Maarten Breddels
Software engineer / consultant / data scientist
Python / C++ / Javascript / Jupyter
www.maartenbreddels.com / vaex.io
maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>

On Wed, Dec 23, 2020 at 4:48 PM Wes McKinney <wesmck...@gmail.com> wrote:

> It might be worthwhile to see if some reusable templates can be
> assembled that can be employed in both places
>
> On Tue, Dec 22, 2020 at 5:47 PM Neal Richardson
> <neal.p.richard...@gmail.com> wrote:
> >
> > FWIW the C++ compute library now uses
> > https://github.com/JuliaStrings/utf8proc, so assuming it does all of the
> > things you want, it could save you some trouble if you used it in Gandiva
> > too--cmake is already set up to use it.
> >
> > Neal
> >
> > On Tue, Dec 22, 2020 at 3:41 PM Sagnik Chakraborty <sagn...@dremio.com>
> > wrote:
> >
> > > We are looking to implement upper() / lower() for non-ASCII characters.
> > > The current Gandiva implementation handles upper() / lower() only for
> > > standard ASCII characters.
> > >
> > > For the implementation in Gandiva, I went through a few articles and
> > > answers on StackOverflow and the top answer to this question
> > > <https://stackoverflow.com/questions/36897781/how-to-uppercase-lowercase-utf-8-characters-in-c>
> > > suggests that there is no standard way to do Unicode case conversion in
> > > C/C++ and that an external library like ICU
> > > <https://unicode-org.github.io/icu-docs/#/icu4c/> is necessary to
> > > ensure guaranteed Unicode case conversion.
> > >
> > > So, I just wanted to know that while adding any external library in
> > > Gandiva, what are the issues that we need to take care of in order to
> > > ensure that we do not break existing code and not sacrifice on
> > > performance as well? Is there any existing library that we can make use
> > > of to go about solving this problem? Any suggestions would be welcome.
> > >
> > > Regards,
> > > Sagnik
>
