FWIW the C++ compute library now uses utf8proc (https://github.com/JuliaStrings/utf8proc). Assuming it does everything you need, using it in Gandiva too could save you some trouble, since CMake is already set up to use it.

Neal

On Tue, Dec 22, 2020 at 3:41 PM Sagnik Chakraborty <sagn...@dremio.com> wrote:

> We are looking to implement upper() / lower() for non-ASCII characters.
> The current Gandiva implementation handles upper() / lower() only for
> standard ASCII characters.
>
> For the implementation in Gandiva, I went through a few articles and
> answers on StackOverflow and the top answer to this question
> <https://stackoverflow.com/questions/36897781/how-to-uppercase-lowercase-utf-8-characters-in-c>
> suggests that there is no standard way to do Unicode case conversion in
> C/C++ and that an external library like ICU
> <https://unicode-org.github.io/icu-docs/#/icu4c/> is necessary to ensure
> guaranteed Unicode case conversion.
>
> So, I just wanted to know that while adding any external library in
> Gandiva, what are the issues that we need to take care of in order to
> ensure that we do not break existing code and not sacrifice on performance
> as well? Is there any existing library that we can make use of to go about
> solving this problem? Any suggestions would be welcome.
>
> Regards,
> Sagnik
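
To illustrate the limitation described in the quoted message, here is a minimal sketch of an ASCII-only upper(). It is not Gandiva's actual code: the function name ascii_upper is hypothetical and exists only for this example. It shows why a byte-wise conversion leaves non-ASCII UTF-8 text untouched, which is the gap a Unicode-aware library would have to fill.

```cpp
#include <string>

// Hypothetical helper for illustration only (not taken from Gandiva).
// Upper-cases the ASCII letters a-z and copies every other byte through
// unchanged, so multi-byte UTF-8 sequences are never modified.
std::string ascii_upper(const std::string& input) {
  std::string out = input;
  for (char& c : out) {
    if (c >= 'a' && c <= 'z') {
      c = static_cast<char>(c - 'a' + 'A');
    }
  }
  return out;
}
```

For example, "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9, neither of which falls in a-z, so ascii_upper("caf\xC3\xA9") returns "CAF\xC3\xA9" ("CAFé") rather than "CAFÉ". Handling that case needs the input decoded into code points, each one mapped through Unicode case tables, and the result re-encoded.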