> > Well, I guess it was a bad argument because the utf8proc version looks well > > optimised anyway and we already use in multiple places. > > > > I'm done with the first version of a patch to make this switch. Please find > > it in the attachments. I also added a unit test to confirm. > > > > I think since this function depends on utf8proc it should be implemented in > > utf8proc.c file. Hence utf_width.c can be removed entirely. I guess the > > main reason it was in a separate file was to keep the dataset away from > > other code. > > > > By the way, I also found that before traversing the string, this function > > calls svn_utf__cstring_is_valid() to check validity. I think > > utf8proc_iterate() does the same check, but I really am not sure. > > svn_utf__cstring_is_valid() skips all first octets with values >0x80. Then > > it basically does the same work utf8proc_iterate() would. > > Thank you for preparing the patch! I have reviewed it and it looks good to me. > > I agree with your analysis of utf8proc_interate: it is not neccessary > to check the string first since utf8proc_iterate also return if the > string is invalid. > > I made a very simple test to see the performance: Calling > svn_utf_cstring_utf8_width() 1000 times with the fat_emojis string. > The basic version of the patch is around 25% faster compared to the > original implementation. If I remove the call to > svn_utf__cstring_is_valid, the patch is 33% faster (with a longer > string, I expect this difference to be larger).
Thanks so much for your review and feedback! I believe the 25% difference comes from bsearch that wc_width was doing, not the iteration itself. About svn_utf__cstring_is_valid() that it does, I think it's better we leave it there just because "no one knows how it'd break". -- Timofei Zhakov

