Re: Using libutf8proc for svn_utf_cstring_utf8_width()

Timofei Zhakov Sun, 17 May 2026 02:11:42 -0700

> > Well, I guess it was a bad argument because the utf8proc version looks well 
> > optimised anyway and we already use in multiple places.
> >
> > I'm done with the first version of a patch to make this switch. Please find 
> > it in the attachments. I also added a unit test to confirm.
> >
> > I think since this function depends on utf8proc it should be implemented in 
> > utf8proc.c file. Hence utf_width.c can be removed entirely. I guess the 
> > main reason it was in a separate file was to keep the dataset away from 
> > other code.
> >
> > By the way, I also found that before traversing the string, this function 
> > calls svn_utf__cstring_is_valid() to check validity. I think 
> > utf8proc_iterate() does the same check, but I really am not sure. 
> > svn_utf__cstring_is_valid() skips all first octets with values >0x80. Then 
> > it basically does the same work utf8proc_iterate() would.
>
> Thank you for preparing the patch! I have reviewed it and it looks good to me.
>
> I agree with your analysis of utf8proc_interate: it is not neccessary
> to check the string first since utf8proc_iterate also return if the
> string is invalid.
>
> I made a very simple test to see the performance: Calling
> svn_utf_cstring_utf8_width() 1000 times with the fat_emojis string.
> The basic version of the patch is around 25% faster compared to the
> original implementation. If I remove the call to
> svn_utf__cstring_is_valid, the patch is 33% faster (with a longer
> string, I expect this difference to be larger).


Thanks so much for your review and feedback!

I believe the 25% difference comes from bsearch that wc_width was
doing, not the iteration itself.

About svn_utf__cstring_is_valid() that it does, I think it's better we
leave it there just because "no one knows how it'd break".

--
Timofei Zhakov

Re: Using libutf8proc for svn_utf_cstring_utf8_width()

Reply via email to