On Sun, 29 May 2022 13:48:49 -0400 LM <lme...@gmail.com> wrote:

Dear LM,
> I like that point. Not a fan of glib and I try to avoid software
> that uses it.
>
> Don't know how good they are, but I've run across several lighter
> utf-8 C libraries:
> https://github.com/cls/libutf
> https://github.com/JuliaStrings/utf8proc
> https://github.com/skeeto/branchless-utf8
> https://github.com/sheredom/utf8.h
> https://github.com/JulienPalard/is_utf8
>
> I wrote my own and use it, so I haven't tested these. Thought they
> were interesting though.

having dived deep into UTF-8 and Unicode, I can at least say that
libutf8proc has an unsafe UTF-8 decoder, as it doesn't catch overlong
encodings. There are also multiple other pitfalls.

I can shamelessly recommend my UTF-8 codec[0], which is part of my
libgrapheme[1] library and also lets you directly count grapheme
clusters (i.e. visible character units made up of one or more
codepoints). libutf8proc also offers grapheme cluster counting (among
other things, but with the aforementioned unsafe UTF-8 decoder) and
used to be the fastest library out there, but with a few tricks (much
smaller LUTs) I managed to make libgrapheme twice as fast. I did a lot
of benchmarking and tweaking and don't see any more room for
improvement in the codec, given you have branches for all the edge
cases.

The branchless UTF-8 decoder is very interesting, but may lead to a
buffer overrun.

With best regards

Laslo

[0]: https://git.suckless.org/libgrapheme/file/src/utf8.c.html
[1]: https://git.suckless.org/libgrapheme/