[Removing diffutils-devel from CC.]

Paul Eggert wrote:
> However, mbiter's generality had a performance penalty.
>
> Some of the performance penalty is due to Gnulib's mbrtoc32 module
> replacing mbrtoc32 on glibc. As I understand it, this is due to glibc's
> mishandling of the C locale (it treats non-ASCII bytes as encoding
> errors). Such a bug should not affect diffutils, as diffutils uses
> mbrtoc32 only in multi-byte locales. So I'd like a way for diffutils to
> use the mbrtoc32 module without replacing mbrtoc32 on glibc. In the
> patch I just installed into diffutils on Savannah, this is done via a
> conditional "#undef mbrtoc32" (see attached) but this is a hack and
> there should be a better way.
>
> More of the performance penalty appears to be the mbiter module's
> support for arbitrary character encodings that don't happen in practice

I've added a benchmark of mbiter to gnulib, and removed a small
performance issue (mbsinit eating twice as much CPU time as needed).
The timings I see now are:

$ gltests/bench-mbiter abcdefghij 100000
Test            Time    What
----            ----    ----
Test a  user   0.653    ASCII text, C locale
Test b  user   0.618    ASCII text, UTF-8 locale
Test c  user   1.841    French text, C locale
Test d  user   1.487    French text, ISO-8859-1 locale
Test e  user   1.509    French text, UTF-8 locale
Test f  user  15.034    Greek text, C locale
Test g  user   9.708    Greek text, ISO-8859-7 locale
Test h  user   9.871    Greek text, UTF-8 locale
Test i  user   4.584    Chinese text, UTF-8 locale
Test j  user   4.747    Chinese text, GB18030 locale

The performance problems that I see are:

  - glibc's conversion functions are optimized for long sequences
    (think of iconv()). They are not optimized for short invocations
    (one multibyte character or less). This is a long-standing problem
    that no one is attacking.

  - glibc's UTF-8 converter is very slow for texts with many non-ASCII
    characters (tests b, e, h, i). I don't think we can do anything
    about it. I think test h comes out twice as slow as test i because
    the same text needs more characters in Greek than in Chinese
    (every Hanzi character is worth 2 or more characters from an
    alphabet).

  - In the C locale (tests a, c, f), conversions of bytes < 0x80 are
    cheap, whereas conversions of bytes >= 0x80 are expensive, because
    in this code path, glibc returns (size_t)-1 and mbrtoc32.c invokes
    hard_locale.
    Can we somehow avoid calling hard_locale so often? Or create a
    variant of mbrtoc32 that fetches the value of hard_locale from some
    cache (maybe a __thread variable)? Or can hard_locale itself be
    optimized (through dirty, glibc-specific hacks)?

I do *not* see a performance problem with character encodings such as
ISO-8859-7 or GB18030 (tests g, j): the figures are comparable with
UTF-8.
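
To make the "cache in a __thread variable" idea concrete, here is a
minimal sketch. It is not gnulib's code: slow_hard_locale_ctype is a
hypothetical stand-in for hard_locale (LC_CTYPE), and
cached_hard_locale_ctype, invalidate_hard_locale_cache, slow_lookups and
mbrtoc32_cached are names invented for illustration. It uses C11
_Thread_local, the standard spelling of __thread.

```c
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical stand-in for gnulib's hard_locale (LC_CTYPE): false when
   the LC_CTYPE locale is named "C" or "POSIX", or cannot be queried.  */
static bool
slow_hard_locale_ctype (void)
{
  const char *name = setlocale (LC_CTYPE, NULL);
  return !(name == NULL
           || strcmp (name, "C") == 0 || strcmp (name, "POSIX") == 0);
}

/* Per-thread cache: -1 = unknown, 0 = not hard, 1 = hard.  */
static _Thread_local int cached_hard = -1;
/* Counts slow-path lookups; for illustration only.  */
static _Thread_local unsigned long slow_lookups = 0;

static bool
cached_hard_locale_ctype (void)
{
  if (cached_hard < 0)
    {
      cached_hard = slow_hard_locale_ctype ();
      slow_lookups++;
    }
  return cached_hard != 0;
}

/* Assumption: the caller invokes this after every locale change.
   It resets the cache of the calling thread only.  */
static void
invalidate_hard_locale_cache (void)
{
  cached_hard = -1;
}

/* Hypothetical mbrtoc32 variant: when the system function reports an
   invalid or incomplete sequence and the locale is not "hard", treat
   the first byte as a single-byte character.  */
static size_t
mbrtoc32_cached (char32_t *pc, const char *s, size_t n, mbstate_t *ps)
{
  size_t ret = mbrtoc32 (pc, s, n, ps);
  if (ret >= (size_t) -2 && n != 0 && !cached_hard_locale_ctype ())
    {
      if (pc != NULL)
        *pc = (unsigned char) *s;
      if (ps != NULL)
        memset (ps, 0, sizeof *ps);   /* back to the initial state */
      return 1;
    }
  return ret;
}
```

The weak point is invalidation: the cached value goes stale as soon as
any code calls setlocale, and a per-thread cache cannot be reset from
another thread. So this only helps programs that set the locale once at
startup, or that can be trusted to call the reset function themselves.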

> I timed mbcel on the Emacs source code and it scanned the input
> significantly faster than mbiter did.

How can this be? The Emacs source code is mostly ASCII, and the figures
above (tests a, b) show that for this case, mbiter is well optimized.

> I'm thinking that mbcel would be useful in Gnulib and in other GNU
> programs, and that we should create a mbcel module for it in Gnulib.

I'd rather try to copy the worthwhile optimizations into mbiter and
mbuiter. The reason is that mbcel is not defining a new abstraction; it
is thus somewhere in between a standard mbrtoc32 and an
mbiter_multi_next invocation, and it would become more difficult to
choose the right one if there are three similar interfaces.

Candidates for optimization:

  - The C locale handling
      https://sourceware.org/bugzilla/show_bug.cgi?id=19932
      https://sourceware.org/bugzilla/show_bug.cgi?id=29511
    It's now a clear POSIX violation. Would it make sense to get this
    fixed in glibc, so that gnulib's override can be dropped on future
    glibc versions? To me, that would seem like a better approach than
    to have applications declare whether they insist on a
    POSIX-compliant mbrtowc or not.

  - Is a functional interface faster than one that gets a 'struct'
    passed by reference? I would guess no, since gcc optimizes both
    cases well, especially when inlining. But feel free to prove me
    wrong.

  - Resetting an mbstate_t: Should we define a function
      void mbszero (mbstate_t *);
    that clears the relevant part of an mbstate_t (i.e. 24 bytes
    instead of 128 bytes on BSD systems)?
    Advantage: performance.
    Drawback: Yet another gnulib-invented, nonstandard API.

Bruno
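
For reference, the mbszero idea could be sketched as below. This is
only an illustration under assumptions: MBSTATE_RELEVANT_SIZE is a
hypothetical macro that a platform-specific configuration would set to
the number of leading bytes the conversion functions actually use (the
24 bytes mentioned above on BSD systems), and the function is named
mbszero_sketch to make clear it is not an existing API. Without such
knowledge it falls back to clearing the whole object, which is always
safe.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical: number of leading bytes of mbstate_t that matter on
   this platform.  Defaults to the full size, i.e. a plain memset.  */
#ifndef MBSTATE_RELEVANT_SIZE
# define MBSTATE_RELEVANT_SIZE sizeof (mbstate_t)
#endif

/* Put *PS into the initial conversion state, touching only the part
   of the object that is assumed to be relevant.  */
static inline void
mbszero_sketch (mbstate_t *ps)
{
  memset (ps, 0, MBSTATE_RELEVANT_SIZE);
}
```

A caller would use it wherever it now writes
memset (&state, 0, sizeof state) before starting a new conversion.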