Re: mbcel module for Gnulib?

Bruno Haible Tue, 11 Jul 2023 15:15:21 -0700

[Removing diffutils-devel from CC.]

Paul Eggert wrote:
> However, mbiter's generality had a performance penalty.
> 
> Some of the performance penalty is due to Gnulib's mbrtoc32 module 
> replacing mbrtoc32 on glibc. As I understand it, this is due to glibc's 
> mishandling of the C locale (it treats non-ASCII bytes as encoding 
> errors). Such a bug should not affect diffutils, as diffutils uses 
> mbrtoc32 only in multi-byte locales. So I'd like a way for diffutils to 
> use the mbrtoc32 module without replacing mbrtoc32 on glibc. In the 
> patch I just installed into diffutils on Savannah, this is done via a 
> conditional "#undef mbrtoc32" (see attached) but this is a hack and 
> there should be a better way.
> 
> More of the performance penalty appears to be the mbiter module's 
> support for arbitrary character encodings that don't happen in practice


I've added a benchmark of mbiter to gnulib, and removed a small
performance issue (mbsinit eating twice as much CPU time as needed).

The timings I see now are:

$ gltests/bench-mbiter abcdefghij 100000

Test       Time          What
----       ----          ----
Test a   user   0.653    ASCII text, C locale
Test b   user   0.618    ASCII text, UTF-8 locale
Test c   user   1.841    French text, C locale
Test d   user   1.487    French text, ISO-8859-1 locale
Test e   user   1.509    French text, UTF-8 locale
Test f   user  15.034    Greek text, C locale
Test g   user   9.708    Greek text, ISO-8859-7 locale
Test h   user   9.871    Greek text, UTF-8 locale
Test i   user   4.584    Chinese text, UTF-8 locale
Test j   user   4.747    Chinese text, GB18030 locale

The performance problems that I see are:

  - glibc's conversion functions are optimized for long sequences
    (think of iconv()). They are not optimized for short invocations
    (one multibyte character or less). This is a long-standing problem
    that no one is attacking.

  - glibc's UTF-8 converter is very slow for texts with many non-ASCII
    characters (tests e, h, i). I don't think we can do anything about it.
    I think test h comes out twice as slow as test i because the same
    text in Greek needs more characters than the same text in Chinese
    (every Hanzi character is worth 2 or more characters from an alphabet).

  - In the C locale (tests a, c, f), conversions of bytes < 0x80 are
    cheap, whereas conversions of bytes >= 0x80 are expensive, because
    in this code path, glibc returns (size_t)-1 and mbrtoc32.c invokes
    hard_locale.
    Can we somehow reduce the need to call hard_locale so often?
    Or create a variant of mbrtoc32 that fetches the value of hard_locale
    from some cache (maybe a __thread variable)?
    Or can hard_locale itself be optimized (through dirty, glibc specific
    hacks)?
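
A minimal sketch of the per-thread cache idea, under stated assumptions:
slow_hard_locale below is a hypothetical stand-in for gnulib's hard_locale
(it only compares the LC_CTYPE locale name against "C" and "POSIX"), and
cached_hard_locale and invalidate_hard_locale_cache are made-up names, not
existing gnulib API. The cache is only valid as long as the locale does not
change, so the caller would have to invalidate it after setlocale().

```c
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for gnulib's hard_locale (): returns true unless
   the current LC_CTYPE locale is named "C" or "POSIX".  */
static bool
slow_hard_locale (void)
{
  const char *name = setlocale (LC_CTYPE, NULL);
  return !(name == NULL
           || strcmp (name, "C") == 0
           || strcmp (name, "POSIX") == 0);
}

/* Per-thread cache: -1 = not yet computed, 0 = not hard, 1 = hard.  */
static _Thread_local int cached_hard = -1;

/* Like slow_hard_locale (), but queries the locale only on the first call
   in each thread (or on the first call after an invalidation).  */
bool
cached_hard_locale (void)
{
  if (cached_hard < 0)
    cached_hard = slow_hard_locale ();
  return cached_hard != 0;
}

/* Assumption: the application calls this after every setlocale () that
   changes LC_CTYPE.  It resets the cache of the calling thread only.  */
void
invalidate_hard_locale_cache (void)
{
  cached_hard = -1;
}
```

The weak point of this approach is the invalidation: a library cannot know
when the application calls setlocale(), and other threads keep their stale
value until they invalidate it themselves.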

I do *not* see a performance problem with character encodings such as
ISO-8859-7 or GB18030 (tests g, j): the figures are comparable with UTF-8.

> I timed mbcel on the Emacs source code and it scanned the input 
> significantly faster than mbiter did.

How can this be? The Emacs source code is mostly ASCII, and the figures
above (test a, b) show that for this case, mbiter is well optimized.

> I'm thinking that mbcel would be useful in Gnulib and in other GNU 
> programs, and that we should create a mbcel module for it in Gnulib.

I'd rather try to copy the worthy optimizations into mbiter, mbuiter.
The reason is that mbcel is not defining a new abstraction; it is thus
somewhere in between a standard mbrtoc32 and an mbiter_multi_next
invocation, and it would become more difficult to choose the right one
if there are three similar interfaces.
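
For comparison, here is a sketch of the lowest of these levels, a plain
mbrtoc32 loop. The function name count_chars and its error policy (an
invalid or incomplete sequence counts as one character per byte) are
assumptions made for this illustration; this is neither mbiter's nor
mbcel's actual code.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Count the characters in the N bytes starting at S, according to the
   current locale's encoding.  An invalid or incomplete byte sequence is
   counted as one character per byte (an assumption of this sketch).  */
size_t
count_chars (const char *s, size_t n)
{
  mbstate_t state;
  size_t count = 0;

  memset (&state, 0, sizeof state);
  while (n > 0)
    {
      char32_t c;
      size_t len = mbrtoc32 (&c, s, n, &state);

      if (len == (size_t) -3)
        {
          /* A further character from earlier input; no bytes consumed.  */
          count++;
          continue;
        }
      if (len == (size_t) -1 || len == (size_t) -2)
        {
          /* Encoding error or incomplete character: skip one byte and
             return to the initial conversion state.  */
          len = 1;
          memset (&state, 0, sizeof state);
        }
      else if (len == 0)
        /* A null character occupies one byte.  */
        len = 1;

      s += len;
      n -= len;
      count++;
    }
  return count;
}
```

Every caller of mbrtoc32 has to repeat this handling of the special return
values; that is the part an iterator abstraction hides.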

Candidates for optimization:

- The C locale handling
  https://sourceware.org/bugzilla/show_bug.cgi?id=19932
  https://sourceware.org/bugzilla/show_bug.cgi?id=29511
  It's now a clear POSIX violation. Would it make sense to get this fixed
  in glibc, so that gnulib's override can be dropped on future glibc
  versions?
  To me, that would seem like a better approach than to have applications
  declare whether they insist on a POSIX compliant mbrtowc or not.

- Is a functional interface faster than one that gets a 'struct' passed
  by reference? I would guess no, since gcc optimizes both cases well,
  especially when inlining. But feel free to prove me wrong.

- Resetting an mbstate_t: Should we define a function
     void mbszero (mbstate_t *);
  that clears the relevant part of an mbstate_t (i.e. 24 bytes instead
  of 128 bytes on BSD systems)?
  Advantage: performance.
  Drawback: Yet another gnulib-invented, nonstandard API.
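
A sketch of what such a function could look like. The name mbszero_sketch
and the macro MBSTATE_RELEVANT_BYTES are hypothetical; the number of
relevant bytes is platform-specific, so here it defaults to the whole
object, which is always safe and gives no speedup. A per-platform override
would have to be verified against that platform's libc.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical: the number of leading bytes of an mbstate_t that actually
   carry conversion state.  Defaults to the whole object.  */
#ifndef MBSTATE_RELEVANT_BYTES
# define MBSTATE_RELEVANT_BYTES sizeof (mbstate_t)
#endif

/* Put *PS into the initial conversion state by clearing only the bytes
   that matter.  */
static inline void
mbszero_sketch (mbstate_t *ps)
{
  memset (ps, 0, MBSTATE_RELEVANT_BYTES);
}
```

Usage would be "mbstate_t state; mbszero_sketch (&state);" in place of
"memset (&state, 0, sizeof state);".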

Bruno
