Re: mbcel module for Gnulib?, incomplete multibyte sequences

Paul Eggert Thu, 27 Jul 2023 12:47:18 -0700

On 2023-07-24 17:01, Bruno Haible wrote:

biterf-bench-tests mbrtoc32-regular", together
with your mbcel benchmark patch from
https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00064.html
modified to use mbszero() (since some of the assumptions regarding
mbstate_t turned out to be unsafe)

Since that patch I fixed mbcel to account for the unsafe assumptions you mentioned, so mbcel shouldn't need to use mbszero any more.

The only platforms where this might matter are BSDish platforms, which are lower priority for performance tuning anyway. That being said, performance seems to be OK on these platforms; see below.

The fact that here mbsinit shows up despite NDEBUG, indicates that you
haven't had mbrtoc32-regular included your testdir.

Ah, fair enough. I did the test over again with current Gnulib (commit 042a7042ac582e86693326943979b737c237a214) by running:


  gnulib-tool --create-testdir --dir a \
    mbiterf-bench-tests mbuiterf-bench-tests mbrtoc32-regular

then applying the attached patch, and running:

  gltests/bench-mbcel abcdefghij 1000000

This was with Ubuntu 23.04 and circa-2021 Xeon W-1350. Here are the results:

       noop  mbiterf mbuiterf    mbcel   mbucel
 a    1.546    2.038    1.931    1.639    1.968
 b    1.545    2.016    1.936    1.705    1.969
 c    2.153    8.160    8.255    6.160    5.550
 d    2,153    6,692    6,686    6,322    5,520
 e    2.234    6.866    6.817    6.214    5.781
 f    1.843   95.276  104.270   58.667   60.902
 g    1,844   76,126   83,899   65,541   66,983
 h    3.297   75.727   83.735   64.261   65.686
 i    1.941   31.064   34.887   26.262   27.072
 j    1.310   29.620   33.698   24.751   25.550
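
To put the table in relative terms, here is a small sketch that divides the mbiterf time by the mbcel time for a few rows. The struct and function names are made up for illustration; the numbers are copied from rows a, f and j above, and a plain ratio is only one of several reasonable ways to compare them.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the x86-64 table above: elapsed seconds for two columns.
   The type and names are hypothetical, not part of the benchmark.  */
struct sketch_row { char test; double mbiterf, mbcel; };

static struct sketch_row const sketch_rows[] = {
  { 'a',  2.038,  1.639 },
  { 'f', 95.276, 58.667 },
  { 'j', 29.620, 24.751 },
};

/* Return how many times longer mbiterf took than mbcel for row R.
   A value above 1 means mbcel was faster on that row.  */
static double
sketch_ratio (struct sketch_row r)
{
  assert (0 < r.mbcel);
  return r.mbiterf / r.mbcel;
}
```

Calling sketch_ratio on each element of sketch_rows gives the relative cost per test; it says nothing about the noop loop overhead, which is included in both columns.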

In some sense these benchmarks are biased against mbiter, as one can see further performance improvements to it (e.g., glibc fixing the C locale) that are fairly easy, whereas mbcel will be harder to tune (e.g., improve GCC's code generation on x86-64).

But in another sense these benchmarks are biased against mbcel, as this benchmark compiles in a special environment that defines NDEBUG, which most Gnulib-using code doesn't do, and that gives mbiter an unfair advantage.

Furthermore, microbenchmarks like these don't capture the issue of mbiter/mbcel in more realistic applications, where other factors come into play. In a larger application mbiter's larger code footprint is more likely to exhaust the instruction cache, which these microbenchmarks don't measure.

Regarding the "is faster", I guess that Solaris and *BSD platforms will
comes out faster with mbu?iterf.

I guess they're about the same in practice. Here is the same benchmark (except with a smaller count, 100000), run on cfarm106.cfarm.net, which is Oracle Solaris 11.4 running on a circa 2011 SPARC T4-2.


       noop  mbiterf mbuiterf    mbcel   mbucel
 a    0.281    0.722    0.833    0.831    0.827
 b    0.280    0.701    0.826    0.822    0.826
 c    0.391    3.130    3.926    3.324    3.430
 d    0,391    3,470    4,306    3,678    3,791
 e    0.405    3.943    4.899    4.102    4.259
 f    0.338   38.610   49.275   38.868   40.131
 Skipping test: locale el_GR.ISO-8859-7 not installed.
 h    0.597   53.185   67.347   53.443   54.541
 i    0.352   23.186   29.628   23.484   23.707
 j    0.237   25.194   30.781   25.286   25.826

With this in mind mbcel performance should be good enough on these secondary porting targets. As mbcel usage is easier to understand and less likely to have screwups, it seems like a win.

   mbscasecmp (...)
   {
     #if _GL_MBSTATE_ZERO_SIZE >= 12 /* BSD, Solaris */
     ... mbiterf based implementation ...
     #else
     ... mbcel based implementation ...
     #endif
   }

Now, regarding the claim "the two versions have the same behavior":

Can you prove it? I mean, mathematically prove it?

Oh, you're right. Although I had assumed a single-byte locale or UTF-8, when inputs contain encoding errors the two implementations are not equivalent for multi-byte encodings like GB18030 that are not ASCII-safe.

Since there is a difference between the two, I prefer mbcel's single-byte-per-encoding-error interpretation (let's call it "SEE") to mbiter's multi-byte-per-encoding-error interpretation ("MEE"), as SEE is simpler and is common practice elsewhere, notably Emacs. For example:

  printf -- '-*- coding: gb18030 -*-\n\201\302\n\20109\n\2010\2119\n' >GB18030.txt

  LC_ALL=zh_CN.gb18030 emacs GB18030.txt

  printf -- '-*- coding: utf-8 -*-\n\344\274\256\n\20109\n\303\242\n' >UTF-8.txt

  LC_ALL=zh_CN.utf8 emacs UTF-8.txt

Both instances of Emacs display this:

  伮
  \20109
  â

where the first line is U+4F2E CJK IDEOGRAPH-4F2E, the second line's "\201" is colored differently to represent an encoding error, and the last line is U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX. Emacs here is taking the SEE interpretation, not the MEE interpretation where (in GB18030 only) the \201 byte followed by an ASCII "0" and ASCII "9" should be treated as an encoding error followed by "9", not as an encoding error followed by "0" and then "9".
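
The difference can be modelled with a toy scanner that knows only the structural byte ranges of GB18030 (one-, two- and four-byte forms) and does not check whether a code point is assigned. This is a sketch, not mbiter or mbcel code: all names are hypothetical, and the MEE rule used here (the error swallows the longest prefix that could still begin a valid character) is an assumption chosen to reproduce the "\20109" example.

```c
#include <assert.h>
#include <stddef.h>

/* Structural GB18030 byte classes (assignment of code points ignored).  */
static int toy_lead (unsigned char b) { return 0x81 <= b && b <= 0xFE; }
static int toy_trail2 (unsigned char b)
{ return (0x40 <= b && b <= 0x7E) || (0x80 <= b && b <= 0xFE); }
static int toy_digit (unsigned char b) { return 0x30 <= b && b <= 0x39; }

typedef struct { int is_error; size_t len; } toy_elem;

/* Classify the element starting at P, where P < LIM.
   With MEE nonzero an encoding error covers the longest prefix that
   could begin a valid character; otherwise (SEE) it covers one byte.  */
static toy_elem
toy_scan (unsigned char const *p, unsigned char const *lim, int mee)
{
  assert (p < lim);
  size_t n = lim - p;
  if (p[0] <= 0x7F)
    return (toy_elem) { 0, 1 };         /* ASCII */
  if (!toy_lead (p[0]))
    return (toy_elem) { 1, 1 };         /* 0x80 or 0xFF cannot lead */
  if (2 <= n && toy_trail2 (p[1]))
    return (toy_elem) { 0, 2 };         /* two-byte form */
  size_t good = 1;                      /* bytes matching the 4-byte form */
  if (2 <= n && toy_digit (p[1]))
    {
      good = 2;
      if (3 <= n && toy_lead (p[2]))
        {
          good = 3;
          if (4 <= n && toy_digit (p[3]))
            return (toy_elem) { 0, 4 }; /* four-byte form */
        }
    }
  return (toy_elem) { 1, mee ? good : 1 };
}

/* Count elements (characters plus encoding errors) in BUF[0..N-1].  */
static size_t
toy_count (char const *buf, size_t n, int mee)
{
  unsigned char const *p = (unsigned char const *) buf, *lim = p + n;
  size_t count = 0;
  for (; p < lim; count++)
    p += toy_scan (p, lim, mee).len;
  return count;
}
```

For the three bytes "\20109", toy_count sees an error, "0" and "9" under SEE, but an error and "9" under this model of MEE; the valid sequences "\201\302" and "\2010\2119" are one character either way.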

The SEE/MEE distinction is a relatively minor issue in practice, as there is no difference in UTF-8, nowadays few people use legacy multi-byte encodings, and these few people don't care what mbscasecmp does with encoding errors so long as it is a valid ordering relation (which is true for both SEE and MEE).
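
As for why the ordering is valid: the attached mbcel.h sorts encoding errors after characters by shifting the error byte left by 14, which gives at least 0x200000 and so exceeds the largest Unicode value 0x10FFFF. The sketch below shows that idea in isolation. The type and function names are hypothetical, and it compares explicit keys rather than using the subtraction trick in mbcel_cmp.

```c
#include <assert.h>
#include <stdint.h>

enum { SKETCH_ERROR_SHIFT = 14 };

/* Hypothetical element: for a valid character ERR is 0 and CH is its
   code point; for an encoding error CH is 0 and ERR is the offending
   byte, which is at least 0x80.  */
typedef struct { uint32_t ch; unsigned char err; } sketch_elem;

/* Map an element to a single integer key.  Any error key is at least
   0x80 << 14 == 0x200000, above every character key (<= 0x10FFFF).  */
static long
sketch_key (sketch_elem g)
{
  return ((long) g.err << SKETCH_ERROR_SHIFT) + (long) g.ch;
}

/* Return <0, 0, >0 for <, =, >.  Comparing integer keys is a total
   order, so this is a valid ordering relation on elements.  */
static int
sketch_cmp (sketch_elem a, sketch_elem b)
{
  long ka = sketch_key (a), kb = sketch_key (b);
  return (ka > kb) - (ka < kb);
}
```

Because each element maps to one integer, the comparison is consistent no matter how the scanner split the bytes into elements, whether by SEE or by MEE.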

For applications that need MEE, it'd be easy for mbcel to supply it. Something like the following patch would do it, though it should be a different function (e.g., mbcel_scanmee) instead of being a change to mbcel_scan itself. However, diffutils doesn't need to bother with such a change as it can use SEE like most other programs do.


 --- a/lib/mbcel.h
 +++ b/lib/mbcel.h
 @@ -191,3 +191,3 @@ mbcel_scan (char const *p, char const *lim)
    if (_GL_UNLIKELY ((size_t) -1 / 2 < len))
 -    return (mbcel_t) { .err = *p, .len = 1 };
 +    return (mbcel_t) { .err = *p, .len = len == (size_t) -2 ? lim - p : 1 };
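
For readers who want to try the idea outside Gnulib, here is a rough standalone sketch of what a separate MEE scanner might look like. It is not the proposed mbcel_scanmee: it uses plain mbrtowc and memset from standard C instead of mbrtoc32 and the optimized mbstate_t initialization, and the type and function names are hypothetical. The only MEE-specific part is the (size_t) -2 case, as in the one-line patch above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical result type: a character CH, or an encoding error whose
   first byte is ERR, together with the LEN bytes consumed.  */
typedef struct { wchar_t ch; unsigned char err; size_t len; } sketch_t;

/* Scan from P inclusive to LIM exclusive; P must be less than LIM.
   An incomplete sequence (mbrtowc returning (size_t) -2) is reported
   as one error covering the rest of the buffer; an invalid sequence
   is reported as an error of length 1.  */
static sketch_t
sketch_scanmee (char const *p, char const *lim)
{
  assert (p < lim);
  unsigned char b = *p;
  if (b <= 0x7f)
    return (sketch_t) { .ch = b, .len = 1 };    /* ASCII fast path */

  mbstate_t mbs;
  memset (&mbs, 0, sizeof mbs);                 /* initial state */
  wchar_t wc;
  size_t len = mbrtowc (&wc, p, lim - p, &mbs);

  if (len == (size_t) -2)
    return (sketch_t) { .err = b, .len = lim - p };
  if (len == (size_t) -1)
    return (sketch_t) { .err = b, .len = 1 };
  if (len == 0)
    len = 1;    /* defensive: never return a zero-length element */
  return (sketch_t) { .ch = wc, .len = len };
}
```

The result depends on the current locale for non-ASCII bytes, so a caller would normally run setlocale first; ASCII bytes take the fast path in any locale.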

diff -ruN a/gllib/Makefile.am b/gllib/Makefile.am
--- a/gllib/Makefile.am	2023-07-26 13:00:08.560745734 -0700
+++ b/gllib/Makefile.am	2023-07-26 13:10:16.914899799 -0700
@@ -3491,6 +3491,7 @@
 
 ## end   gnulib module yield
 
+libgnu_a_SOURCES += mbcel.c mbcel.h
 
 mostlyclean-local: mostlyclean-generic
 	@for dir in '' $(MOSTLYCLEANDIRS); do \
diff -ruN a/gllib/mbcel.c b/gllib/mbcel.c
--- a/gllib/mbcel.c	1969-12-31 16:00:00.000000000 -0800
+++ b/gllib/mbcel.c	2023-07-26 13:04:35.171885142 -0700
@@ -0,0 +1,3 @@
+#include <config.h>
+#define MBCEL_INLINE _GL_EXTERN_INLINE
+#include "mbcel.h"
diff -ruN a/gllib/mbcel.h b/gllib/mbcel.h
--- a/gllib/mbcel.h	1969-12-31 16:00:00.000000000 -0800
+++ b/gllib/mbcel.h	2023-07-27 09:25:00.197917147 -0700
@@ -0,0 +1,266 @@
+/* Multi-byte characters, error encodings, and lengths
+   Copyright 2023 Free Software Foundation, Inc.
+
+   This file is free software: you can redistribute it and/or modify
+   it under the terms of the GNU Lesser General Public License as
+   published by the Free Software Foundation; either version 3 of the
+   License, or (at your option) any later version.
+
+   This file is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Paul Eggert.  */
+
+/* The mbcel_scan function lets code iterate through an array of bytes,
+   supporting character encodings in practical use
+   more simply than using plain mbrtoc32.
+
+   Instead of this single-byte code:
+
+      char *p = ..., *lim = ...;
+      for (; p < lim; p++)
+        process (*p);
+
+   You can use this multi-byte code:
+
+      char *p = ..., *lim = ...;
+      for (mbcel_t g; p < lim; p += g.len)
+        {
+	  g = mbcel_scan (p, lim);
+	  process (g);
+	}
+
+   You can select from G using G.ch, G.err, and G.len.
+
+   The mbcel_scanz function is similar except it works with a
+   string of unknown length that is terminated with '\0'.
+   Instead of this single-byte code:
+
+      char *p = ...;
+      for (; *p; p++)
+	process (*p);
+
+   You can use this multi-byte code:
+
+      char *p = ...;
+      for (mbcel_t g; *p; p += g.len)
+	{
+	  g = mbcel_scanz (p);
+	  process (g);
+	}
+
+   mbcel_scant (P, TERMINATOR) is like mbcel_scanz (P) except the
+   string is terminated by TERMINATOR.  The TERMINATORs '\0', '\r',
+   '\n', '.', '/' are safe, as they cannot be a part (even a trailing
+   byte) of a multi-byte character.
+
+   mbcel_cmp (G1, G2) and mbcel_casecmp (G1, G2) compare two mbcel_t
+   values lexicographically by character or by encoding byte value,
+   with encoding bytes sorting after characters.  mbcel_casecmp
+   ignores case in characters.  mbcel_strcasecmp compares two
+   null-terminated strings lexicographically.
+
+   Although ISO C and POSIX allow encodings that have shift states or
+   that can produce multiple characters from an indivisible byte sequence,
+   POSIX does not require support for these encodings,
+   they are not in practical use on GNUish platforms,
+   and omitting support for them simplifies the API.  */
+
+#ifndef _MBCEL_H
+#define _MBCEL_H 1
+
+/* This file uses _GL_INLINE_HEADER_BEGIN, _GL_INLINE,
+   _GL_ATTRIBUTE_MAY_ALIAS.  */
+#if !_GL_CONFIG_H_INCLUDED
+ #error "Please include config.h first."
+#endif
+
+#include <limits.h>
+#include <stddef.h>
+#include <uchar.h>
+
+/* The maximum multibyte character length supported on any platform.
+   This can be less than MB_LEN_MAX because many platforms have a
+   large MB_LEN_MAX to allow for stateful encodings, and mbcel does
+   not need to support these encodings.  MBCEL_LEN_MAX is enough for
+   UTF-8, EUC, Shift-JIS, GB18030, etc.
+   0 < MB_CUR_MAX <= MBCEL_LEN_MAX <= MB_LEN_MAX.  */
+enum { MBCEL_LEN_MAX = MB_LEN_MAX < 4 ? MB_LEN_MAX : 4 };
+
+/* mbcel_t is a type representing a character CH or an encoding error byte ERR,
+   along with a count of the LEN bytes that represent CH or ERR.
+   If ERR is zero, CH is a valid character and 1 <= LEN <= MB_LEN_MAX;
+   otherwise ERR is an encoding error byte, 0x80 <= ERR <= UCHAR_MAX,
+   CH == 0, and LEN == 1.  */
+typedef struct
+{
+  char32_t ch;
+  unsigned char err;
+  unsigned char len;
+} mbcel_t;
+
+/* On all known platforms, every multi-byte character length fits in
+   mbcel_t's LEN.  Check this.  */
+static_assert (MB_LEN_MAX <= UCHAR_MAX);
+
+/* Pacify GCC re '*p <= 0x7f' below.  */
+#if defined __GNUC__ && 4 < __GNUC__ + (3 <= __GNUC_MINOR__)
+# pragma GCC diagnostic ignored "-Wtype-limits"
+#endif
+
+_GL_INLINE_HEADER_BEGIN
+#ifndef MBCEL_INLINE
+# define MBCEL_INLINE _GL_INLINE
+#endif
+
+/* With mbcel there should be no need for the performance overhead of
+   replacing glibc mbrtoc32, as callers shouldn't care whether the
+   C locale treats a byte with the high bit set as an encoding error.  */
+#ifdef __GLIBC__
+# undef mbrtoc32
+#endif
+
+/* Shifting an encoding error byte (which must be at least 2**7)
+   left by 14 yields at least 2**21 (0x200000), which is greater
+   than the maximum Unicode value 0x10FFFF.  This suffices to sort
+   encoding errors after characters.  */
+enum { MBCEL_ENCODING_ERROR_SHIFT = 14 };
+
+/* In the typical case where unsigned char easily fits in int,
+   optimizations are possible.  */
+enum {
+  MBCEL_UCHAR_FITS = UCHAR_MAX <= INT_MAX,
+  MBCEL_UCHAR_EASILY_FITS = UCHAR_MAX <= INT_MAX >> MBCEL_ENCODING_ERROR_SHIFT
+};
+
+#ifndef _GL_LIKELY
+/* Rely on __builtin_expect, as provided by the module 'builtin-expect'.  */
+# define _GL_LIKELY(cond) __builtin_expect ((cond), 1)
+# define _GL_UNLIKELY(cond) __builtin_expect ((cond), 0)
+#endif
+
+/* Scan bytes from P inclusive to LIM exclusive.  P must be less than LIM.
+   Return either the valid character starting at P,
+   or the encoding error of length 1 at P.  */
+MBCEL_INLINE mbcel_t
+mbcel_scan (char const *p, char const *lim)
+{
+  /* Handle ASCII quickly to avoid the overhead of calling mbrtoc32.
+     In supported encodings, the first byte of a multi-byte character
+     cannot be an ASCII byte.  */
+  if (_GL_LIKELY (0 <= *p && *p <= 0x7f))
+    return (mbcel_t) { .ch = *p, .len = 1 };
+
+  /* An initial mbstate_t; initialization optimized for some platforms.
+     For details about these and other platforms, see wchar.in.h.  */
+#if defined __GLIBC__ && 2 < __GLIBC__ + (2 <= __GLIBC_MINOR__)
+  /* Although only a trivial optimization, it's worth it for GNU.  */
+  mbstate_t mbs; mbs.__count = 0;
+#elif (defined __FreeBSD__ || defined __DragonFly__ || defined __OpenBSD__ \
+       || (defined __APPLE__ && defined __MACH__))
+  /* These platforms have 128-byte mbstate_t.  What were they thinking?
+     Initialize just for supported encodings (UTF-8, EUC, etc.).
+     Avoid memset because some compilers generate function call code.  */
+  struct mbhidden { char32_t ch; int utf8_want, euc_want; }
+    _GL_ATTRIBUTE_MAY_ALIAS;
+  union { mbstate_t m; struct mbhidden s; } u;
+  u.s.ch = u.s.utf8_want = u.s.euc_want = 0;
+# define mbs u.m
+#elif defined __NetBSD__
+  /* Experiments on both 32- and 64-bit NetBSD platforms have
+     shown that it doesn't work to clear fewer than 24 bytes.  */
+  struct mbhidden { long long int a, b, c; } _GL_ATTRIBUTE_MAY_ALIAS;
+  union { mbstate_t m; struct mbhidden s; } u;
+  u.s.a = u.s.b = u.s.c = 0;
+# define mbs u.m
+#else
+  /* mbstate_t has unknown structure or is not worth optimizing.  */
+  mbstate_t mbs = {0};
+#endif
+
+  char32_t ch;
+  size_t len = mbrtoc32 (&ch, p, lim - p, &mbs);
+
+  /* Any LEN with top bit set is an encoding error, as LEN == (size_t) -3
+     is not supported and MB_LEN_MAX is small.  */
+  if (_GL_UNLIKELY ((size_t) -1 / 2 < len))
+    return (mbcel_t) { .err = *p, .len = 1 };
+
+  /* Tell the compiler LEN is at most MB_LEN_MAX,
+     as this can help GCC generate better code.  */
+  if (! (len <= MB_LEN_MAX))
+    unreachable ();
+
+  /* A multi-byte character.  LEN must be positive,
+     as *P != '\0' and shift sequences are not supported.  */
+  return (mbcel_t) { .ch = ch, .len = len };
+}
+
+/* Scan bytes from P, a byte sequence terminated by TERMINATOR.
+   If *P == TERMINATOR, scan just that byte; otherwise scan
+   bytes up to but not including a TERMINATOR byte.
+   TERMINATOR must be ASCII, and should be '\0', '\r', '\n', '.', or '/'.
+   Return either the valid character starting at P,
+   or the encoding error of length 1 at P.  */
+MBCEL_INLINE mbcel_t
+mbcel_scant (char const *p, char terminator)
+{
+  /* Handle ASCII quickly for speed.  */
+  if (_GL_LIKELY (0 <= *p && *p <= 0x7f))
+    return (mbcel_t) { .ch = *p, .len = 1 };
+
+  /* Defer to mbcel_scan for non-ASCII.  Compute length with code that
+     is typically branch-free and faster than memchr or strnlen.  */
+  char const *lim = p + 1;
+  for (int i = 0; i < MBCEL_LEN_MAX - 1; i++)
+    lim += *lim != terminator;
+  return mbcel_scan (p, lim);
+}
+
+/* Scan bytes from P, a byte sequence terminated by '\0'.
+   If *P == '\0', scan just that byte; otherwise scan
+   bytes up to but not including a '\0'.
+   Return either the valid character starting at P,
+   or the encoding error of length 1 at P.  */
+MBCEL_INLINE mbcel_t
+mbcel_scanz (char const *p)
+{
+  return mbcel_scant (p, '\0');
+}
+
+/* Compare G1 and G2, with encoding errors sorting after characters.
+   Return <0, 0, >0 for <, =, >.  */
+MBCEL_INLINE int
+mbcel_cmp (mbcel_t g1, mbcel_t g2)
+{
+  int c1 = g1.ch, c2 = g2.ch, e1 = g1.err, e2 = g2.err, ccmp = c1 - c2,
+    ecmp = MBCEL_UCHAR_EASILY_FITS ? e1 - e2 : _GL_CMP (e1, e2);
+  return (ecmp << MBCEL_ENCODING_ERROR_SHIFT) + ccmp;
+}
+
+/* Compare G1 and G2 ignoring case, with encoding errors sorting after
+   characters.  Return <0, 0, >0 for <, =, >.  */
+MBCEL_INLINE int
+mbcel_casecmp (mbcel_t g1, mbcel_t g2)
+{
+  int cmp = mbcel_cmp (g1, g2);
+  if (_GL_LIKELY (g1.err | g2.err | !cmp))
+    return cmp;
+  int c1 = c32tolower (g1.ch);
+  int c2 = c32tolower (g2.ch);
+  return c1 - c2;
+}
+
+/* Compare the multi-byte strings S1 and S2 lexicographically, ignoring case.
+   Return <0, 0, >0 for <, =, >.  Consider encoding errors to be
+   greater than characters and compare them byte by byte.  */
+int mbcel_strcasecmp (char const *s1, char const *s2);
+
+_GL_INLINE_HEADER_END
+
+#endif /* _MBCEL_H */
diff -ruN a/gltests/bench-mbcel.c b/gltests/bench-mbcel.c
--- a/gltests/bench-mbcel.c	1969-12-31 16:00:00.000000000 -0800
+++ b/gltests/bench-mbcel.c	2023-07-27 09:30:10.736386027 -0700
@@ -0,0 +1,259 @@
+/* Benchmarks mbiterf, mbuiterf and mbcel.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+#include <config.h>
+
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <locale.h>
+#include <uchar.h>
+
+#include "bench.h"
+#include "bench-multibyte.h"
+#include "mbiterf.h"
+#include "mbuiterf.h"
+#include "mbcel.h"
+
+typedef unsigned long long (*test_function) (char const *, char const *, int);
+
+static unsigned long long
+noop_test (char const *text, char const *text_end, int repeat)
+{
+  unsigned long long sum = 0;
+
+  for (int count = 0; count < repeat; count++)
+    {
+      const char *iter;
+      for (iter = text; iter < text_end; iter++)
+        sum += (uintptr_t) iter;
+    }
+
+  return sum;
+}
+
+static unsigned long long
+mbiterf_test (char const *text, char const *text_end, int repeat)
+{
+  unsigned long long sum = 0;
+
+  for (int count = 0; count < repeat; count++)
+    {
+      mbif_state_t state;
+      const char *iter;
+      for (mbif_init (state), iter = text; mbif_avail (state, iter, text_end); )
+        {
+          mbchar_t cur = mbif_next (state, iter, text_end);
+          sum += cur.wc;
+          iter += mb_len (cur);
+        }
+    }
+
+  return sum;
+}
+
+static unsigned long long
+mbuiterf_test (char const *text, _GL_UNUSED char const *text_end, int repeat)
+{
+  unsigned long long sum = 0;
+
+  for (int count = 0; count < repeat; count++)
+    {
+      mbuif_state_t state;
+      const char *iter;
+      for (mbuif_init (state), iter = text; mbuif_avail (state, iter); )
+        {
+          mbchar_t cur = mbuif_next (state, iter);
+          sum += cur.wc;
+          iter += mb_len (cur);
+        }
+    }
+
+  return sum;
+}
+
+static unsigned long long
+mbcel_test (char const *text, char const *text_end, int repeat)
+{
+  unsigned long long sum = 0;
+
+  for (int count = 0; count < repeat; count++)
+    {
+      char const *iter = text;
+      for (mbcel_t g; iter < text_end; iter += g.len)
+        {
+          g = mbcel_scan (iter, text_end);
+          sum += g.ch;
+        }
+    }
+
+  return sum;
+}
+
+static unsigned long long
+mbucel_test (char const *text, _GL_UNUSED char const *text_end, int repeat)
+{
+  unsigned long long sum = 0;
+
+  for (int count = 0; count < repeat; count++)
+    {
+      const char *iter = text;
+      for (mbcel_t g; *iter; iter += g.len)
+        {
+          g = mbcel_scanz (iter);
+          sum += g.ch;
+        }
+    }
+
+  return sum;
+}
+
+static unsigned long long
+do_1_test (test_function test, char const *text,
+           char const *text_end, int repeat, struct timings_state *ts)
+{
+  timing_start (ts);
+  unsigned long long sum = test (text, text_end, repeat);
+  timing_end (ts);
+  return sum;
+}
+
+static void
+do_test (char test, int repeat, const char *locale_name,
+         const char *text)
+{
+  if (setlocale (LC_ALL, locale_name) != NULL)
+    {
+      size_t text_len = strlen (text);
+      const char *text_end = text + text_len;
+
+      static struct
+      {
+        char const *name;
+        test_function fn;
+        struct timings_state ts;
+        unsigned long long volatile sum;
+      } testdesc[] = {
+        { "noop", noop_test, {0} },
+        { "mbiterf", mbiterf_test, {0} },
+        { "mbuiterf", mbuiterf_test, {0} },
+        { "mbcel", mbcel_test, {0} },
+        { "mbucel", mbucel_test, {0} },
+      };
+      int ntestdesc = sizeof testdesc / sizeof *testdesc;
+      for (int i = 0; i < ntestdesc; i++)
+        testdesc[i].sum =
+          do_1_test (testdesc[i].fn, text, text_end, repeat, &testdesc[i].ts);
+
+      static bool header_printed;
+      if (!header_printed)
+        {
+          printf (" ");
+          for (int i = 0; i < ntestdesc; i++)
+            printf (" %8s", testdesc[i].name);
+          printf ("\n");
+          header_printed = true;
+        }
+
+      printf ("%c", test);
+      for (int i = 0; i < ntestdesc; i++)
+        {
+          double user_usec = testdesc[i].ts.user_usec;
+          double sys_usec = testdesc[i].ts.sys_usec;
+          printf (" %8.3f", (user_usec + sys_usec) / 1e6);
+        }
+      printf ("\n");
+    }
+  else
+    {
+      printf ("Skipping test: locale %s not installed.\n", locale_name);
+    }
+}
+
+/* Performs some or all of the following tests:
+     a - ASCII text, C locale
+     b - ASCII text, UTF-8 locale
+     c - French text, C locale
+     d - French text, ISO-8859-1 locale
+     e - French text, UTF-8 locale
+     f - Greek text, C locale
+     g - Greek text, ISO-8859-7 locale
+     h - Greek text, UTF-8 locale
+     i - Chinese text, UTF-8 locale
+     j - Chinese text, GB18030 locale
+   Pass the tests to be performed as first argument.  */
+int
+main (int argc, char *argv[])
+{
+  if (argc != 3)
+    {
+      fprintf (stderr, "Usage: %s TESTS REPETITIONS\n", argv[0]);
+
+      fprintf (stderr, "Example: %s abcdefghij 100000\n", argv[0]);
+      exit (1);
+    }
+
+  const char *tests = argv[1];
+  int repeat = atoi (argv[2]);
+
+  text_init ();
+
+  /* Execute each test.  */
+  size_t i;
+  for (i = 0; i < strlen (tests); i++)
+    {
+      char test = tests[i];
+
+      switch (test)
+        {
+        case 'a':
+          do_test (test, repeat, "C", text_latin_ascii);
+          break;
+        case 'b':
+          do_test (test, repeat, "en_US.UTF-8", text_latin_ascii);
+          break;
+        case 'c':
+          do_test (test, repeat, "C", text_french_iso8859);
+          break;
+        case 'd':
+          do_test (test, repeat, "fr_FR.ISO-8859-1", text_french_iso8859);
+          break;
+        case 'e':
+          do_test (test, repeat, "en_US.UTF-8", text_french_utf8);
+          break;
+        case 'f':
+          do_test (test, repeat, "C", text_greek_iso8859);
+          break;
+        case 'g':
+          do_test (test, repeat, "el_GR.ISO-8859-7", text_greek_iso8859);
+          break;
+        case 'h':
+          do_test (test, repeat, "en_US.UTF-8", text_greek_utf8);
+          break;
+        case 'i':
+          do_test (test, repeat, "en_US.UTF-8", text_chinese_utf8);
+          break;
+        case 'j':
+          do_test (test, repeat, "zh_CN.GB18030", text_chinese_gb18030);
+          break;
+        default:
+          /* Ignore.  */
+          ;
+        }
+    }
+
+  return 0;
+}
diff -ruN a/gltests/Makefile.am b/gltests/Makefile.am
--- a/gltests/Makefile.am	2023-07-26 13:00:20.616693605 -0700
+++ b/gltests/Makefile.am	2023-07-26 13:07:32.591407265 -0700
@@ -1558,6 +1558,11 @@
 
 ## end   gnulib module wcwidth-tests
 
+noinst_PROGRAMS += bench-mbcel
+bench_mbcel_CPPFLAGS = $(AM_CPPFLAGS) -DNDEBUG
+bench_mbcel_LDADD = $(LDADD) $(LIBUNISTRING) $(SETLOCALE_LIB) $(MBRTOWC_LIB) $(LIBC32CONV)
+EXTRA_DIST += bench-mbcel.c bench-multibyte.h bench.h
+
 all: all-notice
 all-notice:
 	@echo '## ---------------------------------------------------- ##'
