Re: [PATCH, resend] Handle multibyte codepoint width properly

Bruno Haible Thu, 05 Apr 2012 12:17:13 -0700

Hi Vladimir,

> > - The functions __argp_get_display_len and add_length don't write to
> >   the memory delimited by 'beg', 'end', 'ptr, 'end'. Therefore it is good
> >   style to declare these parameters as being 'const char *' rather than
> >   'char *'. The general rule of thumb is: Use 'const char *' instead of
> >   'char *' as long as it does not lead to warnings with "gcc -Wall" and
> >   does not force you to introduce casts.
>
> I'll have to change return types to offset instead of pointer to avoid
> casts.


Yup, that is the right thing to do.

> > - The function __argp_get_display_len looks very similar to mbsnwidth(),
> >   from module 'mbswidth'. Could you use that function? One of the gnulib
> >   principles is to reuse code that is already in gnulib, where it makes 
> > sense.
> mbsnwidth returns plainly -1 in case of incorrect multibyte sequence.
> This is unfortunately not uncommon that the .mo claims incorrect
> encoding

mbsnwidth returns -1 in such a case only if the option MBSW_REJECT_INVALID
is passed as third argument. If you pass 0, mbsnwidth will not return -1;
instead, it will assume width 1 for every invalid byte or unprintable
character.

> Perhaps we can use an approximation one byte = one column in case of
> failure

Yes, that's the common assumption.

> > - The function __argp_get_display_len looks very similar to mbsnwidth(),
> Remaining is the issue due to escape sequences.

What is the use case? PO file editors are not required to support editing
of strings with control characters. msgfmt warns when a message in a PO file
contains an unusual control character like ESC.

> it is used in en@boldquot

Ah, right. But I don't know how frequently it is used; maybe I and Simon
were the only persons to ever use this? If we want to support this, not
only mbswidth has to be modified, but basically any code that uses
wcwidth - including libunistring. So, until this is discussed (and possibly
generalized to more languages than 'en'), I propose to get away without
it.

> Done but the test is valid only for UTF-8 locales. Should I force some
> specific locale? It's impossible to make a test working in all locales
> since in case of e.g. ASCII we don't have such characters at all.

In such a situation, it is best to split the test into two parts: a part
that can be executed on every machine, and a part which can only be executed
on a system with a UTF-8 locale. This way, the first part is not skipped
just because the system has no UTF-8 locale.

Please take a look how it's done in module 'mbsstr-tests':
  - test-mbsstr1.c is a test that doesn't need a particular locale.
  - test-mbsstr2.c is a test that requires a UTF-8 locale. We use the
    French one for simplicity. (If a system does not have fr_FR.UTF-8
    installed, it would be unlikely that it has ru_RU.UTF-8 installed.)
  - test-mbsstr2.sh is a wrapper script that uses the LOCALE_FR_UTF8
    value, determined by m4/locale-fr.m4, and invokes test-mbsstr2.

> I have the assignment covering any contribution to GRUB. So this should
> go pretty quickly.

Thanks for pursuing this.

> Alternatively I can commit it in GRUB and you'll lift it from there.

This would not be so ideal.

> +      if (wc == '\e' && ptr + 3 < end
> +          && ptr[1] == '[' && (ptr[2] == '0' || ptr[2] == '1')
> +          && ptr[3] == 'm')

'\e' is not portable, only GCC supports it. Use '\x1b' or '\033' instead.

Also, the test  ptr + 3 < end  is wrong. Should be written as
  end - ptr > 3
instead. (Think of ptr = 0xFFFFFFD, end = 0xFFFFFFFE on a 32-bit machine.)
Sure, on many systems this won't matter, because this memory range is
either unmapped or occupied by the stack. But in general you have no guarantee
that the memory page from 0xFFFFC000..0xFFFFFFFF will not be used for malloc().

Bruno

Re: [PATCH, resend] Handle multibyte codepoint width properly

Reply via email to