Re: [committed V3] libstdc++: Add Unicode-aware width estimation for std::format

Jonathan Wakely Mon, 08 Jan 2024 14:56:40 -0800

On Mon, 8 Jan 2024 at 01:19, Jonathan Wakely <jwak...@redhat.com> wrote:
>
> I decided to push this now, not wait for the morning.
>
> This is mostly the same as V2, but adds to the contrib/unicode/README as
> suggested by Lewis, and avoids a trailing whitespace character in the
> generated header.
>
> Tested x86_64-linux and aarch64-linux. Pushed to trunk.
>
> -- >8 --
>
>
> This implements the requirements in the following proposals, which
> dictate how std::format deals with non-ASCII strings:
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1868r1.html
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2572r1.html
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2675r1.pdf
>
> There are two parts to this. The width estimation for strings must only
> count the width of the first character in an extended grapheme cluster.
> That requires implementing the algorithm for detecting cluster breaks,
> which requires a number of lookup tables of the grapheme cluster break
> properties (and Indic_Conjunct_Break and Extended_Pictographic
> properties) of every code point. Additionally, some characters have a
> field width of 2, which requires another lookup table of field widths
> for every code point.  The tables added in this commit do not contain
> entries for every code point from 0 to 0x10FFFF as that would be very
> inefficient and use too much memory. Instead the tables only contain the
> code points that form an "edge" for a property, omitting all the code
> points that have the same property as the preceding one. We can use a
> binary search to find the closest code point in the table that is not
> greater than the one we're looking for.
>
> The tables are generated by a new Python script added to the
> contrib/unicode directory, and a new data file downloaded from the
> Unicode Consortium website.
>
> The rules for extended grapheme cluster breaking are implemented for the
> latest Unicode standard, version 15.1.0.
>
> libstdc++-v3/ChangeLog:
>
>         * include/Makefile.am: Add new headers.
>         * include/Makefile.in: Regenerate.
>         * include/bits/unicode.h: New file.
>         * include/bits/unicode-data.h: New file.
>         * include/std/format: Include <bits/unicode.h>.
>         (__literal_encoding_is_utf8): Move to <bits/unicode.h>.
>         (_Spec::_M_fill): Change type to char32_t.
>         (_Spec::_M_parse_fill_and_align): Read a Unicode scalar value
>         instead of a single character.
>         (__write_padded): Change __fill_char parameter to char32_t and
>         encode it into the output.
>         (__formatter_str::format): Use new __unicode::__field_width and
>         __unicode::__truncate functions.
>         * include/std/ostream: Adjust namespace qualification for
>         __literal_encoding_is_utf8.
>         * include/std/print: Likewise.
>         * src/c++23/print.cc: Add [[unlikely]] attribute to error path.
>         * testsuite/ext/unicode/view.cc: New test.
>         * testsuite/std/format/functions/format.cc: Add missing examples
>         from the standard demonstrating alignment with non-ASCII
>         characters. Add examples checking correct handling of extended
>         grapheme clusters.
>
> contrib/ChangeLog:
>
>         * unicode/README: Add notes about generating libstdc++ tables.
>         * unicode/GraphemeBreakProperty.txt: New file.
>         * unicode/emoji-data.txt: New file.
>         * unicode/gen_libstdcxx_unicode_data.py: New file.
> ---



While writing some more tests I realised I'd forgotten to finish this
function, and had left it as a copy&paste from __field_width(char32_t)
above:

> +  constexpr bool
> +  __is_extended_pictographic(char32_t __c)
> +  {
> +    if (__c < __xpicto_edges[0]) [[likely]]
> +      return 1;
> +
> +    auto* __p = std::upper_bound(__xpicto_edges, std::end(__xpicto_edges), __c);
> +    return (__p - __xpicto_edges) % 2 + 1;
> +  }

It should be:

  constexpr bool
  __is_extended_pictographic(char32_t __c)
  {
    if (__c < __xpicto_edges[0]) [[likely]]
      return false;

    auto* __p = std::upper_bound(__xpicto_edges, std::end(__xpicto_edges), __c);
    return (__p - __xpicto_edges) % 2;
  }

I'll push a fix for that (and add my new tests) tomorrow.
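
For anyone following along, here is a minimal sketch of why the "% 2"
lookup works on an edge table, and why the copy&paste version was wrong.
The table and the toy_* names below are hypothetical stand-ins with toy
values, not the generated __xpicto_edges data. The sketch assumes the
layout described in the commit message: sorted code points where each
entry starts a run, with runs alternating between "property set"
(even index) and "property not set" (odd index).

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical edge table (toy values, NOT the generated libstdc++ data).
// Even-indexed entries start a run that has the property,
// odd-indexed entries start a run that does not.
static constexpr char32_t toy_edges[] = {
  0x00A9, 0x00AA,   // property set for U+00A9 only
  0x00AE, 0x00AF,   // property set for U+00AE only
  0x2600, 0x2606,   // property set for U+2600..U+2605
};

// Same shape as the corrected function: count the edges that are <= c.
// An odd count means the last edge crossed switched the property on.
inline bool
toy_has_property(char32_t c)
{
  if (c < toy_edges[0])
    return false;

  // upper_bound finds the first edge greater than c, so the distance
  // from the start of the table is the number of edges <= c.
  auto* p = std::upper_bound(toy_edges, std::end(toy_edges), c);
  return (p - toy_edges) % 2;
}

// Same shape as the unfinished function: the expression is always 1 or 2
// (a field width), and both values convert to true.
inline bool
toy_unfinished(char32_t c)
{
  if (c < toy_edges[0])
    return 1;

  auto* p = std::upper_bound(toy_edges, std::end(toy_edges), c);
  return (p - toy_edges) % 2 + 1;
}
```

With this toy table, toy_has_property(U'\u00A9') is true and
toy_has_property(U'\u00AA') is false, whereas toy_unfinished returns
true for every code point, which is the symptom of the bug above.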
