On Mon, 8 Jan 2024 at 01:19, Jonathan Wakely <jwak...@redhat.com> wrote:
>
> I decided to push this now, not wait for the morning.
>
> This is mostly the same as V2, but adds to the contrib/unicode/README as
> suggested by Lewis, and avoids a trailing whitespace character in the
> generated header.
>
> Tested x86_64-linux and aarch64-linux. Pushed to trunk.
>
> -- >8 --
>
>
> This implements the requirements in the following proposals, which
> dictate how std::format deals with non-ASCII strings:
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1868r1.html
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2572r1.html
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2675r1.pdf
>
> There are two parts to this. The width estimation for strings must only
> count the width of the first character in an extended grapheme cluster.
> That requires implementing the algorithm for detecting cluster breaks,
> which requires a number of lookup tables of the grapheme cluster break
> properties (and Indic_Conjunct_Break and Extended_Pictographic
> properties) of every code point. Additionally, some characters have a
> field width of 2, which requires another lookup table of field widths
> for every code point. The tables added in this commit do not contain
> entries for every code point from 0 to 0x10FFFF as that would be very
> inefficient and use too much memory. Instead the tables only contain the
> code points that form an "edge" for a property, omitting all the code
> points that have the same property as the preceding one. We can use a
> binary search to find the closest code point in the table that is not
> greater than the one we're looking for.
>
> The tables are generated by a new Python script added to the
> contrib/unicode directory, and a new data file downloaded from the
> Unicode Consortium website.
>
> The rules for extended grapheme cluster breaking are implemented for the
> latest Unicode standard, version 15.1.0.
>
> libstdc++-v3/ChangeLog:
>
>         * include/Makefile.am: Add new headers.
>         * include/Makefile.in: Regenerate.
>         * include/bits/unicode.h: New file.
>         * include/bits/unicode-data.h: New file.
>         * include/std/format: Include <bits/unicode.h>.
>         (__literal_encoding_is_utf8): Move to <bits/unicode.h>.
>         (_Spec::_M_fill): Change type to char32_t.
>         (_Spec::_M_parse_fill_and_align): Read a Unicode scalar value
>         instead of a single character.
>         (__write_padded): Change __fill_char parameter to char32_t and
>         encode it into the output.
>         (__formatter_str::format): Use new __unicode::__field_width and
>         __unicode::__truncate functions.
>         * include/std/ostream: Adjust namespace qualification for
>         __literal_encoding_is_utf8.
>         * include/std/print: Likewise.
>         * src/c++23/print.cc: Add [[unlikely]] attribute to error path.
>         * testsuite/ext/unicode/view.cc: New test.
>         * testsuite/std/format/functions/format.cc: Add missing examples
>         from the standard demonstrating alignment with non-ASCII
>         characters. Add examples checking correct handling of extended
>         grapheme clusters.
>
> contrib/ChangeLog:
>
>         * unicode/README: Add notes about generating libstdc++ tables.
>         * unicode/GraphemeBreakProperty.txt: New file.
>         * unicode/emoji-data.txt: New file.
>         * unicode/gen_libstdcxx_unicode_data.py: New file.
> ---
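
For readers unfamiliar with the "edge" table idea described in the quoted
commit message, here is a minimal sketch of how such a lookup might work.
The names Edge, width_edges and lookup_width are hypothetical, and the
table contents are illustrative only. This is not the layout used in
<bits/unicode-data.h>.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical edge entry: the first code point of a run, plus the
// property value that holds from there until the next entry.
struct Edge { char32_t first; int width; };

// Illustrative data only, not the real Unicode width table.
inline constexpr Edge width_edges[] = {
  { 0x0000, 1 },
  { 0x1100, 2 },
  { 0x1160, 1 },
  { 0x2E80, 2 },
  { 0xA4D0, 1 },
};

inline int lookup_width(char32_t c)
{
  // Binary search for the first edge that is strictly greater than c ...
  auto* p = std::upper_bound(std::begin(width_edges), std::end(width_edges), c,
                             [](char32_t lhs, const Edge& rhs)
                             { return lhs < rhs.first; });
  // ... then step back to the closest edge that is not greater than c.
  // The table starts at code point 0, so p is never the first element.
  return (p - 1)->width;
}
```

With five entries the table answers queries for the whole range 0 to
0x10FFFF, which is the point of storing only the edges.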
While writing some more tests I realised I'd forgotten to finish this
function, and had left it as a copy&paste from __field_width(char32_t)
above:

> +    constexpr bool
> +    __is_extended_pictographic(char32_t __c)
> +    {
> +      if (__c < __xpicto_edges[0]) [[likely]]
> +       return 1;
> +
> +      auto* __p = std::upper_bound(__xpicto_edges, std::end(__xpicto_edges), __c);
> +      return (__p - __xpicto_edges) % 2 + 1;
> +    }

It should be:

    constexpr bool
    __is_extended_pictographic(char32_t __c)
    {
      if (__c < __xpicto_edges[0]) [[likely]]
        return false;

      auto* __p = std::upper_bound(__xpicto_edges, std::end(__xpicto_edges), __c);
      return (__p - __xpicto_edges) % 2;
    }

I'll push a fix for that (and add my new tests) tomorrow.
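
To illustrate why the copied version is wrong for a bool result, here is
a small stand-alone sketch. The names edges, has_property and
has_property_buggy are hypothetical, and the table values are
illustrative rather than taken from the generated header. It assumes an
edge list where the property is off below the first entry and flips at
every entry.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical edge list for a boolean property (illustrative values).
// The property is false below edges[0] and toggles at each entry.
inline constexpr char32_t edges[] = {
  0x00A9, 0x00AA, 0x00AE, 0x00AF, 0x203C, 0x203D
};

inline bool has_property(char32_t c)
{
  if (c < edges[0])
    return false;
  auto* p = std::upper_bound(std::begin(edges), std::end(edges), c);
  // An odd number of edges at or below c means the property is on.
  return (p - edges) % 2 != 0;
}

// The same lookup with the "+ 1" left over from a width calculation:
// the expression yields 1 or 2, and both convert to true.
inline bool has_property_buggy(char32_t c)
{
  auto* p = std::upper_bound(std::begin(edges), std::end(edges), c);
  return (p - edges) % 2 + 1;
}
```

In this sketch has_property(0x00AA) is false while
has_property_buggy(0x00AA) is true, so the unfinished version reports the
property for every code point.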