On Thu, 28 Sept 2023 at 20:39, Dimitrij Mijoski via Libstdc++
<[email protected]> wrote:
>
> This patch fixes the handling of surrogate code points in all standard
> facets for transcoding Unicode that are based on std::codecvt. Surrogate
> code points should always be treated as error. On the other hand
> surrogate code units can only appear in UTF-16 and only when they come
> in a proper pair.
>
> Additionally, it fixes a bug in std::codecvt_utf16::in() when odd number
> of bytes were given in the range [from, from_end), error was returned
> always. The last byte in such range does not form a full UTF-16 code
> unit and we can not make any decisions for error, instead partial should
> be returned.
>
> The testsuite for testing these facets was updated in the following
> order:
>
> 1. All functions that test codecvts that work with UTF-8 were refactored
>    and made more generic so they accept codecvt that works with the char
>    type char8_t.
> 2. The same functions were updated with new test cases for transcoding
>    errors and now additionally test for surrogates, overlong UTF-8
>    sequences, code points out of the Unicode range, and more tests for
>    missing leading and trailing code units.
> 3. New tests were added to test codecvt_utf16 in both of its variants,
>    UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.
>
> libstdc++-v3/ChangeLog:
>
>         * src/c++11/codecvt.cc (read_utf8_code_point): Fix handling of
>         surrogates in UTF-8.
>         (ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-8.
>         (ucs4_in): Fix handling of range with odd number of bytes.
>         (ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-16.
>         (ucs2_out): Fix handling of surrogates in UCS-2 -> UTF-16.
>         (ucs2_in): Fix handling of range with odd number of bytes.
>         (__codecvt_utf16_base<char16_t>::do_in): Likewise.
>         (__codecvt_utf16_base<char32_t>::do_in): Likewise.
>         (__codecvt_utf16_base<wchar_t>::do_in): Likewise.
>         * testsuite/22_locale/codecvt/codecvt_unicode.cc: Renames, add
>         tests for codecvt_utf16<char16_t> and codecvt_utf16<char32_t>.
>         * testsuite/22_locale/codecvt/codecvt_unicode.h: Refactor UTF-8
>         testing functions for char8_t, add more test cases for errors,
>         add testing functions for codecvt_utf16.
>         * testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc:
>         Renames, add tests for codecvt_utf16<wchar_t>.
>         * testsuite/22_locale/codecvt/codecvt_utf16/79980.cc (test06):
>         Fix test.
>         * testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc: New test.

Thanks, your v2 patch was still on my TODO list. I've pushed this version to trunk now.
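
For anyone reading the archive later, the rules in the quoted description can be sketched roughly as below. This is only an illustration under my own assumptions, not the code in src/c++11/codecvt.cc: the names Result, is_surrogate and decode_utf16be are hypothetical, and the sketch handles big-endian input only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type, loosely mirroring std::codecvt_base::result.
enum class Result { ok, partial, error };

// Surrogate code points (U+D800..U+DFFF) are never valid on their own.
constexpr bool is_surrogate(char32_t c)
{
    return c >= 0xD800 && c <= 0xDFFF;
}

// Hypothetical helper: decode one code point from big-endian UTF-16
// bytes in [p, p + n). On Result::ok, the code point is stored in out
// and the number of bytes consumed in used. The caller is expected to
// stop calling once no bytes remain.
Result decode_utf16be(const unsigned char* p, std::size_t n,
                      char32_t& out, std::size_t& used)
{
    if (n < 2)
        return Result::partial;   // a lone trailing byte is not a full code unit

    const char32_t u1 = (char32_t(p[0]) << 8) | p[1];

    if (u1 >= 0xDC00 && u1 <= 0xDFFF)
        return Result::error;     // low surrogate without a preceding high one

    if (u1 >= 0xD800 && u1 <= 0xDBFF) {
        if (n < 4)
            return Result::partial;   // second code unit not fully available yet
        const char32_t u2 = (char32_t(p[2]) << 8) | p[3];
        if (u2 < 0xDC00 || u2 > 0xDFFF)
            return Result::error;     // high surrogate not followed by a low one
        out = 0x10000 + ((u1 - 0xD800) << 10) + (u2 - 0xDC00);
        used = 4;
        return Result::ok;
    }

    out = u1;                     // ordinary BMP code unit
    used = 2;
    return Result::ok;
}
```

With the three bytes 00 41 00, the first call yields U+0041 and consumes two bytes; a second call on the remaining single byte reports partial rather than error, which is the behaviour the patch description asks of codecvt_utf16::in().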
