https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86419

--- Comment #10 from Dimitrij Mijoski <dmjpp at hotmail dot com> ---
I was wrong in comment #9. The bug described in comment #7 and the fix
proposed there are correct.

While writing some tests for error handling, I discovered yet another bug in
UTF-8 decoding. See the example:

// Two code points; each is 4 bytes long in UTF-8.
const char u8in[] = u8"\U0010FFFF\U0010AAAA";
const char32_t u32in[] = U"\U0010FFFF\U0010AAAA";

void
utf8_to_utf32_in_error_7 (const codecvt<char32_t, char, mbstate_t> &cvt)
{
  char in[7] = {};
  char32_t out[3] = {};
  char_traits<char>::copy (in, u8in, 7);
  in[5] = 'z';
  // The last CP has two errors: its second code unit is malformed and its
  // last code unit is missing. Because the last CU is missing, the decoder
  // returns too early and reports the sequence as incomplete (partial).
  // It should report it as invalid (error) instead.

  auto state = mbstate_t{};
  auto in_next = (const char *) nullptr;
  auto out_next = (char32_t *) nullptr;
  auto res = codecvt_base::result ();

  res = cvt.in (state, in, in + 7, in_next, out, out + 3, out_next);
  VERIFY (res == cvt.error); // incorrectly returns partial
  VERIFY (in_next == in + 4);
  VERIFY (out_next == out + 1);
  VERIFY (out[0] == u32in[0] && out[1] == 0 && out[2] == 0);
}
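
To illustrate the ordering the test above expects, here is a minimal sketch of
a classifier that validates every code unit that is present before it reports
a truncated sequence as incomplete. This is not the libstdc++ code and not a
proposed patch; the names decode_status, classify_utf8 and
classify_utf8_examples are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for this sketch; it mirrors the three outcomes
// that matter here (codecvt_base::ok, partial, error).
enum class decode_status { ok, partial, error };

// Classify the UTF-8 sequence that starts at first. On ok, len receives the
// sequence length in bytes. Every byte that is present is checked before a
// short input is reported as partial.
inline decode_status
classify_utf8 (const unsigned char *first, const unsigned char *last,
               std::size_t &len)
{
  len = 0;
  if (first == last)
    return decode_status::partial; // nothing to inspect in this sketch

  const unsigned char lead = first[0];
  std::size_t need = 0;
  unsigned char lo = 0x80, hi = 0xBF; // allowed range of the 2nd byte

  if (lead < 0x80)
    need = 1;
  else if (lead >= 0xC2 && lead <= 0xDF)
    need = 2;
  else if (lead >= 0xE0 && lead <= 0xEF)
    {
      need = 3;
      if (lead == 0xE0) lo = 0xA0; // reject overlong forms
      if (lead == 0xED) hi = 0x9F; // reject surrogates
    }
  else if (lead >= 0xF0 && lead <= 0xF4)
    {
      need = 4;
      if (lead == 0xF0) lo = 0x90; // reject overlong forms
      if (lead == 0xF4) hi = 0x8F; // reject values above U+10FFFF
    }
  else
    return decode_status::error; // 0x80..0xC1 and 0xF5..0xFF never lead

  const std::size_t avail = static_cast<std::size_t> (last - first);
  const std::size_t n = avail < need ? avail : need;

  // Validate the trailing bytes we actually have.
  for (std::size_t i = 1; i < n; ++i)
    {
      const unsigned char min = (i == 1) ? lo : 0x80;
      const unsigned char max = (i == 1) ? hi : 0xBF;
      if (first[i] < min || first[i] > max)
        return decode_status::error;
    }

  // Only a well-formed prefix may be called incomplete.
  if (avail < need)
    return decode_status::partial;

  len = need;
  return decode_status::ok;
}

// The last three bytes of the test input above: 0xF4 'z' 0xAA.
inline void
classify_utf8_examples ()
{
  const unsigned char bad[] = { 0xF4, 'z', 0xAA };
  std::size_t len = 0;
  assert (classify_utf8 (bad, bad + 3, len) == decode_status::error);
}
```

With this ordering, the truncated but otherwise well-formed prefix
0xF4 0x8A 0xAA is still classified as partial, while 0xF4 'z' 0xAA is
classified as an error.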

I published the full testsuite on GitHub, licensed under GPL v3+ of course:
https://github.com/dimztimz/codecvt_test/blob/master/codecvt.cpp
I was thinking of sending a patch, but after this latest bug, the 4th, I see
that this needs more time. Maybe a testsuite from another library such as ICU
could be incorporated? In any case, I will pause my work on this.
