Re: [PATCH] Various fixes for facets

Jonathan Wakely Fri, 17 Mar 2017 12:29:43 -0700

On 16/03/17 15:23 +0000, Jonathan Wakely wrote:

On 14/03/17 18:46 +0000, Jonathan Wakely wrote:

On 13/03/17 19:35 +0000, Jonathan Wakely wrote:

This is a series of patches to fix various bugs in the Unicode
character conversion facets.


Ther first patch fixes a silly < versus <= bug that meant that 0xffff
got written as a surrogate pair instead of as simply 0xff, and an
endianness bug for the internal representation of UTF-16 code units
stored in char32_t or wchar_t values. That's PR 79511.

The second patch fixes some incorrect bitwise operations (because I
confused & and |) and some incorrect limits (because I confused max
and min). That fixes determining the endianness of the external
representation bytes when they start with a Byte OrderMark, and
correctly reports errors on invalid UCS2. It also fixes
wstring_convert so that it reports the number of characters that were
converted prior to an error. That's PR 79980.

The third patch fixes the output of the encoding() and max_length()
member functions on the codecvt facets, because I wasn't correctly
accounting for a BOM or for the differences between UTF-16 and UCS2.

I plan to commit these for all branches, but I'll wait until after GCC
7.1 is released, and fix it for 7.2 instead. These bugs aren't
important enough to rush into trunk now.


One more patch for a problem found by the libc++ testsuite. Now we
pass all the libc++ tests, and we even pass a test that libc++ fails.
With this, I hope our <codecvt> is 100% conforming. Just in time to be
deprecated for C++17 :-)


I've committed these to trunk, on the basis that they're intended to
be backported to all branches anyway (fixing features that are
currently broken in all branches). There's no point waiting if we plan
to commit them anyway, it would just mean doing an extra backport (5,
6, 7 *and* 8).

Backports will be done soon.


I backported all the recent <codecvt> fixes to gcc-6-branch and it was
failing one test, due to unaligned reads in std::codecvt_utf16. That
type reads UTF-16 data from a const char* (Why narrow characters when
we have char16_t? Because <codecvt> likes to be awkward) and I was
doing that by casting the const char* to const char16_t*. That isn't
safe when the first char isn't aligned correctly for a char16_t.

This patch fixes all the unaligned accesses by abstracting the
operations on the pointers to use new overlaoded operators on the
range<Elem> type. A new partial specialization range<Elem, false>
uses memcpy to read/write char16_t values from the char*, avoiding
alignment problems. The primary template (range<Elem, true>) just
dereferences the pointers directly.

Tested x86_64-linux, powerpc64le-linux, powerpc64-linux,
powerpc-ibm-aix7.2.0.0 (which has 2-byte wchar_t).

Also tested with ubsan to confirm the unaligned accesses are gone.

Committed to trunk, gcc-6-branch, gcc-5-branch.

commit 96ebc791ce1bd9cbba913d0b25b60ee4a09c41f1
Author: Jonathan Wakely <jwak...@redhat.com>
Date:   Fri Mar 17 13:00:00 2017 +0000

    Fix alignment bugs in std::codecvt_utf16
    
    	* src/c++11/codecvt.cc (range): Add non-type template parameter and
    	define oerloaded operators for reading and writing code units.
    	(range<Elem, false>): Define partial specialization for accessing
    	wide characters in potentially unaligned byte ranges.
    	(ucs2_span(const char16_t*, const char16_t*, ...))
    	(ucs4_span(const char16_t*, const char16_t*, ...)): Change parameters
    	to range<const char16_t, false> in order to avoid unaligned reads.
    	(__codecvt_utf16_base<char16_t>::do_out)
    	(__codecvt_utf16_base<char32_t>::do_out)
    	(__codecvt_utf16_base<wchar_t>::do_out): Use range specialization for
    	unaligned data to avoid unaligned writes.
    	(__codecvt_utf16_base<char16_t>::do_in)
    	(__codecvt_utf16_base<char32_t>::do_in)
    	(__codecvt_utf16_base<wchar_t>::do_in): Likewise for writes. Return
    	error if there are unprocessable trailing bytes.
    	(__codecvt_utf16_base<char16_t>::do_length)
    	(__codecvt_utf16_base<char32_t>::do_length)
    	(__codecvt_utf16_base<wchar_t>::do_length): Pass arguments of type
    	range<const char16_t, false> to span functions.
    	* testsuite/22_locale/codecvt/codecvt_utf16/misaligned.cc: New test.

diff --git a/libstdc++-v3/src/c++11/codecvt.cc b/libstdc++-v3/src/c++11/codecvt.cc
index 02866ef..1187339 100644
--- a/libstdc++-v3/src/c++11/codecvt.cc
+++ b/libstdc++-v3/src/c++11/codecvt.cc
@@ -57,17 +57,104 @@ namespace
   const char32_t incomplete_mb_character = char32_t(-2);
   const char32_t invalid_mb_sequence = char32_t(-1);
 
-  template<typename Elem>
+  // Utility type for reading and writing code units of type Elem from
+  // a range defined by a pair of pointers.
+  template<typename Elem, bool Aligned = true>
     struct range
     {
       Elem* next;
       Elem* end;
 
+      // Write a code unit.
+      range& operator=(Elem e)
+      {
+	*next++ = e;
+	return *this;
+      }
+
+      // Read the next code unit.
       Elem operator*() const { return *next; }
 
-      range& operator++() { ++next; return *this; }
+      // Read the Nth code unit.
+      Elem operator[](size_t n) const { return next[n]; }
 
+      // Move to the next code unit.
+      range& operator++()
+      {
+	++next;
+	return *this;
+      }
+
+      // Move to the Nth code unit.
+      range& operator+=(size_t n)
+      {
+	next += n;
+	return *this;
+      }
+
+      // The number of code units remaining.
       size_t size() const { return end - next; }
+
+      // The number of bytes remaining.
+      size_t nbytes() const { return (const char*)end - (const char*)next; }
+    };
+
+  // This specialization is used when accessing char16_t values through
+  // pointers to char, which might not be correctly aligned for char16_t.
+  template<typename Elem>
+    struct range<Elem, false>
+    {
+      using value_type = typename remove_const<Elem>::type;
+
+      using char_pointer = typename
+	conditional<is_const<Elem>::value, const char*, char*>::type;
+
+      char_pointer next;
+      char_pointer end;
+
+      // Write a code unit.
+      range& operator=(Elem e)
+      {
+	memcpy(next, &e, sizeof(Elem));
+	++*this;
+	return *this;
+      }
+
+      // Read the next code unit.
+      Elem operator*() const
+      {
+	value_type e;
+	memcpy(&e, next, sizeof(Elem));
+	return e;
+      }
+
+      // Read the Nth code unit.
+      Elem operator[](size_t n) const
+      {
+	value_type e;
+	memcpy(&e, next + n * sizeof(Elem), sizeof(Elem));
+	return e;
+      }
+
+      // Move to the next code unit.
+      range& operator++()
+      {
+	next += sizeof(Elem);
+	return *this;
+      }
+
+      // Move to the Nth code unit.
+      range& operator+=(size_t n)
+      {
+	next += n * sizeof(Elem);
+	return *this;
+      }
+
+      // The number of code units remaining.
+      size_t size() const { return nbytes() / sizeof(Elem); }
+
+      // The number of bytes remaining.
+      size_t nbytes() const { return end - next; }
     };
 
   // Multibyte sequences can have "header" consisting of Byte Order Mark
@@ -75,17 +162,37 @@ namespace
   const unsigned char utf16_bom[2] = { 0xFE, 0xFF };
   const unsigned char utf16le_bom[2] = { 0xFF, 0xFE };
 
-  template<size_t N>
-    inline bool
-    write_bom(range<char>& to, const unsigned char (&bom)[N])
+  // Write a BOM (space permitting).
+  template<typename C, bool A, size_t N>
+    bool
+    write_bom(range<C, A>& to, const unsigned char (&bom)[N])
     {
-      if (to.size() < N)
+      static_assert( (N / sizeof(C)) != 0, "" );
+      static_assert( (N % sizeof(C)) == 0, "" );
+
+      if (to.nbytes() < N)
 	return false;
       memcpy(to.next, bom, N);
-      to.next += N;
+      to += (N / sizeof(C));
       return true;
     }
 
+  // Try to read a BOM.
+  template<typename C, bool A, size_t N>
+    bool
+    read_bom(range<C, A>& from, const unsigned char (&bom)[N])
+    {
+      static_assert( (N / sizeof(C)) != 0, "" );
+      static_assert( (N % sizeof(C)) == 0, "" );
+
+      if (from.nbytes() >= N && !memcmp(from.next, bom, N))
+	{
+	  from += (N / sizeof(C));
+	  return true;
+	}
+      return false;
+    }
+
   // If generate_header is set in mode write out UTF-8 BOM.
   bool
   write_utf8_bom(range<char>& to, codecvt_mode mode)
@@ -97,32 +204,20 @@ namespace
 
   // If generate_header is set in mode write out the UTF-16 BOM indicated
   // by whether little_endian is set in mode.
+  template<bool Aligned>
   bool
-  write_utf16_bom(range<char16_t>& to, codecvt_mode mode)
+  write_utf16_bom(range<char16_t, Aligned>& to, codecvt_mode mode)
   {
     if (mode & generate_header)
     {
-      if (!to.size())
-	return false;
-      auto* bom = (mode & little_endian) ? utf16le_bom : utf16_bom;
-      std::memcpy(to.next, bom, 2);
-      ++to.next;
+      if (mode & little_endian)
+	return write_bom(to, utf16le_bom);
+      else
+	return write_bom(to, utf16_bom);
     }
     return true;
   }
 
-  template<size_t N>
-    inline bool
-    read_bom(range<const char>& from, const unsigned char (&bom)[N])
-    {
-      if (from.size() >= N && !memcmp(from.next, bom, N))
-	{
-	  from.next += N;
-	  return true;
-	}
-      return false;
-    }
-
   // If consume_header is set in mode update from.next to after any BOM.
   void
   read_utf8_bom(range<const char>& from, codecvt_mode mode)
@@ -135,21 +230,16 @@ namespace
   // Otherwise, if *from.next is a UTF-16 BOM increment from.next and then:
   // - if the UTF-16BE BOM was found unset little_endian in mode, or
   // - if the UTF-16LE BOM was found set little_endian in mode.
+  template<bool Aligned>
   void
-  read_utf16_bom(range<const char16_t>& from, codecvt_mode& mode)
+  read_utf16_bom(range<const char16_t, Aligned>& from, codecvt_mode& mode)
   {
-    if (mode & consume_header && from.size())
+    if (mode & consume_header)
       {
-	if (!memcmp(from.next, utf16_bom, 2))
-	  {
-	    ++from.next;
-	    mode &= ~little_endian;
-	  }
-	else if (!memcmp(from.next, utf16le_bom, 2))
-	  {
-	    ++from.next;
-	    mode |= little_endian;
-	  }
+	if (read_bom(from, utf16_bom))
+	  mode &= ~little_endian;
+	else if (read_bom(from, utf16le_bom))
+	  mode |= little_endian;
       }
   }
 
@@ -162,11 +252,11 @@ namespace
     const size_t avail = from.size();
     if (avail == 0)
       return incomplete_mb_character;
-    unsigned char c1 = from.next[0];
+    unsigned char c1 = from[0];
     // https://en.wikipedia.org/wiki/UTF-8#Sample_code
     if (c1 < 0x80)
     {
-      ++from.next;
+      ++from;
       return c1;
     }
     else if (c1 < 0xC2) // continuation or overlong 2-byte sequence
@@ -175,51 +265,51 @@ namespace
     {
       if (avail < 2)
 	return incomplete_mb_character;
-      unsigned char c2 = from.next[1];
+      unsigned char c2 = from[1];
       if ((c2 & 0xC0) != 0x80)
 	return invalid_mb_sequence;
       char32_t c = (c1 << 6) + c2 - 0x3080;
       if (c <= maxcode)
-	from.next += 2;
+	from += 2;
       return c;
     }
     else if (c1 < 0xF0) // 3-byte sequence
     {
       if (avail < 3)
 	return incomplete_mb_character;
-      unsigned char c2 = from.next[1];
+      unsigned char c2 = from[1];
       if ((c2 & 0xC0) != 0x80)
 	return invalid_mb_sequence;
       if (c1 == 0xE0 && c2 < 0xA0) // overlong
 	return invalid_mb_sequence;
-      unsigned char c3 = from.next[2];
+      unsigned char c3 = from[2];
       if ((c3 & 0xC0) != 0x80)
 	return invalid_mb_sequence;
       char32_t c = (c1 << 12) + (c2 << 6) + c3 - 0xE2080;
       if (c <= maxcode)
-	from.next += 3;
+	from += 3;
       return c;
     }
     else if (c1 < 0xF5) // 4-byte sequence
     {
       if (avail < 4)
 	return incomplete_mb_character;
-      unsigned char c2 = from.next[1];
+      unsigned char c2 = from[1];
       if ((c2 & 0xC0) != 0x80)
 	return invalid_mb_sequence;
       if (c1 == 0xF0 && c2 < 0x90) // overlong
 	return invalid_mb_sequence;
       if (c1 == 0xF4 && c2 >= 0x90) // > U+10FFFF
       return invalid_mb_sequence;
-      unsigned char c3 = from.next[2];
+      unsigned char c3 = from[2];
       if ((c3 & 0xC0) != 0x80)
 	return invalid_mb_sequence;
-      unsigned char c4 = from.next[3];
+      unsigned char c4 = from[3];
       if ((c4 & 0xC0) != 0x80)
 	return invalid_mb_sequence;
       char32_t c = (c1 << 18) + (c2 << 12) + (c3 << 6) + c4 - 0x3C82080;
       if (c <= maxcode)
-	from.next += 4;
+	from += 4;
       return c;
     }
     else // > U+10FFFF
@@ -233,31 +323,31 @@ namespace
       {
 	if (to.size() < 1)
 	  return false;
-	*to.next++ = code_point;
+	to = code_point;
       }
     else if (code_point <= 0x7FF)
       {
 	if (to.size() < 2)
 	  return false;
-	*to.next++ = (code_point >> 6) + 0xC0;
-	*to.next++ = (code_point & 0x3F) + 0x80;
+	to = (code_point >> 6) + 0xC0;
+	to = (code_point & 0x3F) + 0x80;
       }
     else if (code_point <= 0xFFFF)
       {
 	if (to.size() < 3)
 	  return false;
-	*to.next++ = (code_point >> 12) + 0xE0;
-	*to.next++ = ((code_point >> 6) & 0x3F) + 0x80;
-	*to.next++ = (code_point & 0x3F) + 0x80;
+	to = (code_point >> 12) + 0xE0;
+	to = ((code_point >> 6) & 0x3F) + 0x80;
+	to = (code_point & 0x3F) + 0x80;
       }
     else if (code_point <= 0x10FFFF)
       {
 	if (to.size() < 4)
 	  return false;
-	*to.next++ = (code_point >> 18) + 0xF0;
-	*to.next++ = ((code_point >> 12) & 0x3F) + 0x80;
-	*to.next++ = ((code_point >> 6) & 0x3F) + 0x80;
-	*to.next++ = (code_point & 0x3F) + 0x80;
+	to = (code_point >> 18) + 0xF0;
+	to = ((code_point >> 12) & 0x3F) + 0x80;
+	to = ((code_point >> 6) & 0x3F) + 0x80;
+	to = (code_point & 0x3F) + 0x80;
       }
     else
       return false;
@@ -298,38 +388,39 @@ namespace
   // The sequence's endianness is indicated by (mode & little_endian).
   // Updates from.next if the codepoint is not greater than maxcode.
   // Returns invalid_mb_sequence, incomplete_mb_character or the code point.
-  char32_t
-  read_utf16_code_point(range<const char16_t>& from, unsigned long maxcode,
-			codecvt_mode mode)
-  {
-    const size_t avail = from.size();
-    if (avail == 0)
-      return incomplete_mb_character;
-    int inc = 1;
-    char32_t c = adjust_byte_order(from.next[0], mode);
-    if (is_high_surrogate(c))
-      {
-	if (avail < 2)
-	  return incomplete_mb_character;
-	const char16_t c2 = adjust_byte_order(from.next[1], mode);
-	if (is_low_surrogate(c2))
-	  {
-	    c = surrogate_pair_to_code_point(c, c2);
-	    inc = 2;
-	  }
-	else
-	  return invalid_mb_sequence;
-      }
-    else if (is_low_surrogate(c))
-      return invalid_mb_sequence;
-    if (c <= maxcode)
-      from.next += inc;
-    return c;
-  }
+  template<bool Aligned>
+    char32_t
+    read_utf16_code_point(range<const char16_t, Aligned>& from,
+			  unsigned long maxcode, codecvt_mode mode)
+    {
+      const size_t avail = from.size();
+      if (avail == 0)
+	return incomplete_mb_character;
+      int inc = 1;
+      char32_t c = adjust_byte_order(from[0], mode);
+      if (is_high_surrogate(c))
+	{
+	  if (avail < 2)
+	    return incomplete_mb_character;
+	  const char16_t c2 = adjust_byte_order(from[1], mode);
+	  if (is_low_surrogate(c2))
+	    {
+	      c = surrogate_pair_to_code_point(c, c2);
+	      inc = 2;
+	    }
+	  else
+	    return invalid_mb_sequence;
+	}
+      else if (is_low_surrogate(c))
+	return invalid_mb_sequence;
+      if (c <= maxcode)
+	from += inc;
+      return c;
+    }
 
-  template<typename C>
+  template<typename C, bool A>
   bool
-  write_utf16_code_point(range<C>& to, char32_t codepoint, codecvt_mode mode)
+  write_utf16_code_point(range<C, A>& to, char32_t codepoint, codecvt_mode mode)
   {
     static_assert(sizeof(C) >= 2, "a code unit must be at least 16-bit");
 
@@ -337,8 +428,7 @@ namespace
       {
 	if (to.size() > 0)
 	  {
-	    *to.next = adjust_byte_order(codepoint, mode);
-	    ++to.next;
+	    to = adjust_byte_order(codepoint, mode);
 	    return true;
 	  }
       }
@@ -348,9 +438,8 @@ namespace
 	const char32_t LEAD_OFFSET = 0xD800 - (0x10000 >> 10);
 	char16_t lead = LEAD_OFFSET + (codepoint >> 10);
 	char16_t trail = 0xDC00 + (codepoint & 0x3FF);
-	to.next[0] = adjust_byte_order(lead, mode);
-	to.next[1] = adjust_byte_order(trail, mode);
-	to.next += 2;
+	to = adjust_byte_order(lead, mode);
+	to = adjust_byte_order(trail, mode);
 	return true;
       }
     return false;
@@ -369,7 +458,7 @@ namespace
 	  return codecvt_base::partial;
 	if (codepoint > maxcode)
 	  return codecvt_base::error;
-	*to.next++ = codepoint;
+	to = codepoint;
       }
     return from.size() ? codecvt_base::partial : codecvt_base::ok;
   }
@@ -383,19 +472,19 @@ namespace
       return codecvt_base::partial;
     while (from.size())
       {
-	const char32_t c = from.next[0];
+	const char32_t c = from[0];
 	if (c > maxcode)
 	  return codecvt_base::error;
 	if (!write_utf8_code_point(to, c))
 	  return codecvt_base::partial;
-	++from.next;
+	++from;
       }
     return codecvt_base::ok;
   }
 
   // utf16 -> ucs4
   codecvt_base::result
-  ucs4_in(range<const char16_t>& from, range<char32_t>& to,
+  ucs4_in(range<const char16_t, false>& from, range<char32_t>& to,
           unsigned long maxcode = max_code_point, codecvt_mode mode = {})
   {
     read_utf16_bom(from, mode);
@@ -406,26 +495,26 @@ namespace
 	  return codecvt_base::partial;
 	if (codepoint > maxcode)
 	  return codecvt_base::error;
-	*to.next++ = codepoint;
+	to = codepoint;
       }
     return from.size() ? codecvt_base::partial : codecvt_base::ok;
   }
 
   // ucs4 -> utf16
   codecvt_base::result
-  ucs4_out(range<const char32_t>& from, range<char16_t>& to,
+  ucs4_out(range<const char32_t>& from, range<char16_t, false>& to,
            unsigned long maxcode = max_code_point, codecvt_mode mode = {})
   {
     if (!write_utf16_bom(to, mode))
       return codecvt_base::partial;
     while (from.size())
       {
-	const char32_t c = from.next[0];
+	const char32_t c = from[0];
 	if (c > maxcode)
 	  return codecvt_base::error;
 	if (!write_utf16_code_point(to, c, mode))
 	  return codecvt_base::partial;
-	++from.next;
+	++from;
       }
     return codecvt_base::ok;
   }
@@ -443,7 +532,7 @@ namespace
     read_utf8_bom(from, mode);
     while (from.size() && to.size())
       {
-	const char* const first = from.next;
+	auto orig = from;
 	const char32_t codepoint = read_utf8_code_point(from, maxcode);
 	if (codepoint == incomplete_mb_character)
 	  {
@@ -456,7 +545,7 @@ namespace
 	  return codecvt_base::error;
 	if (!write_utf16_code_point(to, codepoint, mode))
 	  {
-	    from.next = first;
+	    from = orig; // rewind to previous position
 	    return codecvt_base::partial;
 	  }
       }
@@ -474,7 +563,7 @@ namespace
       return codecvt_base::partial;
     while (from.size())
       {
-	char32_t c = from.next[0];
+	char32_t c = from[0];
 	int inc = 1;
 	if (is_high_surrogate(c))
 	  {
@@ -484,7 +573,7 @@ namespace
 	    if (from.size() < 2)
 	      return codecvt_base::ok; // stop converting at this point
 
-	    const char32_t c2 = from.next[1];
+	    const char32_t c2 = from[1];
 	    if (is_low_surrogate(c2))
 	      {
 		c = surrogate_pair_to_code_point(c, c2);
@@ -499,7 +588,7 @@ namespace
 	  return codecvt_base::error;
 	if (!write_utf8_code_point(to, c))
 	  return codecvt_base::partial;
-	from.next += inc;
+	from += inc;
       }
     return codecvt_base::ok;
   }
@@ -548,27 +637,27 @@ namespace
 
   // ucs2 -> utf16
   codecvt_base::result
-  ucs2_out(range<const char16_t>& from, range<char16_t>& to,
+  ucs2_out(range<const char16_t>& from, range<char16_t, false>& to,
 	   char32_t maxcode = max_code_point, codecvt_mode mode = {})
   {
     if (!write_utf16_bom(to, mode))
       return codecvt_base::partial;
     while (from.size() && to.size())
       {
-	char16_t c = from.next[0];
+	char16_t c = from[0];
 	if (is_high_surrogate(c))
 	  return codecvt_base::error;
 	if (c > maxcode)
 	  return codecvt_base::error;
-	*to.next++ = adjust_byte_order(c, mode);
-	++from.next;
+	to = adjust_byte_order(c, mode);
+	++from;
       }
     return from.size() == 0 ? codecvt_base::ok : codecvt_base::partial;
   }
 
   // utf16 -> ucs2
   codecvt_base::result
-  ucs2_in(range<const char16_t>& from, range<char16_t>& to,
+  ucs2_in(range<const char16_t, false>& from, range<char16_t>& to,
 	  char32_t maxcode = max_code_point, codecvt_mode mode = {})
   {
     read_utf16_bom(from, mode);
@@ -581,23 +670,22 @@ namespace
 	  return codecvt_base::error; // UCS-2 only supports single units.
 	if (c > maxcode)
 	  return codecvt_base::error;
-	*to.next++ = c;
+	to = c;
       }
     return from.size() == 0 ? codecvt_base::ok : codecvt_base::partial;
   }
 
   const char16_t*
-  ucs2_span(const char16_t* begin, const char16_t* end, size_t max,
+  ucs2_span(range<const char16_t, false>& from, size_t max,
             char32_t maxcode, codecvt_mode mode)
   {
-    range<const char16_t> from{ begin, end };
     read_utf16_bom(from, mode);
     // UCS-2 only supports characters in the BMP, i.e. one UTF-16 code unit:
     maxcode = std::min(max_single_utf16_unit, maxcode);
     char32_t c = 0;
     while (max-- && c <= maxcode)
       c = read_utf16_code_point(from, maxcode, mode);
-    return from.next;
+    return reinterpret_cast<const char16_t*>(from.next);
   }
 
   const char*
@@ -629,15 +717,14 @@ namespace
 
   // return pos such that [begin,pos) is valid UCS-4 string no longer than max
   const char16_t*
-  ucs4_span(const char16_t* begin, const char16_t* end, size_t max,
+  ucs4_span(range<const char16_t, false>& from, size_t max,
             char32_t maxcode = max_code_point, codecvt_mode mode = {})
   {
-    range<const char16_t> from{ begin, end };
     read_utf16_bom(from, mode);
     char32_t c = 0;
     while (max-- && c <= maxcode)
       c = read_utf16_code_point(from, maxcode, mode);
-    return from.next;
+    return reinterpret_cast<const char16_t*>(from.next);
   }
 }
 
@@ -937,6 +1024,13 @@ __codecvt_utf8_base<char32_t>::do_max_length() const throw()
 }
 
 #ifdef _GLIBCXX_USE_WCHAR_T
+
+#if __SIZEOF_WCHAR_T__ == 2
+static_assert(sizeof(wchar_t) == sizeof(char16_t), "");
+#elif __SIZEOF_WCHAR_T__ == 4
+static_assert(sizeof(wchar_t) == sizeof(char32_t), "");
+#endif
+
 // Define members of codecvt_utf8<wchar_t> base class implementation.
 // Converts from UTF-8 to UCS-2 or UCS-4 depending on sizeof(wchar_t).
 
@@ -1057,10 +1151,7 @@ do_out(state_type&, const intern_type* __from, const intern_type* __from_end,
        extern_type*& __to_next) const
 {
   range<const char16_t> from{ __from, __from_end };
-  range<char16_t> to{
-    reinterpret_cast<char16_t*>(__to),
-    reinterpret_cast<char16_t*>(__to_end)
-  };
+  range<char16_t, false> to{ __to, __to_end };
   auto res = ucs2_out(from, to, _M_maxcode, _M_mode);
   __from_next = from.next;
   __to_next = reinterpret_cast<char*>(to.next);
@@ -1083,14 +1174,13 @@ do_in(state_type&, const extern_type* __from, const extern_type* __from_end,
       intern_type* __to, intern_type* __to_end,
       intern_type*& __to_next) const
 {
-  range<const char16_t> from{
-    reinterpret_cast<const char16_t*>(__from),
-    reinterpret_cast<const char16_t*>(__from_end)
-  };
+  range<const char16_t, false> from{ __from, __from_end };
   range<char16_t> to{ __to, __to_end };
   auto res = ucs2_in(from, to, _M_maxcode, _M_mode);
   __from_next = reinterpret_cast<const char*>(from.next);
   __to_next = to.next;
+  if (res == codecvt_base::ok && __from_next != __from_end)
+    res = codecvt_base::error;
   return res;
 }
 
@@ -1107,9 +1197,8 @@ __codecvt_utf16_base<char16_t>::
 do_length(state_type&, const extern_type* __from,
 	  const extern_type* __end, size_t __max) const
 {
-  auto next = reinterpret_cast<const char16_t*>(__from);
-  next = ucs2_span(next, reinterpret_cast<const char16_t*>(__end), __max,
-		   _M_maxcode, _M_mode);
+  range<const char16_t, false> from{ __from, __end };
+  const char16_t* next = ucs2_span(from, __max, _M_maxcode, _M_mode);
   return reinterpret_cast<const char*>(next) - __from;
 }
 
@@ -1137,10 +1226,7 @@ do_out(state_type&, const intern_type* __from, const intern_type* __from_end,
        extern_type*& __to_next) const
 {
   range<const char32_t> from{ __from, __from_end };
-  range<char16_t> to{
-    reinterpret_cast<char16_t*>(__to),
-    reinterpret_cast<char16_t*>(__to_end)
-  };
+  range<char16_t, false> to{ __to, __to_end };
   auto res = ucs4_out(from, to, _M_maxcode, _M_mode);
   __from_next = from.next;
   __to_next = reinterpret_cast<char*>(to.next);
@@ -1163,14 +1249,13 @@ do_in(state_type&, const extern_type* __from, const extern_type* __from_end,
       intern_type* __to, intern_type* __to_end,
       intern_type*& __to_next) const
 {
-  range<const char16_t> from{
-    reinterpret_cast<const char16_t*>(__from),
-    reinterpret_cast<const char16_t*>(__from_end)
-  };
+  range<const char16_t, false> from{ __from, __from_end };
   range<char32_t> to{ __to, __to_end };
   auto res = ucs4_in(from, to, _M_maxcode, _M_mode);
   __from_next = reinterpret_cast<const char*>(from.next);
   __to_next = to.next;
+  if (res == codecvt_base::ok && __from_next != __from_end)
+    res = codecvt_base::error;
   return res;
 }
 
@@ -1187,9 +1272,8 @@ __codecvt_utf16_base<char32_t>::
 do_length(state_type&, const extern_type* __from,
 	  const extern_type* __end, size_t __max) const
 {
-  auto next = reinterpret_cast<const char16_t*>(__from);
-  next = ucs4_span(next, reinterpret_cast<const char16_t*>(__end), __max,
-		   _M_maxcode, _M_mode);
+  range<const char16_t, false> from{ __from, __end };
+  const char16_t* next = ucs4_span(from, __max, _M_maxcode, _M_mode);
   return reinterpret_cast<const char*>(next) - __from;
 }
 
@@ -1217,20 +1301,17 @@ do_out(state_type&, const intern_type* __from, const intern_type* __from_end,
        extern_type* __to, extern_type* __to_end,
        extern_type*& __to_next) const
 {
-  range<char16_t> to{
-    reinterpret_cast<char16_t*>(__to),
-    reinterpret_cast<char16_t*>(__to_end)
-  };
+  range<char16_t, false> to{ __to, __to_end };
 #if __SIZEOF_WCHAR_T__ == 2
   range<const char16_t> from{
     reinterpret_cast<const char16_t*>(__from),
-    reinterpret_cast<const char16_t*>(__from_end)
+    reinterpret_cast<const char16_t*>(__from_end),
   };
   auto res = ucs2_out(from, to, _M_maxcode, _M_mode);
 #elif __SIZEOF_WCHAR_T__ == 4
   range<const char32_t> from{
     reinterpret_cast<const char32_t*>(__from),
-    reinterpret_cast<const char32_t*>(__from_end)
+    reinterpret_cast<const char32_t*>(__from_end),
   };
   auto res = ucs4_out(from, to, _M_maxcode, _M_mode);
 #else
@@ -1257,20 +1338,17 @@ do_in(state_type&, const extern_type* __from, const extern_type* __from_end,
       intern_type* __to, intern_type* __to_end,
       intern_type*& __to_next) const
 {
-  range<const char16_t> from{
-    reinterpret_cast<const char16_t*>(__from),
-    reinterpret_cast<const char16_t*>(__from_end)
-  };
+  range<const char16_t, false> from{ __from, __from_end };
 #if __SIZEOF_WCHAR_T__ == 2
   range<char16_t> to{
     reinterpret_cast<char16_t*>(__to),
-    reinterpret_cast<char16_t*>(__to_end)
+    reinterpret_cast<char16_t*>(__to_end),
   };
   auto res = ucs2_in(from, to, _M_maxcode, _M_mode);
 #elif __SIZEOF_WCHAR_T__ == 4
   range<char32_t> to{
     reinterpret_cast<char32_t*>(__to),
-    reinterpret_cast<char32_t*>(__to_end)
+    reinterpret_cast<char32_t*>(__to_end),
   };
   auto res = ucs4_in(from, to, _M_maxcode, _M_mode);
 #else
@@ -1278,6 +1356,8 @@ do_in(state_type&, const extern_type* __from, const extern_type* __from_end,
 #endif
   __from_next = reinterpret_cast<const char*>(from.next);
   __to_next = reinterpret_cast<wchar_t*>(to.next);
+  if (res == codecvt_base::ok && __from_next != __from_end)
+    res = codecvt_base::error;
   return res;
 }
 
@@ -1294,13 +1374,11 @@ __codecvt_utf16_base<wchar_t>::
 do_length(state_type&, const extern_type* __from,
 	  const extern_type* __end, size_t __max) const
 {
-  auto next = reinterpret_cast<const char16_t*>(__from);
+  range<const char16_t, false> from{ __from, __end };
 #if __SIZEOF_WCHAR_T__ == 2
-  next = ucs2_span(next, reinterpret_cast<const char16_t*>(__end), __max,
-		   _M_maxcode, _M_mode);
+  const char16_t* next = ucs2_span(from, __max, _M_maxcode, _M_mode);
 #elif __SIZEOF_WCHAR_T__ == 4
-  next = ucs4_span(next, reinterpret_cast<const char16_t*>(__end), __max,
-		   _M_maxcode, _M_mode);
+  const char16_t* next = ucs4_span(from, __max, _M_maxcode, _M_mode);
 #endif
   return reinterpret_cast<const char*>(next) - __from;
 }
diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/79980.cc b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/79980.cc
index 9383818..d8b9729 100644
--- a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/79980.cc
+++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/79980.cc
@@ -103,6 +103,31 @@ test07()
   VERIFY( conv.converted() == 5 );
 }
 
+void
+test08()
+{
+  // Read/write UTF-16 code units from data not correctly aligned for char16_t
+  Conv<char16_t, 0x10FFFF, std::generate_header> conv;
+  const char src[] = "-\xFE\xFF\0\x61\xAB\xCD";
+  auto out = conv.from_bytes(src + 1, src + 7);
+  VERIFY( out[0] == 0x0061 );
+  VERIFY( out[1] == 0xabcd );
+  auto bytes = conv.to_bytes(out);
+  VERIFY( bytes == std::string(src + 1, 6) );
+}
+
+void
+test09()
+{
+  // Read/write UTF-16 code units from data not correctly aligned for char16_t
+  Conv<char32_t, 0x10FFFF, std::generate_header> conv;
+  const char src[] = "-\xFE\xFF\xD8\x08\xDF\x45";
+  auto out = conv.from_bytes(src + 1, src + 7);
+  VERIFY( out == U"\U00012345" );
+  auto bytes = conv.to_bytes(out);
+  VERIFY( bytes == std::string(src + 1, 6) );
+}
+
 int main()
 {
   test01();
@@ -112,4 +137,6 @@ int main()
   test05();
   test06();
   test07();
+  test08();
+  test09();
 }
diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/misaligned.cc b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/misaligned.cc
new file mode 100644
index 0000000..0179c18
--- /dev/null
+++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/misaligned.cc
@@ -0,0 +1,289 @@
+// Copyright (C) 2017 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// { dg-do run { target c++11 } }
+
+#include <locale>
+#include <codecvt>
+#include <testsuite_hooks.h>
+
+using std::codecvt_base;
+using std::codecvt_mode;
+using std::codecvt_utf16;
+using std::wstring_convert;
+using std::mbstate_t;
+
+constexpr codecvt_mode
+operator|(codecvt_mode m1, codecvt_mode m2)
+{
+  using underlying = std::underlying_type<codecvt_mode>::type;
+  return static_cast<codecvt_mode>(static_cast<underlying>(m1) | m2);
+}
+
+// Read/write UTF-16 code units from data not correctly aligned for char16_t
+
+void
+test01()
+{
+  mbstate_t st;
+  constexpr codecvt_mode m = std::consume_header|std::generate_header;
+  codecvt_utf16<char16_t, 0x10FFFF, m> conv;
+  const char src[] = "-\xFE\xFF\0\x61\xAB\xCD";
+  const char* const src_end = src + 7;
+
+  int len = conv.length(st, src + 1, src_end, 1);
+  VERIFY( len == 4 );
+  len = conv.length(st, src + 1, src_end, 2);
+  VERIFY( len == 6 );
+
+  char16_t dst[2];
+  char16_t* const dst_end = dst + 2;
+  char16_t* dst_next;
+  const char* src_cnext;
+  auto res = conv.in(st, src + 1, src_end, src_cnext, dst, dst_end, dst_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( dst[0] == 0x0061 );
+  VERIFY( dst[1] == 0xabcd );
+  VERIFY( src_cnext == src_end );
+  VERIFY( dst_next == dst_end );
+
+  char out[sizeof(src)] = { src[0] };
+  char* const out_end = out + 7;
+  char* out_next;
+  const char16_t* dst_cnext;
+  res = conv.out(st, dst, dst_end, dst_cnext, out + 1, out_end, out_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( out_next == out_end );
+  VERIFY( dst_cnext == dst_end );
+  VERIFY( out[1] == src[1] );
+  VERIFY( out[2] == src[2] );
+  VERIFY( out[3] == src[3] );
+  VERIFY( out[4] == src[4] );
+  VERIFY( out[5] == src[5] );
+  VERIFY( out[6] == src[6] );
+
+  codecvt_utf16<char16_t, 0x10FFFF, m|std::little_endian> conv_le;
+
+  len = conv_le.length(st, src + 1, src_end, 1);
+  VERIFY( len == 4 );
+  len = conv_le.length(st, src + 1, src_end, 2);
+  VERIFY( len == 6 );
+
+  res = conv_le.in(st, src + 1, src_end, src_cnext, dst, dst_end, dst_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( dst[0] == 0x0061 );
+  VERIFY( dst[1] == 0xabcd );
+  VERIFY( src_cnext == src_end );
+  VERIFY( dst_next == dst_end );
+
+  res = conv_le.out(st, dst, dst_end, dst_cnext, out + 1, out_end, out_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( out_next == out_end );
+  VERIFY( dst_cnext == dst_end );
+  VERIFY( out[1] == src[2] );
+  VERIFY( out[2] == src[1] );
+  VERIFY( out[3] == src[4] );
+  VERIFY( out[4] == src[3] );
+  VERIFY( out[5] == src[6] );
+  VERIFY( out[6] == src[5] );
+}
+
+void
+test02()
+{
+  mbstate_t st;
+  constexpr codecvt_mode m = std::consume_header|std::generate_header;
+  codecvt_utf16<char32_t, 0x10FFFF, m> conv;
+  const char src[] = "-\xFE\xFF\0\x61\xAB\xCD\xD8\x08\xDF\x45";
+  const char* const src_end = src + 11;
+
+  int len = conv.length(st, src + 1, src_end, 1);
+  VERIFY( len == 4 );
+  len = conv.length(st, src + 1, src_end, 2);
+  VERIFY( len == 6 );
+  len = conv.length(st, src + 1, src_end, -1ul);
+  VERIFY( len == 10 );
+
+  char32_t dst[3];
+  char32_t* const dst_end = dst + 3;
+  char32_t* dst_next;
+  const char* src_cnext;
+  auto res = conv.in(st, src + 1, src_end, src_cnext, dst, dst_end, dst_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( dst[0] == 0x0061 );
+  VERIFY( dst[1] == 0xabcd );
+  VERIFY( dst[2] == 0x012345 );
+  VERIFY( src_cnext == src_end );
+  VERIFY( dst_next == dst_end );
+
+  char out[sizeof(src)] = { src[0] };
+  char* const out_end = out + 11;
+  char* out_next;
+  const char32_t* dst_cnext;
+  res = conv.out(st, dst, dst_end, dst_cnext, out + 1, out_end, out_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( out_next == out_end );
+  VERIFY( dst_cnext == dst_end );
+  VERIFY( out[1] == src[1] );
+  VERIFY( out[2] == src[2] );
+  VERIFY( out[3] == src[3] );
+  VERIFY( out[4] == src[4] );
+  VERIFY( out[5] == src[5] );
+  VERIFY( out[6] == src[6] );
+  VERIFY( out[7] == src[7] );
+  VERIFY( out[8] == src[8] );
+  VERIFY( out[9] == src[9] );
+  VERIFY( out[10] == src[10] );
+
+  codecvt_utf16<char32_t, 0x10FFFF, m|std::little_endian> conv_le;
+
+  len = conv_le.length(st, src + 1, src_end, 1);
+  VERIFY( len == 4 );
+  len = conv_le.length(st, src + 1, src_end, 2);
+  VERIFY( len == 6 );
+  len = conv.length(st, src + 1, src_end, -1ul);
+  VERIFY( len == 10 );
+
+  res = conv_le.in(st, src + 1, src_end, src_cnext, dst, dst_end, dst_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( dst[0] == 0x0061 );
+  VERIFY( dst[1] == 0xabcd );
+  VERIFY( dst[2] == 0x012345 );
+  VERIFY( src_cnext == src_end );
+  VERIFY( dst_next == dst_end );
+
+  res = conv_le.out(st, dst, dst_end, dst_cnext, out + 1, out_end, out_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( out_next == out_end );
+  VERIFY( dst_cnext == dst_end );
+  VERIFY( out[1] == src[2] );
+  VERIFY( out[2] == src[1] );
+  VERIFY( out[3] == src[4] );
+  VERIFY( out[4] == src[3] );
+  VERIFY( out[5] == src[6] );
+  VERIFY( out[6] == src[5] );
+  VERIFY( out[7] == src[8] );
+  VERIFY( out[8] == src[7] );
+  VERIFY( out[9] == src[10] );
+  VERIFY( out[10] == src[9] );
+}
+
+void
+test03()
+{
+#ifdef _GLIBCXX_USE_WCHAR_T
+  mbstate_t st;
+  constexpr codecvt_mode m = std::consume_header|std::generate_header;
+  codecvt_utf16<wchar_t, 0x10FFFF, m> conv;
+  const char src[] = "-\xFE\xFF\0\x61\xAB\xCD\xD8\x08\xDF\x45";
+  const size_t in_len = sizeof(wchar_t) == 4 ? 11 : 7;
+  const size_t out_len = sizeof(wchar_t) == 4 ? 3 : 2;
+  const char* const src_end = src + in_len;
+
+  int len = conv.length(st, src + 1, src_end, 1);
+  VERIFY( len == 4 );
+  len = conv.length(st, src + 1, src_end, 2);
+  VERIFY( len == 6 );
+  if (sizeof(wchar_t) == 4)
+  {
+    len = conv.length(st, src + 1, src_end, -1ul);
+    VERIFY( len == 10 );
+  }
+
+  wchar_t dst[out_len];
+  wchar_t* const dst_end = dst + out_len;
+  wchar_t* dst_next;
+  const char* src_cnext;
+  auto res = conv.in(st, src + 1, src_end, src_cnext, dst, dst_end, dst_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( dst[0] == 0x0061 );
+  VERIFY( dst[1] == 0xabcd );
+  if (sizeof(wchar_t) == 4)
+    VERIFY( dst[2] == 0x012345 );
+  VERIFY( src_cnext == src_end );
+  VERIFY( dst_next == dst_end );
+
+  char out[sizeof(src)] = { src[0] };
+  char* const out_end = out + in_len;
+  char* out_next;
+  const wchar_t* dst_cnext;
+  res = conv.out(st, dst, dst_end, dst_cnext, out + 1, out_end, out_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( out_next == out_end );
+  VERIFY( dst_cnext == dst_end );
+  VERIFY( out[1] == src[1] );
+  VERIFY( out[2] == src[2] );
+  VERIFY( out[3] == src[3] );
+  VERIFY( out[4] == src[4] );
+  VERIFY( out[5] == src[5] );
+  VERIFY( out[6] == src[6] );
+  if (sizeof(wchar_t) == 4)
+  {
+    VERIFY( out[7] == src[7] );
+    VERIFY( out[8] == src[8] );
+    VERIFY( out[9] == src[9] );
+    VERIFY( out[10] == src[10] );
+  }
+
+  codecvt_utf16<wchar_t, 0x10FFFF, m|std::little_endian> conv_le;
+
+  len = conv_le.length(st, src + 1, src_end, 1);
+  VERIFY( len == 4 );
+  len = conv_le.length(st, src + 1, src_end, 2);
+  VERIFY( len == 6 );
+  if (sizeof(wchar_t) == 4)
+  {
+    len = conv.length(st, src + 1, src_end, -1ul);
+    VERIFY( len == 10 );
+  }
+
+  res = conv_le.in(st, src + 1, src_end, src_cnext, dst, dst_end, dst_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( dst[0] == 0x0061 );
+  VERIFY( dst[1] == 0xabcd );
+  if (sizeof(wchar_t) == 4)
+    VERIFY( dst[2] == 0x012345 );
+  VERIFY( src_cnext == src_end );
+  VERIFY( dst_next == dst_end );
+
+  res = conv_le.out(st, dst, dst_end, dst_cnext, out + 1, out_end, out_next);
+  VERIFY( res == codecvt_base::ok );
+  VERIFY( out_next == out_end );
+  VERIFY( dst_cnext == dst_end );
+  VERIFY( out[1] == src[2] );
+  VERIFY( out[2] == src[1] );
+  VERIFY( out[3] == src[4] );
+  VERIFY( out[4] == src[3] );
+  VERIFY( out[5] == src[6] );
+  VERIFY( out[6] == src[5] );
+  if (sizeof(wchar_t) == 4)
+  {
+    VERIFY( out[7] == src[8] );
+    VERIFY( out[8] == src[7] );
+    VERIFY( out[9] == src[10] );
+    VERIFY( out[10] == src[9] );
+  }
+#endif
+}
+
+int
+main()
+{
+  test01();
+  test02();
+  test03();
+}

Re: [PATCH] Various fixes for facets

Reply via email to