[patch] bug 4012: Reference eats up one char before when the char is double byte

Jürgen Spitzmüller Sat, 14 Jul 2007 04:55:09 -0700

http://bugzilla.lyx.org/show_bug.cgi?id=4012


This patch fixes a really severe issue with CJK languages: if a CJK encoding
other than UTF-8 (such as EUC-JP) is used, multibyte characters simply vanish
in the LaTeX output. This is data loss. Since UTF-8 is not available on all
systems, this renders 1.5 useless for some CJK users.

The reason is that we treat only UTF-8 as a multibyte encoding, not the CJK
encodings. The patch that I came up with, with Georg's help, is rather simple:
it just takes the CJK encodings into account in
iconv_codecvt_facet::do_max_length().
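
For illustration only, the lookup that the patch adds can be read as a
standalone function. This is a sketch, not the LyX code itself: the name
max_bytes_per_code_point is made up here, and the byte counts are simply the
ones used in the patch below.

```cpp
#include <string>

// Hypothetical helper (not part of LyX): maximum number of bytes that one
// UCS4 code point may occupy in the given encoding. The values mirror the
// ones returned by do_max_length() in the patch.
int max_bytes_per_code_point(std::string const & encoding)
{
	if (encoding == "UTF-8" || encoding == "GB" || encoding == "EUC-TW")
		return 4;
	if (encoding == "EUC-JP")
		return 3;
	if (encoding == "BIG5" || encoding == "EUC-KR" || encoding == "EUC-CN" ||
	    encoding == "SJIS" || encoding == "GBK" || encoding == "JIS")
		return 2;
	// Every other encoding is treated as one byte per code point.
	return 1;
}
```

Before the patch, the equivalent check returned 1 for everything except
"UTF-8", so EUC-JP was reported as a single-byte encoding.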

Due to the severity, I'd really like to put this into 1.5.0.

José, can I do this?

Jürgen

Index: src/support/docstream.cpp
===================================================================
--- src/support/docstream.cpp	(Revision 19069)
+++ src/support/docstream.cpp	(Arbeitskopie)
@@ -206,13 +206,32 @@
 	}
 	virtual int do_max_length() const throw()
 	{
+		// FIXME: this information should be transferred to lib/encodings
 		// UTF8 uses at most 4 bytes to represent one UCS4 code point
 		// (see RFC 3629). RFC 2279 specifies 6 bytes, but that
 		// information is outdated, and RFC 2279 has been superseded by
 		// RFC 3629.
+		// The CJK encodings use (different) multibyte representation as well.
 		// All other encodings encode one UCS4 code point in one byte
 		// (and can therefore only encode a subset of UCS4)
-		return encoding_ == "UTF-8" ? 4 : 1;
+		// Note that BIG5 and SJIS do not work with LaTeX (see lib/encodings). 
+		// Furthermore, all encodings that use shifting (like SJIS) do not work with 
+		// iconv_codecvt_facet.
+		if (encoding_ == "UTF-8" ||
+		    encoding_ == "GB" ||
+		    encoding_ == "EUC-TW")
+			return 4;
+		else if (encoding_ == "EUC-JP")
+			return 3;
+		else if (encoding_ == "BIG5" ||
+			 encoding_ == "EUC-KR" ||
+			 encoding_ == "EUC-CN" ||
+			 encoding_ == "SJIS" ||
+			 encoding_ == "GBK" ||
+			 encoding_ == "JIS" )
+			return 2;
+		else
+			return 1;
 	}
 private:
 	/// Do the actual conversion. The interface is equivalent to that of
