http://bugzilla.lyx.org/show_bug.cgi?id=4012
This patch fixes a severe issue with CJK languages: if a CJK encoding other than UTF-8 (such as EUC-JP) is used, multibyte characters simply vanish from the LaTeX output. This is data loss. Since UTF-8 is not available on all systems, this renders 1.5 useless for some CJK users.

The reason is that we only treat UTF-8 as a multibyte encoding, not the CJK encodings. The patch I came up with, with Georg's help, is rather simple: also consider the CJK encodings in iconv_codecvt_facet::do_max_length(), so that it reports more than one byte per character for them.

Due to the severity, I'd really like to put this into 1.5.0. José, can I do this?

Jürgen
Index: src/support/docstream.cpp
===================================================================
--- src/support/docstream.cpp	(Revision 19069)
+++ src/support/docstream.cpp	(Arbeitskopie)
@@ -206,13 +206,32 @@
 	}
 	virtual int do_max_length() const throw()
 	{
+		// FIXME: this information should be transferred to lib/encodings
 		// UTF8 uses at most 4 bytes to represent one UCS4 code point
 		// (see RFC 3629). RFC 2279 specifies 6 bytes, but that
 		// information is outdated, and RFC 2279 has been superseded by
 		// RFC 3629.
+		// The CJK encodings use (different) multibyte representation as well.
 		// All other encodings encode one UCS4 code point in one byte
 		// (and can therefore only encode a subset of UCS4)
-		return encoding_ == "UTF-8" ? 4 : 1;
+		// Note that BIG5 and SJIS do not work with LaTeX (see lib/encodings).
+		// Furthermore, all encodings that use shifting (like SJIS) do not work with
+		// iconv_codecvt_facet.
+		if (encoding_ == "UTF-8" ||
+		    encoding_ == "GB" ||
+		    encoding_ == "EUC-TW")
+			return 4;
+		else if (encoding_ == "EUC-JP")
+			return 3;
+		else if (encoding_ == "BIG5" ||
+		         encoding_ == "EUC-KR" ||
+		         encoding_ == "EUC-CN" ||
+		         encoding_ == "SJIS" ||
+		         encoding_ == "GBK" ||
+		         encoding_ == "JIS" )
+			return 2;
+		else
+			return 1;
 	}
 private:
 	/// Do the actual conversion. The interface is equivalent to that of
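
For reference, the lookup in the patch could also be written as a small table, which might make it easier to move the data to lib/encodings later, as the FIXME suggests. This is only a sketch: the function name max_bytes_per_code_point and the Entry struct are hypothetical and not part of LyX, and the byte counts are copied from the patch without further verification.

```cpp
#include <string>

// Hypothetical helper, not part of LyX: returns the maximum number of bytes
// one UCS4 code point may occupy in the named encoding. The values are
// copied from the patch and are not independently verified.
inline int max_bytes_per_code_point(std::string const & encoding)
{
	struct Entry { char const * name; int max_bytes; };
	static Entry const table[] = {
		{"UTF-8", 4}, {"GB", 4}, {"EUC-TW", 4},
		{"EUC-JP", 3},
		{"BIG5", 2}, {"EUC-KR", 2}, {"EUC-CN", 2},
		{"SJIS", 2}, {"GBK", 2}, {"JIS", 2},
	};
	for (Entry const & e : table)
		if (encoding == e.name)
			return e.max_bytes;
	// Every encoding not listed is treated as single-byte.
	return 1;
}
```

With such a helper, do_max_length() would reduce to a single call on encoding_, and the behaviour for unknown encodings stays the same as before (one byte per code point).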