Re: about tex2lyx and Unicode

Georg Baum Mon, 13 Oct 2008 12:31:55 -0700

Uwe Stöhr wrote:

> "magic" does not help in general as we will still be stuck where we
> are. We need a general solution to be able to create a lyxformat newer
> than 248. I attached a LyX file and its TeX output. This one compiles fine
> with latex and pdflatex. The exercise we have is to import this TeX file
> to get the same as in the LyX file. (I already implemented tex2lyx support
> for the language handling, like \selectlanguage. What is missing is the
> handling of the encoding.)


I don't see the problem with the patch. IIRC a utf8 encoding (if specified
explicitly) is also valid for format 248. Of course the patch does not
solve the general unicode problem in tex2lyx. I am absolutely no fan of
magic either, but AFAICS in the xetex case there is no other way than to
apply magic, either like this or by trying to detect the used encoding.
Both methods are not 100% reliable, but work in most cases. So, the patch
improves the situation, but of course much more work is needed. In the long
run an autodetection of the encoding is probably better, because not all
xetex documents have this special comment. If the encoding can be overridden
from the command line (for the rare cases where the autodetection does not
work) every document can be converted. For the moment the
conclusion "special comment _and_ no inputenc command => utf8" is even 100%
safe: Both ascii and latin1 are a subset of utf8, and tex2lyx cannot handle
any other encoding anyway; it outputs hardcoded latin1 in several cases (bug
4299). Therefore even a hardcoded utf8 would be an improvement!
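
To make the decision order concrete, here is a minimal sketch in Python (not tex2lyx code). The function name, the fallback value and the exact form of the special comment are assumptions: this mail does not quote the comment, so a TeXShop-style "%!TEX encoding = UTF-8 Unicode" line is used as a stand-in. A command line override wins, then an explicit inputenc option, then the special comment, and only then the fallback.

```python
import re

# Assumption: the special comment looks like the TeXShop-style line
# "%!TEX encoding = UTF-8 Unicode"; its exact form is not quoted in this mail.
MAGIC_COMMENT = re.compile(r"^%\s*!TEX\s+encoding\s*=\s*UTF-8",
                           re.IGNORECASE | re.MULTILINE)
# Explicit declaration such as \usepackage[latin9]{inputenc}
INPUTENC = re.compile(r"\\usepackage\[([^\]]+)\]\{inputenc\}")


def guess_encoding(raw, override=None, fallback="latin1"):
    """Return an encoding name for the raw bytes of a TeX file.

    Hypothetical helper: override > inputenc option > special comment > fallback.
    """
    if override is not None:
        # Command line override for the rare cases where guessing fails.
        return override
    # latin-1 decoding never fails, and both patterns are pure ASCII,
    # so it is safe to use it just for scanning.
    text = raw.decode("latin-1")
    match = INPUTENC.search(text)
    if match:
        return match.group(1).strip()
    if MAGIC_COMMENT.search(text):
        # "special comment _and_ no inputenc command => utf8"
        return "utf8"
    return fallback
```

The sketch ignores commented-out \usepackage lines and inputenc calls with several options; a real implementation would have to handle both.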

> I prefer that we agree to a basic concept how to fix this. I proposed one
> that will work in all cases, but I don't know if iconv can handle that.

The natural way would be to convert tex2lyx to docstring, use an ifdocstream
to read the file (beginning with an ascii encoding), and switch the
encoding whenever a command like \inputencoding (or a magic comment, or, if
at the beginning, a non-ascii character) is read. This would work very
similarly to the LaTeX export mechanism, and should not be too difficult to
implement. When I implemented the codecvt facets I made sure that they work
both for output and input (I don't know if that is still the case).
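
The switching idea can be illustrated with a small Python sketch. It is only a stand-in for the ifdocstream/codecvt approach, not the LyX implementation, and the function name is hypothetical. It starts in ascii, decodes line by line, and changes the decoder for the following lines whenever it sees \inputencoding{...}. It assumes the LaTeX encoding name is also a valid Python codec name, which holds for latin1 and utf8 but not for every inputenc option.

```python
import re

# Matches e.g. \inputencoding{latin1} in the raw bytes of one line.
SWITCH = re.compile(rb"\\inputencoding\{([A-Za-z0-9_-]+)\}")


def decode_multi_encoding(raw, initial="ascii"):
    """Decode a byte string whose encoding changes at \\inputencoding commands.

    Simplification: a switch takes effect from the next line on, and the
    line carrying the command is still decoded with the old encoding.
    """
    encoding = initial
    decoded = []
    for line in raw.splitlines(keepends=True):
        decoded.append(line.decode(encoding))
        match = SWITCH.search(line)
        if match:
            # Assumption: the LaTeX name is usable as a Python codec name.
            encoding = match.group(1).decode("ascii")
    return "".join(decoded)
```

With this sketch a non-ascii byte before the first switch raises UnicodeDecodeError. That is the point where the magic comment or the autodetection would have to step in.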

> What I wanted to point out in my last comments is that a TeX file is in
> general a multi-encoding document. The encodings of the different document
> parts are given by the options of inputenc and \inputencoding; see also the
> attached TeX file.

Exactly. And you can have even more fun with the different methods to switch
encodings for CJK languages ;-)

> So my proposal is to read the encodings from the TeX file and convert the
> document parts via iconv to utf8 and then build the LyX file.

I believe that using the existing codecvt facets and idocstream is easier
than calling iconv directly, because it does not interfere with the
structure of the output.

BTW, does the fact that you are now working on this mean that the
tex2lyx-python ghost is finally dead?


Georg
