Hebrew encoding (cp1255)

Dov Feldstern Thu, 28 Dec 2006 15:32:58 -0800

Georg Baum wrote:

On Sunday 24 December 2006 01:02, Dov Feldstern wrote:

Okay, this is the issue: if a paragraph's language is Hebrew, then the
encoding is identified correctly using "use language's default
encoding". However, if the paragraph's language is English, but you have
some Hebrew words in it, then LyX complains about the encoding.
Switching manually to cp1255 solves this. Perhaps auto identification of
the encoding by language is done at the paragraph level, not the
individual character level?

Yes. The first character determines the encoding of the whole paragraph. I don't know why; maybe it is a latex limitation.


Georg

Ok, I see the problem. The thing is that earlier versions of LyX dealt with this correctly. So I compared what's happening now with what happens in 1.4.2:

In 1.4.2, one sets the encoding to "LaTeX default", which is different from "use language's default encoding": "LaTeX default" means that the generated latex file doesn't include any explicit encoding information at all. Rather, LyX lets latex decide what the encoding is, based on the language. (See attached heb142-default.{lyx,tex}). Latex apparently does a good job of this, as the generated output is correct, even in the case of a Hebrew word in an English paragraph. OTOH (still in 1.4.2), "use language's default encoding" means that LyX generates the latex file with explicit encoding commands, using the inputenc package. (See attached heb142-auto.{lyx,tex}). And that's where the limitation of inputenc comes into play, and then we see that the Hebrew words in English paragraphs are not encoded correctly.

So basically, in pre-1.5, the solution is just to use the "default" encoding, rather than "auto".

Now we move to 1.5.0: when you try to use "default", you get the following message in the stderr:


Unknown inputenc value `default'. Using `auto' instead.

So now there's no way to generate the latex file without the explicit encodings, which means that we're stuck with the problem I originally described, because of inputenc's limitations.

One solution would be to see if we could fix this using a newer version of inputenc, as Jean-Marc suggested. But perhaps we could solve this by again using the "default" encoding option? I realize that in 1.5.0 it's harder than in previous versions: now LyX itself has to know what the encoding should be, so that it can generate the latex file correctly. OTOH, it *should* already know that --- it's explicitly writing that information to the generated latex file! So all that really needs to be done is to *not* write the explicit encoding commands to the generated latex file, if the "default" encoding option is chosen.

But here's where the second problem arises, and this time it's LyX's problem, not latex's (though I'm less sure about this part): it seems to me like LyX itself --- not only latex --- is also determining the encoding based on the paragraph, rather than based on the individual characters' language. And that means that it fails to generate the latex file correctly, because while trying to generate the latex file it suddenly has Hebrew characters in the middle of a paragraph which it thought was English. So iconv fails, with an error message like:

Error 84 returned from iconv: Invalid or incomplete multibyte or wide character

and the latex file is not generated. In fact, if you look in the temporary directory, you can see that the latex file is not fully generated --- it is generated only up to the problematic paragraph, and that's where it ends.
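To illustrate the failure mode described above, here is a minimal Python sketch. It is not LyX's actual code: the paragraph list, the language-to-encoding table and the export() helper are all assumptions made up for the illustration, and Python's own codecs stand in for iconv. It only shows how paragraph-level encoding leaves a truncated output when a Hebrew word appears inside an English paragraph.

```python
import io

# "Hebrew" written in Hebrew, given as escapes so the source stays ASCII.
HEBREW_WORD = "\u05e2\u05d1\u05e8\u05d9\u05ea"

# Hypothetical document model: (paragraph language, text).
PARAGRAPHS = [
    ("english", "this is english"),
    ("hebrew", HEBREW_WORD),
    ("english", "english with " + HEBREW_WORD + " works fine."),
    ("english", "never reached"),
]

# Assumed mapping from language to encoding, for this sketch only.
ENCODING_FOR = {"english": "latin-1", "hebrew": "cp1255"}


def export(paragraphs):
    """Encode each paragraph using its paragraph-level encoding.

    Returns (bytes written so far, the error that stopped export or None).
    """
    out = io.BytesIO()
    for lang, text in paragraphs:
        try:
            out.write(text.encode(ENCODING_FOR[lang]) + b"\n\n")
        except UnicodeEncodeError as err:
            # Export stops here; earlier paragraphs are already in the output.
            return out.getvalue(), err
    return out.getvalue(), None


written, error = export(PARAGRAPHS)
print("paragraphs written:", written.count(b"\n\n"))
print("stopped by:", type(error).__name__)
```

In this sketch the first two paragraphs are written and the third (English with an embedded Hebrew word) stops the export, which matches the partially generated file seen in the temporary directory.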

If LyX would perform the conversions on a per-character basis (or rather, for consecutive characters with the same encoding), then it would at least be able to generate the latex file, and then we'd only be left with the first problem.
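As a rough sketch of what conversion "for consecutive characters with the same encoding" could look like, the Python below splits a string into runs and encodes each run separately. The encoding_of() rule (Hebrew Unicode block goes to cp1255, everything else to latin-1) and the encode_runs() helper are assumptions for illustration only, not how LyX decides encodings; LyX would presumably use the language set on each character instead.

```python
from itertools import groupby

# "Hebrew" written in Hebrew, given as escapes so the source stays ASCII.
HEBREW_WORD = "\u05e2\u05d1\u05e8\u05d9\u05ea"


def encoding_of(ch):
    """Assumed rule: characters in the Hebrew block use cp1255, others latin-1."""
    return "cp1255" if "\u0590" <= ch <= "\u05ff" else "latin-1"


def encode_runs(text):
    """Split text into maximal runs sharing one encoding and encode each run."""
    runs = []
    for enc, chars in groupby(text, key=encoding_of):
        run = "".join(chars)
        runs.append((enc, run.encode(enc)))
    return runs


runs = encode_runs("english with " + HEBREW_WORD + " works fine.")
for enc, data in runs:
    print(enc, data)
```

Each run carries its own encoding, so an exporter could emit an encoding switch before each run rather than once per paragraph. Whether inputenc then handles those switches correctly is the separate, first problem.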

The truth is that looking back, this problem is not very acute, in the sense that it's already possible to generate Hebrew correctly, by just setting the document encoding to cp1255 --- then everything is fine. However, when we finally move to "real unicode support", then this issue will have to be addressed --- because by setting the encoding to cp1255, we exclude all other encodings, which means, for example, that it would be impossible to write a document with Hebrew and Arabic intermixed. But again, this can wait, I guess, until the move to "real unicode". Also, perhaps addressing these same issues may solve other problems that people are seeing with encodings in various languages.
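The limitation of a single document-wide 8-bit encoding can be checked with a few lines of Python. This is only an illustration using Python's built-in codecs; the can_encode() helper and the sample strings are made up for the example.

```python
HEBREW = "\u05e2\u05d1\u05e8\u05d9\u05ea"  # "Hebrew" written in Hebrew
ARABIC = "\u0639\u0631\u0628\u064a"        # "Arabic" written in Arabic
MIXED = HEBREW + " " + ARABIC


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for enc in ("cp1255", "cp1256", "utf-8"):
    print(enc, can_encode(HEBREW, enc), can_encode(ARABIC, enc), can_encode(MIXED, enc))
```

Neither cp1255 nor cp1256 (the corresponding Windows code page for Arabic) can hold the mixed string, while a Unicode encoding such as UTF-8 can.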

#LyX 1.4.2 created this file. For more info see http://www.lyx.org/
\lyxformat 245
\begin_document
\begin_header
\textclass article
\language english
\inputencoding default
\fontscheme default
\graphics default
\paperfontsize default
\spacing single
\papersize default
\use_geometry false
\use_amsmath 1
\cite_engine basic
\use_bibtopic false
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\papercolumns 1
\papersides 1
\paperpagestyle default
\tracking_changes false
\output_changes false
\end_header

\begin_body

\begin_layout Standard
this is english
\end_layout

\begin_layout Standard

\lang hebrew
וזה, לעומת זאת, בעברית.
\end_layout

\begin_layout Standard
english with 
\lang hebrew
עברית
\lang english
 works fine.
\end_layout

\begin_layout Standard

\lang hebrew
עברית עם 
\lang english
english
\lang hebrew
 גם עובד בסדר.
\end_layout

\end_body
\end_document

heb142-default.tex
Description: TeX document

#LyX 1.4.2 created this file. For more info see http://www.lyx.org/
\lyxformat 245
\begin_document
\begin_header
\textclass article
\language english
\inputencoding auto
\fontscheme default
\graphics default
\paperfontsize default
\spacing single
\papersize default
\use_geometry false
\use_amsmath 1
\cite_engine basic
\use_bibtopic false
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\papercolumns 1
\papersides 1
\paperpagestyle default
\tracking_changes false
\output_changes false
\end_header

\begin_body

\begin_layout Standard
this is english
\end_layout

\begin_layout Standard

\lang hebrew
וזה, לעומת זאת, בעברית.
\end_layout

\begin_layout Standard
english with 
\lang hebrew
עברית
\lang english
 works fine.
\end_layout

\begin_layout Standard

\lang hebrew
עברית עם 
\lang english
english
\lang hebrew
 גם עובד בסדר.
\end_layout

\end_body
\end_document

heb142-auto.tex
Description: TeX document
