On Fri, Oct 23, 2009 at 12:36 AM, rgheck <rgh...@bobjweil.com> wrote:
>> Producing HTML output by parsing LyX documents is what eLyXer does. It
>> has so far proved to be an interesting challenge, and in my biased
>> opinion the results are better than any TeX->HTML converters or than
>> native HTML output from within LyX.
>
> Since the native HTML output from within LyX isn't fully implemented (to say
> the least), that is not a fair comparison. Once it has been, then there will
> be something to discuss.

That is kind of the point: native HTML output has been in the works
for a long time and it has not given the expected results. Now, I
don't know how much time you have devoted to it; but my guess is that
the problem is much more complex than what eLyXer set out to solve.

> But Pavel's point, and the point all along, has been that anything you can
> do by running some one-pass filter on the native format you can also do by
> recursing over the Buffer structure as LyX internally represents it. The
> converse is not true. That ought to be obvious. (It's probably even a
> theorem.) Whether what you can't do with a one-pass filter *matters* is
> perhaps what people disagree about.

We are not really interested in what is possible, rather in what is
easier to do. But finding out what is possible is an interesting
starting point. (Warning: long rant follows, possibly full of
inaccuracies and/or educated guesses, but I promise to avoid flames.)

Let us suppose that our goal is to know the coordinates of all
elements in LyX output; this would include positioning, sizes and
colors, ultimately to the last pixel on the screen for a certain
resolution and window size. The information in the internal LyX
structures (let's call it "LyX memory") is not enough to provide the
full screen output; it requires a lot of state bits contained in LyX
itself and in the preferences ("LyX state"). They are all needed to
know where to place every element on the screen in whole detail (for
example, that font size "Large" is actually 14 point, or whatever), so
you need "LyX memory + LyX state", where "+" means "combine". This is
similarly true for a TeX file, since you need to know how TeX
interprets every element; in this case you would need "TeX format +
TeX state". It is less true for PDF output where the positions of
elements are already included, and only fonts have to be rendered. In
fact you can render the PDF not as text but as segments of arc; in
this case it takes a lot more space but includes every pixel, or
alternatively you can include all fonts in the PDF. So "PDF format" on
its own is enough to render the document, and indeed different code
bases (AcroRead, evince, xpdf) yield very similar results.

Now, the situation with "LyX format" (contents of a .lyx file, as
opposed to "LyX memory") is exactly the same as with "LyX memory", if
you think about it: you still need "LyX format + LyX state" (a LyX
file + the LyX application + LyX preferences) to render every element.
So in fact the only difference is that with "LyX memory" you already
have all the necessary state loaded into memory, while with "LyX
format" (the file) you have to load it and create the structures that
make up "LyX memory". The information in "LyX format" and "LyX
memory" has to be equivalent except for a number of pieces of state
that can be taken from "LyX state", contrary to your "theorem" above.
      LyX format + LyX state -> LyX memory

We want to introduce a new element "HTML format", which is the exact
HTML file that generates output most similar to the LyX output. This
can be obtained from "LyX format + LyX state", but we need a new set
of variables that we will call "HTML state" to make the conversion.
This will say that e.g. font size "Large" is equivalent to "font-size:
12pt" in HTML.
      LyX format + LyX state + HTML state -> HTML format
The same is true with "LyX memory + LyX state": we will also need this
"HTML state" to obtain "HTML format". Also for "TeX format + TeX
state", although in this last case the required "HTML state" may
differ.
      TeX format + TeX state + HTML state' -> HTML format
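
As a rough illustration of what "HTML state" means here, a minimal
Python sketch follows. The table, the function name and the point
values are assumptions made up for the example; they are not taken
from LyX or eLyXer.

```python
from html import escape

# Hypothetical "HTML state": how each named LyX size is written in CSS.
# The values are illustrative only.
HTML_STATE = {
    "Normal": "font-size: 10pt",
    "Large": "font-size: 12pt",
}

def to_html(text, size_name, html_state):
    """Combine one piece of "LyX format" (text plus a size name)
    with "HTML state" to produce a fragment of "HTML format"."""
    style = html_state[size_name]
    return '<span style="%s">%s</span>' % (style, escape(text))

print(to_html("Hello", "Large", HTML_STATE))
# prints: <span style="font-size: 12pt">Hello</span>
```

A different table (the "HTML state'" used for TeX, say) would give
different output for the same input, which is the whole point of
keeping the state separate from the format.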

So, now we know what is possible. What is easy to do? It is easier to
start with "LyX memory" and combine it with "LyX state", since it's
all in memory, than to load "LyX format" and combine it with "LyX state".
But since LyX is a complex beast, it is not easy to recreate all the
graphical output in a different format like HTML to generate "HTML
format". The task with TeX is perhaps even more daunting; even if "TeX
state" is smaller than "LyX state" (I have absolutely no idea but I
guess not), the "TeX format" is far more complex and supports many
more things. This is all without taking into account raw TeX inserted
within LyX (the infamous ERT); with ERT then "LyX format" has to
include "TeX format", and "LyX state" includes "TeX state", so the
problem for LyX->HTML grows exponentially.
      LyX format + TeX format + LyX state + TeX state + HTML state -> HTML format

eLyXer takes a shortcut here to simplify the problem, and now we leave
the realm of theory and enter the subtleties of engineering: it takes
"LyX format" and combines it, not with "LyX state + HTML state", but
with a new "eLyXer state" which yields an approximation to "HTML
format"; we will call the result "eLyXer format".
      LyX format + eLyXer state -> eLyXer format ~ HTML format
The output is similar in most cases to the true "HTML format", but the
objective is not to achieve 100% compatibility -- for starters, ERT is
not even considered. (But, and this is important, the new native HTML
output does not try to generate HTML from ERT either, so it will
never be perfect.) The problem of combining "LyX format +
eLyXer state" to generate "eLyXer format" is orders of magnitude
simpler than "TeX format + TeX state", since "LyX format" is much
simpler than "TeX format" and "eLyXer state" much simpler than "TeX
state": eLyXer is about 5k lines long, and its content is focused on
generating HTML. The same is true for "LyX memory + LyX state": "LyX
format" is as complex as "LyX memory" (or simpler) and "eLyXer state"
is still much simpler than "LyX state". Actually the problem is made
even simpler by dividing the result into "eLyXer format" (the output of
a conversion) and "CSS format", a CSS file which contains most
presentation information and which does not change.
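
The split between converted output and a fixed CSS file can be
sketched as a one-pass filter in Python. This is a toy under stated
assumptions, not eLyXer's actual code: it only understands the
\begin_layout / \end_layout pairs of a .lyx file, and the function
name and the CSS rule are hypothetical.

```python
from html import escape

# Fixed "CSS format": presentation lives here and never changes
# between conversions. The rule itself is made up for illustration.
CSS = "div.Title { font-size: 20pt; text-align: center; }"

def convert(lines):
    """One pass over simplified LyX-like lines.
    Emits HTML that carries only class names, no presentation."""
    out = []
    layout = None
    text = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("\\begin_layout "):
            layout = line.split(" ", 1)[1]
            text = []
        elif line == "\\end_layout":
            if layout is not None:
                body = escape(" ".join(text))
                out.append('<div class="%s">%s</div>' % (escape(layout), body))
            layout = None
        elif layout is not None and line.strip():
            text.append(line.strip())
    return out

sample = [
    "\\begin_layout Title",
    "My document",
    "\\end_layout",
    "\\begin_layout Standard",
    "Some <text>.",
    "\\end_layout",
]
print("\n".join(convert(sample)))
```

The filter never needs to know what "Title" looks like; that
knowledge sits in the CSS, which is why the converter can stay small.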

Some of the consequences should be easy to guess: eLyXer generates
acceptable HTML for a growing subset of LyX documents; it can generate
acceptable HTML almost from the beginning. While it will never be
perfect, nevertheless it should be good enough for a growing (albeit
much more slowly) percentage of LyX users. The discussion at this
point centers around what these percentages are, as you hint above:
even if format coverage grows linearly, there will always be someone
using an odd feature not covered (or not looking good) in eLyXer. We
can guess the typical S-shaped curve that grows slowly first, very
quickly at a given point and then slows down again, asymptotically
approaching 100% as format coverage grows. We don't know if right now
eLyXer has the potential to serve well 50% of LyX users or 90%, and we
probably don't have any way to know barring an extensive survey; but
it will grow slowly from now on.
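
To make the S-shaped curve concrete, here is a logistic function in
Python. The midpoint and steepness are arbitrary assumptions chosen
for illustration; nothing here is measured from real LyX users.

```python
import math

def users_served(coverage, midpoint=0.5, steepness=10.0):
    """Logistic curve: fraction of users served (0..1) for a given
    format coverage (0..1). Parameters are illustrative guesses."""
    return 1.0 / (1.0 + math.exp(-steepness * (coverage - midpoint)))

for c in (0.1, 0.5, 0.9):
    print("coverage %.0f%% -> users %.0f%%" % (c * 100, users_served(c) * 100))
```

The curve rises fastest around the midpoint and flattens as it
approaches 100%, so equal steps in coverage buy fewer and fewer
additional users near the top.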

But there is an additional consequence not usually considered by LyX
developers: since "eLyXer state" is independent of "LyX state", and
much simpler, eLyXer enables new uses for "LyX format" (i.e. .lyx
files) for people that won't or can't install LyX. Examples: remote
servers, low-powered machines, odd platforms that LyX does not support
but Python does:
  http://www.python.org/download/other/
And also command line pipes, background operations... so being kept as
a standalone is also an important goal for eLyXer, and one that
potentially extends the uses of LyX at the same time.

> But whatever. We've been over this ground before, as Pavel said, and I
> certainly have no interest in rehashing it. Nor, I'm guessing, does anyone
> else.

If you have no interest in rehashing it, then don't; but please do not
assume that nobody else has any.

Alex.
