On Fri, Oct 23, 2009 at 12:36 AM, rgheck <rgh...@bobjweil.com> wrote:
>> Producing HTML output by parsing LyX documents is what eLyXer does. It
>> has so far proved to be an interesting challenge, and in my biased
>> opinion the results are better than any TeX->HTML converters or than
>> native HTML output from within LyX.
>
> Since the native HTML output from within LyX isn't fully implemented (to say
> the least), that is not a fair comparison. Once it has been, then there will
> be something to discuss.

That is kind of the point: native HTML output has been in the works for
a long time and it has not given the expected results. Now, I don't know
how much time you have devoted to it; but my guess is that the problem
is much more complex than what eLyXer set out to solve.

> But Pavel's point, and the point all along, has been that anything you can
> do by running some one-pass filter on the native format you can also do by
> recursing over the Buffer structure as LyX internally represents it. The
> converse is not true. That ought to be obvious. (It's probably even a
> theorem.) Whether what you can't do with a one-pass filter *matters* is
> perhaps what people disagree about.

We are not really interested in what is possible, but rather in what is
easier to do. Still, finding out what is possible is an interesting
starting point.

(Warning: long rant follows, possibly full of inaccuracies and/or
educated guesses, but I promise to avoid flames.)

Let us suppose that our goal is to know the coordinates of all elements
in LyX output; this would include positioning, sizes and colors,
ultimately down to the last pixel on the screen for a certain resolution
and window size.

The information in the internal LyX structures (let's call it "LyX
memory") is not enough to produce the full screen output; it also
requires a lot of state contained in LyX itself and in the preferences
("LyX state"). All of it is needed to know where to place every element
on the screen in full detail (for example, that font size "Large" is
actually 14 point, or whatever), so you need "LyX memory + LyX state",
where "+" means "combine".

This is similarly true for a TeX file, since you need to know how TeX
interprets every element; in this case you would need "TeX format + TeX
state". It is less true for PDF output, where the positions of elements
are already included and only fonts have to be rendered.
In fact a PDF can store text not as characters but as arc segments,
which takes a lot more space but fixes every pixel; alternatively you
can embed all the fonts in the PDF. So "PDF format" on its own is enough
to render the document, and indeed different code bases (AcroRead,
evince, xpdf) yield very similar results.

Now, the situation with "LyX format" (the contents of a .lyx file, as
opposed to "LyX memory") is exactly the same as with "LyX memory", if
you think about it: you still need "LyX format + LyX state" (a LyX file
+ the LyX application + LyX preferences) to render every element. So in
fact the only difference is that with "LyX memory" you already have all
the necessary state loaded into memory, while with "LyX format" (the
file) you have to load it and create the structures that make up "LyX
memory". The information in "LyX format" and in "LyX memory" has to be
equivalent, except for a number of pieces of state that can be taken
from "LyX state", contrary to your "theorem" above.

    LyX format + LyX state -> LyX memory

We want to introduce a new element, "HTML format", which is the exact
HTML file that generates the output most similar to the LyX output. This
can be obtained from "LyX format + LyX state", but we need a new set of
variables that we will call "HTML state" to make the conversion. This
will say that e.g. font size "Large" is equivalent to "font-size: 12pt"
in HTML.

    LyX format + LyX state + HTML state -> HTML format

The same is true with "LyX memory + LyX state": we will also need this
"HTML state" to obtain "HTML format". It also holds for "TeX format +
TeX state", although in this last case the required "HTML state" may
differ.

    TeX format + TeX state + HTML state' -> HTML format

So, now we know what is possible. What is easy to do? It is easier to
start with "LyX memory" and combine it with "LyX state", since it is all
in memory, than to load "LyX format" and combine it with "LyX state".
But since LyX is a complex beast, it is not easy to recreate all the
graphical output in a different format like HTML to generate "HTML
format". The task with TeX is perhaps even more daunting; even if "TeX
state" is smaller than "LyX state" (I have absolutely no idea, but I
guess not), the "TeX format" is considerably more complex and supports
many more things.

This is all without taking into account raw TeX inserted within LyX (the
infamous ERT); with ERT, "LyX format" has to include "TeX format", and
"LyX state" includes "TeX state", so the problem for LyX->HTML grows
exponentially.

    LyX format + TeX format + LyX state + TeX state + HTML state
        -> HTML format

eLyXer takes a shortcut here to simplify the problem, and now we leave
the realm of theory and enter the subtleties of engineering: it takes
"LyX format" and combines it, not with "LyX state + HTML state", but
with a new "eLyXer state" which yields an approximation to "HTML
format"; we will call the result "eLyXer format".

    LyX format + eLyXer state -> eLyXer format ~ HTML format

The output is similar in most cases to the true "HTML format", but the
objective is not to achieve 100% compatibility; for starters, ERT is not
even considered. (But, and this is important, the new native HTML output
does not try to generate HTML from ERT either, so it will never be
perfect either.)

The problem of combining "LyX format + eLyXer state" to generate "eLyXer
format" is orders of magnitude simpler than "TeX format + TeX state",
since "LyX format" is much simpler than "TeX format" and "eLyXer state"
is much simpler than "TeX state": eLyXer is about 5k lines long, and its
content is focused on generating HTML. The same is true for "LyX memory
+ LyX state": "LyX format" is as complex as "LyX memory" (or simpler)
and "eLyXer state" is still much simpler than "LyX state".
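
To make the shortcut "LyX format + eLyXer state -> eLyXer format" a bit
more concrete, here is a minimal Python sketch of a one-pass filter. It
is emphatically not eLyXer's actual code: the names LAYOUT_TAGS and
convert, the table contents and the sample input are all hypothetical,
and the sketch ignores insets, nested layouts, formatting commands and
ERT. It only assumes that paragraphs in a .lyx file are delimited by
\begin_layout and \end_layout lines.

```python
import html

# Stand-in for "eLyXer state": a fixed table from LyX layout names to
# HTML tags. The entries are assumptions chosen for this example.
LAYOUT_TAGS = {
    "Title": "h1",
    "Section": "h2",
    "Standard": "p",
}


def convert(lyx_text):
    """Make a single pass over .lyx lines and return an HTML fragment."""
    out = []
    tag = None   # HTML tag for the layout currently open, if any
    words = []   # text lines collected inside the open layout
    for line in lyx_text.splitlines():
        if line.startswith("\\begin_layout"):
            parts = line.split()
            name = parts[1] if len(parts) > 1 else ""
            # Unknown layouts degrade to a generic <div>.
            tag = LAYOUT_TAGS.get(name, "div")
            words = []
        elif line.startswith("\\end_layout"):
            if tag is not None:
                text = html.escape(" ".join(words))
                out.append(f"<{tag}>{text}</{tag}>")
            tag = None
        elif tag is not None and line.strip() and not line.startswith("\\"):
            # Plain text line; other backslash commands are skipped.
            words.append(line.strip())
    return "\n".join(out)


SAMPLE = "\n".join([
    "\\begin_layout Title",
    "Hello & welcome",
    "\\end_layout",
    "",
    "\\begin_layout Standard",
    "First line",
    "second line.",
    "\\end_layout",
])

if __name__ == "__main__":
    print(convert(SAMPLE))
```

The point of the sketch is only that the "state" needed here is a small
lookup table rather than a whole running application.
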
Actually the problem is made even simpler by dividing the result into
"eLyXer format" (the output of a conversion) and "CSS format", a CSS
file which contains most of the presentation information and which does
not change.

Some of the consequences should be easy to guess: eLyXer generates
acceptable HTML for a growing subset of LyX documents, and it could
generate acceptable HTML almost from the beginning. While it will never
be perfect, it should nevertheless be good enough for a growing (albeit
much more slowly growing) percentage of LyX users.

The discussion at this point centers around what these percentages are,
as you hint above: even if format coverage grows linearly, there will
always be someone using an odd feature not covered (or not looking good)
in eLyXer. We can guess at the typical S-shaped curve that grows slowly
at first, very quickly at a given point, and then slows down again,
asymptotically approaching 100% as format coverage grows. We don't know
whether eLyXer right now has the potential to serve 50% of LyX users
well or 90%, and we probably have no way to know short of an extensive
survey; but it will grow slowly from now on.

But there is an additional consequence not usually considered by LyX
developers: since "eLyXer state" is independent of "LyX state", and much
simpler, eLyXer enables new uses for "LyX format" (i.e. .lyx files) for
people who won't or can't install LyX. Examples: remote servers,
low-powered machines, odd platforms that are unsupported by LyX but not
by Python:
  http://www.python.org/download/other/
And also command line pipes, background operations... so remaining a
standalone tool is also an important goal for eLyXer, and one that
potentially extends the uses of LyX at the same time.

> But whatever. We've been over this ground before, as Pavel said, and I
> certainly have no interest in rehashing it. Nor, I'm guessing, does anyone
> else.

If you have no interest in rehashing it, then don't; but please do not
assume that nobody else does.

Alex.