> I have been looking at the formats of documents and it seems like
> there is NO standard format in which to write documents.
>
> -There seem to exist several different versions of RTF, so this is no
> standard.
> -XMGL seems to be a standard for the future, but word-processors
> for it do exist, at least in the low-end market.
> -The only standard right now for this is HTML and the most extended
> version (I think it is 3) does not support formulas. HTML 4 does
> seem to support them. But there already exists a latex2html converter
> which works quite well.
>
> Taking all this into account, I am dropping the idea of making a
> lyx2rtf (and vice versa) converter. The original idea was to be able to
> interchange documents with Windows word-processors, but this seems to
> be a difficult task given the above conditions.
>
> The non-existence of a standard format is really bad news for
> everybody. What is happening is that the format being imposed is M$
> Word.
Yes, I agree that it's sad that there is no common, good format that is
easily interchanged.
However, I doubt this will ever happen, because word processors are so
different. At one extreme, we have raw desktop-publishing systems such
as FrameMaker and PageMaker (or arguably Corel Draw). They use a document
model where lots of coordinates and concrete attributes are needed, so
that the output can be controlled in every detail.
At the other end, we have something like "ideal" DocBook, where you only
specify what the different elements in the document are, and not how they
look.
In between there are lots of hybrids. LyX belongs closer to DocBook than
to PageMaker, but because we can control some aspects of output, elements
of the DTP kind of document have to exist alongside some aspects
of the logical information in DocBook.
Word lies closer to DTP than does LyX, but still Word has features for
logical editing with styles and other "mark-up".
So, if you can create a hybrid document format which caters for both the
extreme DTP end of things, and the completely logical end of things, one
could think that you would be all set. But this is a false impression,
because even if such a format exists (and it arguably does to some extent
in the combination of DSSSL and DocBook), you still have to rewrite all
word processors out there to support all the information in the format.
What good is it to have all the information about a document saved
if your word processor converts everything it doesn't natively
support into something else?
In essence, you have to change all word processors to become both a DTP
system and a logical editing system at the same time. This is simply
not practical. I know of no existing systems that can do that. The domains
of the problem are too different, and technology as we know it is not
good enough to solve the problem of ambiguity in the information.
The development over time, as far as I can tell, is that a given word processor
starts in one area, and moves out to cover more ground. For instance,
TeX started out mostly as a kind of DTP system with built-in constraints
to help enforce great output quality, but in essence, TeX is not much
more than a DTP engine.
LaTeX was built on top of this to make it a more logical document system,
and thus the domain of LaTeX in a sense covers more ground than the raw
TeX. (Of course this distinction is arbitrary, because LaTeX is implemented
in TeX, but if you look at what information you can extract from a typical
LaTeX document compared to a typical TeX document, you'll find more logical
information in the LaTeX one.)
Similarly with Microsoft Word. It started out as the "glorified"
typewriter that John Weiss has talked about in our introduction. Later, more
and more advanced DTP elements were introduced: upside-down text, pictures
and drawings in the middle of everything and so forth. At the same time,
Word also introduced more and more logical editing in the form of styles.
So now, Word is a strange hybrid of DTP and logical editing. The consequence
is that most Word users don't know how to use anything.
And the story is the same for FrameMaker, WordPerfect, PageMaker and
so forth.
LyX tries to stand out in the sense that, although we do support some
DTP features, we try to emphasize the logical aspect of writing. All
the DTP features are hidden away in a dialog somewhere, so that hopefully
the user will think twice before increasing the font size, and choose
the "logical" solution of using a paragraph environment instead.
By making it much easier to use the logical method, the user will hopefully
prefer this one.
And for all of these reasons, even if it might be possible to write a
filter that will import Word documents, and convert all the DTP information
into a form that LyX understands (even though that will be very hard in
itself, and it will mangle information), the end result would typically
be a document which does not fit into the LyX document model.
Unless the Word document was strictly written to be interchangeable with
the LyX document model, you would have problems.
So, in essence, what one could do is to try to focus on one aspect
of the problem instead of the entire problem. For instance, it would
make great sense to choose a middle format which supports a few logical
elements that are highly unambiguous, and in a sense common to all
word processors.
One useful goal would be to be able to extract the text of a document,
and more or less ignore the formatting, which is bound to be different
from system to system anyway.
I propose that such a middle format exists in the form of "clean"
HTML, i.e. HTML where you don't use tables for layout purposes,
where all <font> tags are left out, and so on.
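
To make "clean" a bit more concrete, here is a rough sketch in Python of
a filter that throws away presentational tags and attributes and keeps
the rest. It only uses the standard html.parser module. The tag and
attribute lists are my own assumptions, not a definition of clean HTML,
and the sketch makes no attempt to detect tables that are used for layout.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: which tags count as purely presentational is a guess.
PRESENTATIONAL_TAGS = {"font", "center"}
# Assumption: only these attributes carry content rather than appearance.
KEPT_ATTRIBUTES = {"href", "src", "alt"}


class CleanHTMLFilter(HTMLParser):
    """Re-emit HTML without presentational tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag in PRESENTATIONAL_TAGS:
            return  # drop the tag itself, keep the text inside it
        kept = ""
        for name, value in attrs:
            if name in KEPT_ATTRIBUTES:
                kept += ' {}="{}"'.format(name, escape(value or ""))
        self.pieces.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag not in PRESENTATIONAL_TAGS:
            self.pieces.append("</{}>".format(tag))

    def handle_data(self, data):
        # Character references were decoded by the parser; encode them again.
        self.pieces.append(escape(data, quote=False))


def clean_html(source):
    """Return source with presentational mark-up removed."""
    parser = CleanHTMLFilter()
    parser.feed(source)
    parser.close()
    return "".join(parser.pieces)


sample = '<h1><font size="5">Title</font></h1><p align="center">A &amp; B</p>'
print(clean_html(sample))  # prints <h1>Title</h1><p>A &amp; B</p>
```

The heading and the paragraph survive, while the font size and the
alignment are gone, which is the kind of loss I would accept in a
middle format.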
Alternatively, a stripped down version of DocBook could serve the
purpose as well, but since Word is not likely to support this, this
is more problematic.
So, one approach would be to evaluate the HTML output from Word, and
if it uses things like <h1> and <dl> as representations of the corresponding
constructs, it would make sense to write an HTML import filter for LyX
and let that serve as the "best" middle ground, that would convert a
document from Word into a form which is useful inside LyX (and back of
course).
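
As a first step in that evaluation, one could simply count how many
logical and how many presentational tags an exported file contains. The
following Python sketch does that with the standard html.parser module.
Which tags go in which set is my own assumption, and I have not run it
on real Word output, so treat it as a starting point only.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: tags that describe what an element is.
LOGICAL_TAGS = {
    "h1", "h2", "h3", "h4", "h5", "h6", "p", "ul", "ol", "li",
    "dl", "dt", "dd", "blockquote", "pre",
}
# Assumption: tags that mostly describe how something looks.
PRESENTATIONAL_TAGS = {"font", "center", "table"}


class TagCounter(HTMLParser):
    """Count how often each start tag occurs."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def audit(source):
    """Return (logical, presentational) start-tag counts for source."""
    counter = TagCounter()
    counter.feed(source)
    counter.close()
    logical = sum(n for t, n in counter.counts.items() if t in LOGICAL_TAGS)
    presentational = sum(
        n for t, n in counter.counts.items() if t in PRESENTATIONAL_TAGS
    )
    return logical, presentational
```

If the export turns out to consist mostly of <font> and <table> tags,
an import filter built on it would have little logical structure to
work with.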
Greets,
Asger