Java does regex just fine, albeit more verbosely (when is Java not verbose ;-),
but my main point is that you already have (Java) tools that allow you to
have an XML view of the existing HTML manual (tagsoup, etc...). Leave
the parsing to these tools, and concentrate on transforming the
"loose" HTML schema into a more structured XML, probably using XSL as
the language rather than scripting. By adding a little more structure
to the HTML with <div>s, the XML view of the HTML could be complete
enough for robust transformation to XML, and perhaps even robust
enough so that the HTML remains as the official "source" document of
the manual (but stripped of all formatting, which would be added later
in the XML processing pipeline). The main advantage of this would be
that editing the manual in an HTML editor can be easier/nicer and
kinda wysiwyg, compared to editing the transformed XML.
I'm using libraries - I'm not writing my own html tokenizer :).
I'm aiming for a proof of concept script (for echo task) sometime in
the next week (if work doesn't get in the way too much). After that
I'll see how easy a refactoring job will be for making it generic.
From above, you can see that I envision the possibility of the HTML
manual remaining, so it's all the more important that the transform is
robust.
This suggests that the HTML manual is the "one true source" for the
manual, and all other versions are derived from it through some processing.
Talk about the tokenizer being too greedy makes me uneasy ;-) Leave
the parsing to existing parsing tools, and just manipulate the
structure of the document once it's been "reformatted" to a SAX event
stream. In this form, it feeds easily and naturally to an XSL
transform pipeline.
That's my view of the whole thing anyway ;-) --DD
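
For what it's worth, the parser -> SAX events -> XSL pipeline described above might look roughly like the sketch below. It is only an illustration under a few assumptions: the class name, the stylesheet, and the <div class="task"> markup convention are all made up for the example, and the JDK's own SAX parser stands in for tagsoup so the sketch runs without extra jars (which means the input here has to be well-formed). Tagsoup's org.ccil.cowan.tagsoup.Parser is also an XMLReader, so it should drop in at the marked line; as far as I know it reports elements in the XHTML namespace by default, in which case the match patterns would need a namespace prefix.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class ManualPipelineSketch {

    // Hypothetical stylesheet: turns each <div class="task"> into a <task> element.
    // The div/h2/p convention is an assumption, not the manual's actual markup.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<manual><xsl:apply-templates select=\"//div[@class='task']\"/></manual>"
      + "</xsl:template>"
      + "<xsl:template match=\"div[@class='task']\">"
      + "<task name='{h2}'><description><xsl:value-of select='p'/></description></task>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Parses the HTML into SAX events and feeds them straight into the XSL transform.
    static String transform(String html, String xslt) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Stand-in parser. With tagsoup on the classpath this could instead be:
        //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        XMLReader reader = factory.newSAXParser().getXMLReader();

        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));

        StringWriter out = new StringWriter();
        transformer.transform(
            new SAXSource(reader, new InputSource(new StringReader(html))),
            new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><div class='task'><h2>echo</h2>"
                    + "<p>Echoes a message.</p></div></body></html>";
        System.out.println(transform(html, XSLT));
    }
}
```

The point of the design is that no code here touches HTML syntax at all: the parser owns tokenizing, and everything about the manual's structure lives in the stylesheet, which is the part that would grow as more tasks are covered.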
From this discussion my understanding is:
1 - Better to use Java + libs - presumably so that an Ant task can be
derived from it (Ant creating its own manual would be rather nice, I'd
have to say)
2 - Conversion util must be robust
3 - Conversion util will be long-lived
4 - Modification of existing HTML to make it easier for conversion util
would probably be a good idea
5 - Structure of XML is as yet undecided - DocBook with RELAX NG has now
been suggested as an alternative to a bespoke XML format
Kev