It is important to understand that an XML DOM does not capture all of the constraints and referential requirements within an ODF document. In particular, content.xml does not contain everything: there are references to other files, made with XLink (relative hrefs) and with special identifiers (not IDREFs), whether to binary attachments or into other defined parts (styles.xml and meta.xml, to name two).
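To make the cross-file point concrete, here is a minimal Python sketch that lists the relative xlink:href values in content.xml and checks whether each one resolves to an entry in the package. The function names and the toy package are my own inventions for illustration; the package is not a valid ODF document (no mimetype or manifest), and the check ignores fragment identifiers and anything the manifest might say.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def package_relative_hrefs(odf_bytes, part="content.xml"):
    """Return (href, resolves) pairs for package-relative xlink:href values."""
    with zipfile.ZipFile(io.BytesIO(odf_bytes)) as package:
        entries = set(package.namelist())
        root = ET.fromstring(package.read(part))

    results = []
    for element in root.iter():
        href = element.get(XLINK_HREF)
        if href is None:
            continue
        # Skip absolute URLs, same-part fragments, and paths that
        # point outside the package; none of these name a package entry.
        if "://" in href or href.startswith(("#", "mailto:", "../")):
            continue
        target = href[2:] if href.startswith("./") else href
        # A target may be a file entry or a sub-document directory.
        resolves = target in entries or any(
            name.startswith(target + "/") for name in entries
        )
        results.append((href, resolves))
    return results


def build_demo_package():
    """Build a toy zip (hypothetical content, NOT a valid ODF file)."""
    content = (
        "<office:document-content "
        'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
        'xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" '
        'xmlns:xlink="http://www.w3.org/1999/xlink">'
        '<draw:image xlink:href="Pictures/logo.png"/>'
        '<draw:image xlink:href="Pictures/missing.png"/>'
        '<draw:object xlink:href="./Object 1"/>'
        "</office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("content.xml", content)
        package.writestr("Pictures/logo.png", b"not really a PNG")
        package.writestr("Object 1/content.xml", "<x/>")
    return buffer.getvalue()


if __name__ == "__main__":
    for href, resolves in package_relative_hrefs(build_demo_package()):
        print(href, "ok" if resolves else "DANGLING")
```

Nothing in the DOM of content.xml by itself can tell you that Pictures/missing.png is absent; that knowledge lives at the package level.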
There is also considerable internal structuring that is off-hierarchy. Some of the connections are via fragment IDs (xml:id) and IDREFs; others are by identifiers (not IDs and IDREFs) that are introduced in the ODF specification but are not modelled in the Relax NG schema (beyond saying that they have string values, for example).

This sort of thing also happens rather heavily in OOXML, where communication among parts uses a unique cross-part relationship model. There are also many cross-references to named components by means other than XML IDs and IDREFs, whether or not the components and the references occur in the same part of the OPC package.

One could continue the kind of hack that plants that information as benign markers in an internal form of the XML parts (even as a single XML document, although that is tricky when ODF documents are nested as subdocuments of another), so long as the markers are replaced when the XML document is committed to a saved ODF document file. In terms of having a DOM that maps to the external file form and a different internal model, the only time that the internal model needs to update the externally-oriented DOM is as part of a Save operation. There might be more coupling, but performance and storage issues will doubtless affect the engineering outcome, especially for handling large documents with alacrity. Copy and paste and undo management will also be factors, along with maintaining pagination, word counts, and such.

On the other hand, it is convenient (practically necessary) to specify the semantics of ODF, or some profile of ODF, as if operations are on the format itself, since it is only the format that is more-or-less well-specified. It would be interesting to know how much this could be taken literally in an application. I think there might be forensic tools for ODF documents that could operate that way.
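As one illustration of a name-based reference that is neither an ID nor an IDREF, and of a forensic-style check that works directly on the format, here is a rough Python sketch that looks for text:style-name values with no matching style:name in either content.xml or styles.xml. The function names and the sample XML are hypothetical, and the check is deliberately naive: it ignores style families, list styles, default styles, and every other kind of named reference in ODF.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def defined_style_names(*parts_xml):
    """Collect style:name values from <style:style> elements in each part."""
    names = set()
    for xml_text in parts_xml:
        for style in ET.fromstring(xml_text).iter(f"{{{STYLE_NS}}}style"):
            name = style.get(f"{{{STYLE_NS}}}name")
            if name:
                names.add(name)
    return names


def dangling_style_references(content_xml, styles_xml):
    """List text:style-name values in content.xml that nothing defines."""
    defined = defined_style_names(content_xml, styles_xml)
    dangling = []
    for element in ET.fromstring(content_xml).iter():
        ref = element.get(f"{{{TEXT_NS}}}style-name")
        if ref is not None and ref not in defined:
            dangling.append(ref)
    return dangling


# Hypothetical sample parts, trimmed to the bare minimum for the sketch.
CONTENT_XML = (
    f'<office:document-content xmlns:office="{OFFICE_NS}" '
    f'xmlns:style="{STYLE_NS}" xmlns:text="{TEXT_NS}">'
    "<office:automatic-styles>"
    '<style:style style:name="P1" style:family="paragraph"/>'
    "</office:automatic-styles>"
    "<office:body><office:text>"
    '<text:p text:style-name="P1">defined in this part</text:p>'
    '<text:p text:style-name="Heading">defined in styles.xml</text:p>'
    '<text:p text:style-name="Ghost">defined nowhere</text:p>'
    "</office:text></office:body>"
    "</office:document-content>"
)

STYLES_XML = (
    f'<office:document-styles xmlns:office="{OFFICE_NS}" '
    f'xmlns:style="{STYLE_NS}">'
    "<office:styles>"
    '<style:style style:name="Heading" style:family="paragraph"/>'
    "</office:styles>"
    "</office:document-styles>"
)

if __name__ == "__main__":
    print(dangling_style_references(CONTENT_XML, STYLES_XML))
```

A schema validator would accept all three paragraphs, since to the schema the attribute value is just a string; only a check that knows the ODF naming rules can see that one of them points at nothing.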
I'm not at all certain about production WYSIWYG consumers and producers, especially ones implemented to harmonize between OOXML, ODF and other interesting formats (EPUB coming to mind). I will watch Peter Kelly's efforts with great interest to see how much the boundaries can be moved in this area.

 -- Dennis E. Hamilton
    dennis.hamil...@acm.org    +1-206-779-9430
    https://keybase.io/orcmid  PGP F96E 89FF D456 628A
    X.509 certs used and requested for signed e-mail

----- Original Message -----
From: Peter Kelly [mailto:kelly...@gmail.com]
Sent: Monday, August 4, 2014 01:27
To: dev@openoffice.apache.org
Subject: Re: OOXML

On 4 Aug 2014, at 12:16 am, jan i <j...@apache.org> wrote:

[ ... ]

It's possible in theory, though I'm not familiar enough with the OO codebase to say whether it would work in practice. The key idea is to maintain two separate data structures - one which is the ODF XML trees, and another which is the internal representation. Any time a change gets made to the former, the implementation must update the latter to reflect the change. Modification operations on the latter would need to go in the other direction.

[ ... ]

In the case of UX Write, there's a few instances where I've used custom extensions to handle certain things. The main ones are:

1. Table of contents/list of tables/list of figures. When you insert one of these into your document, it inserts a <nav> element with a CSS class name of "tableofcontents", "listoffigures", or "listoftables", which were chosen as these are the same keywords that LaTeX uses for these features. UX Write treats these as having special meaning, in the sense that when opening a document (and when the document is modified), it updates the content of these <nav> elements based on the set of all heading, figure, or table elements in the document (including numbering/captions).

2. OOXML-specific features. When converting from .docx to .html during the process of opening a document, it assigns certain pre-defined CSS class names to particular types of HTML elements to indicate their purpose. For example, a cross-reference whose display format is supposed to include both the label and caption of a figure will be translated as:

[ ... ]