On 3 Aug 2014, at 1:57 am, jan i <j...@apache.org> wrote:

> I too am on peter fast rolling waggon :-) but I am also confused.
>
> @peter maybe you could explain a couple of things, for non-document
> specialists:
>
> 1) Following your thought, with biderectional editors. Why would a editor
> have a home format ?

There are two ways to view a format: (1) as a way of encoding information for storage or transmission, and (2) as an in-memory data structure used by the editor at runtime. In some programs these are two different things, and in others they are the same. The latter is true of web browsers - HTML is both the file format and the runtime data model; the W3C DOM APIs can be used to manipulate the HTML structure directly. I believe this was also true to a large extent of the binary formats used by older versions of MS Office, for purposes of efficiency [1].

I'm not familiar with the internals of OpenOffice - one thing I'd be very interested to know is whether it uses ODF for its in-memory representation of the document. Or are the runtime data structures different from the XML trees that one finds in an ODF package?

> Following your thought to the end, the editor would always save/read in the
> format, and things not supported in the format with be saved as private.

The issue of how to handle features not supported by the format is a tricky one. My initial view is that those features are best disabled if the user chooses to save in that format (or alternatively a warning message shown on save), since even if private extensions were saved in the foreign format, they won't be supported in other apps, and are not guaranteed to be preserved (see further below).

> 2) When editing in format foo, one can expect that not all features are
> supported (like e.g. microsoft macros), these are handled as private
> containers.
>
> But looking at LO there seems to be huge challenges when doing especially
> copy/paste operations ?

Yes, this is a very tricky problem.
Even with a simple bidirectional transformation model, where you have a 1:1 mapping between elements in the concrete document and elements in the abstract document (concrete = original format, abstract = format used by the editor), it's not possible to know what should be done for elements that have been copied & pasted. One approach would be to make the mapping 1:n, where if an element in the abstract (editable) document is copied & pasted one or more times, then its corresponding element in the concrete document is also duplicated at save time when the file is updated. This can potentially violate uniqueness constraints though, e.g. if the element being copied is supposed to have a unique identifier, you can't just go making a direct copy of it, as you'd end up with two elements with the same identifier. However, if the implementation were aware of such uniqueness constraints for specific elements, it could ensure these are still respected, even if it doesn't support any other aspects of the element (e.g. editing or rendering). Cut & paste is much easier to handle, as it's equivalent to a move operation, which doesn't have any implications for uniqueness constraints.

> 3) If we save private info in .docx, how can be be sure that a microsoft
> editor does not destroy it ?
>
> Does the standard contain some rules about keeping private information ?

Well, we can never be *completely* sure that a Microsoft editor won't destroy something ;) Having said that though, there are a couple of provisions for this. One is simply the ability to include extra files in the package, labeled with a particular namespace. Each OOXML package contains a "relationship graph", which is a separate data structure from the zip file's directory hierarchy, and is what OOXML uses to identify "parts" (files) within the package. In principle, there should be no problem with simply adding an extra part with whatever namespace you like, and having it preserved.
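
As a rough illustration of the "extra part" approach (a sketch only, not what any particular implementation does), the following Python adds a private XML part to a package and registers it in the main document's relationships. The relationship type URI, the "rIdPrivate1" id, the part name and the function names are all made up for illustration, and the toy package stands in for a real .docx. For a real file, this also assumes that [Content_Types].xml already declares a default content type for the .xml extension.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
RELS_PATH = "word/_rels/document.xml.rels"
# Hypothetical relationship type identifying our private extension data.
CUSTOM_REL_TYPE = "http://example.org/relationships/private-data"


def make_toy_package():
    """Build a tiny stand-in for a .docx: a main part plus an empty rels file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", "<doc/>")
        z.writestr(RELS_PATH, f'<Relationships xmlns="{RELS_NS}"/>')
    return buf.getvalue()


def add_private_part(package_bytes, part_name, part_xml, rel_id="rIdPrivate1"):
    """Return a copy of the package containing an extra part under word/,
    referenced from the main document's relationships."""
    # Serialise the relationships without an "ns0:" prefix.
    ET.register_namespace("", RELS_NS)
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == RELS_PATH:
                root = ET.fromstring(data)
                # Target is relative to the word/ directory of the source part.
                ET.SubElement(root, f"{{{RELS_NS}}}Relationship", {
                    "Id": rel_id,
                    "Type": CUSTOM_REL_TYPE,
                    "Target": part_name,
                })
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(item, data)
        # The new part itself is just another entry in the zip archive.
        dst.writestr("word/" + part_name, part_xml)
    return out.getvalue()


def has_part(package_bytes, full_name):
    """Check whether a part is (still) present in the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as z:
        return full_name in z.namelist()


extended = add_private_part(make_toy_package(), "private.xml", "<private/>")
print(has_part(extended, "word/private.xml"))
```

Running has_part() on a file after another application has opened and re-saved it is one way to check whether that application kept the part.
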

However, this isn't guaranteed if an implementation does an import/export, since usually any extra information gets lost on import and is no longer there by the time export occurs.

I've just done a test on this in fact, to see how different implementations handle it. I added an extra XML file to a package, and referenced it from the relationships graph. Under Word 2011 and Word 2013, this file was preserved after modification. Under LibreOffice Writer however, the file disappeared from the package after a save. I suspect this is due to the file being imported into either ODF or LibreOffice's own internal data model, and thus the extra information being missing on save (if any of the LO developers are reading this... perhaps you can comment here).

Ironically, the warning message LO displayed when I tried to save the file was 'This document may contain formatting or content that cannot be saved in the currently selected file format "Microsoft Word 2007/2010 XML". Use the default ODF file format to be sure that the document is saved correctly.' In fact, in this instance, the exact opposite is the case - the information *could* be saved in OOXML (if it were not previously lost on import), but could *not* be saved in ODF. I think this is a good example of why bidirectional transformation is so important for achieving true compatibility - since it means you *don't* lose information on save. The fact that it works in MS Office is possibly more luck than anything else, since Office wouldn't need to do an import.

The second way in which OOXML caters for foreign extensions is a set of XML elements which can be used to indicate how a consumer should treat content it doesn't know about. This is described in part 3 of the spec, "Markup Compatibility and Extensibility (MCE)".
Essentially this provides a way of saying to a consumer "hey, I've got this extra info in a custom format, and you should use that if you support the particular namespace; otherwise, here's some fallback content you can use instead". It also lets you say to the consumer "just ignore elements in this namespace if you don't support it".

Unfortunately, I don't believe there's any guarantee that these are preserved either. In the case of UX Write, where a piece of content is stored in multiple formats, it just throws away the ones it doesn't support (one of the few cases in which UX Write's .docx support is not fully bidirectional). This is something I should arguably fix, as useful information may potentially be lost. The only instance in which I've seen this used in practice though is where there's a new, proprietary feature introduced in a later version of Office; e.g. in Word 2010 or later if you draw a circle in your document, it will (and I'm not making this up) store two versions of the circle - one in a special Word 2010 namespace which is not defined in the OOXML spec, and another representation of the circle in the older VML format (which for some reason mainly consists of an "o:gfxdata" attribute containing binary data encoded in base 64 - but hey, at least it's in XML, right? ;)

To summarise, I think that storing private/extension information in a foreign file format should be considered unreliable, since implementations tend to differ a lot in their support for this. Therefore, one should only do so if there's no major consequence to losing that information. It also kind of goes against the idea of having a standard in the first place.

[1] http://www.joelonsoftware.com/items/2008/02/19.html

--
Dr. Peter M. Kelly
kelly...@gmail.com

http://www.kellypmk.net/

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)