On 3 Aug 2014, at 1:57 am, jan i <j...@apache.org> wrote:

> I too am on peter fast rolling waggon :-) but I am also confused.
> 
> @peter maybe you could explain a couple of things, for non-document
> specialists:
> 
> 1) Following your thought, with bidirectional editors. Why would an editor
> have a home format ?

There are two ways to view a format: (1) as a way of encoding information for 
storage or transmission, and (2) as an in-memory data structure used by the 
editor at runtime. In some programs these are two different things, and in 
others they are the same. The latter is true of web browsers - HTML is both the 
file format and the runtime data model; the W3C DOM APIs can be used to 
manipulate the HTML structure directly. I believe this was also true to a large 
extent with the binary formats used by older versions of MS Office, for 
purposes of efficiency [1].

I'm not familiar with the internals of OpenOffice - one thing I'd be very 
interested to know is whether it uses ODF for its in-memory representation of 
the document. Or are the runtime data structures different from the XML trees 
that one finds in an ODF package?

> Following your thought to the end, the editor would always save/read in the
> format, and things not supported in the format will be saved as private.

The issue of how to handle features not supported by the format is a tricky 
one. My initial view is that those features are best disabled if the user 
chooses to save in that format (or alternatively a warning message shown on 
save), since even if there were private extensions saved in the foreign format, 
they won't be supported in other apps, and are not guaranteed to be preserved 
(see further below).

> 2) When editing in format foo, one can expect that not all features are
> supported (like e.g. microsoft macros), these are handled as private
> containers.
> 
> But looking at LO there seems to be huge challenges when doing especially
> copy/paste operations ?

Yes, this is a very tricky problem. Even with a simple bidirectional 
transformation model, where you have a 1:1 mapping between elements in the 
concrete document and elements in the abstract document (concrete = original 
format, abstract = format used by the editor), it's not possible to know what 
should be done for elements that have been copied & pasted.

One approach would be to make the mapping 1:n, where if an element in the 
abstract (editable) document is copied & pasted one or more times, then its 
corresponding element in the concrete document is also duplicated at save time 
when the file is updated. However, this can potentially violate uniqueness 
constraints, e.g. if the element being copied is supposed to have a unique 
identifier, you can't just go making a direct copy of it, as you'd end up with 
two elements with the same identifier. That said, if the implementation were 
aware of such uniqueness constraints for specific elements, it could ensure 
these are still respected, even if it doesn't support any other aspects of the 
element (e.g. editing or rendering).
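
A minimal sketch of the 1:n idea in Python, purely for illustration. It assumes
the concrete document is an XML tree, that "id" is the only attribute with a
uniqueness constraint, and that the editor can report how many copies of each
original element now exist in the abstract document. The names
save_with_duplicates, usage_counts and UNIQUE_ATTRS, and the "genN" id scheme,
are all hypothetical.

```python
import copy
import itertools
import xml.etree.ElementTree as ET

# Assumption: attributes whose values must be unique across the document.
UNIQUE_ATTRS = {"id"}


def save_with_duplicates(concrete_root, usage_counts):
    """Duplicate concrete elements that were pasted in the abstract document.

    usage_counts maps an element's original id to the number of copies that
    now exist in the abstract document (missing or 1 means "not copied").
    The element's content is copied verbatim; only unique attributes change.
    """
    existing = {el.get("id") for el in concrete_root.iter() if el.get("id")}
    candidates = (f"gen{n}" for n in itertools.count(1))

    # Snapshot the parents first so the clones we insert are not revisited.
    for parent in list(concrete_root.iter()):
        for index, child in reversed(list(enumerate(parent))):
            extra = usage_counts.get(child.get("id"), 1) - 1
            for _ in range(extra):
                clone = copy.deepcopy(child)
                # Re-issue every unique attribute inside the copied subtree.
                for el in clone.iter():
                    for attr in UNIQUE_ATTRS & set(el.attrib):
                        new_id = next(c for c in candidates if c not in existing)
                        existing.add(new_id)
                        el.set(attr, new_id)
                parent.insert(index + 1, clone)
    return concrete_root
```

Anything that refers to the copied element by id (from elsewhere in the
document) would still point at the original, so a real implementation would
also need to know about such references.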

Cut & paste is much easier to handle though as it's equivalent to a move 
operation, which doesn't have any implications for uniqueness constraints.

> 3) If we save private info in .docx, how can we be sure that a microsoft
> editor does not destroy it ?
> 
> Does the standard contain some rules about keeping private information ?

Well, we can never be *completely* sure that a microsoft editor won't destroy 
something ;)

Having said that though, there are a couple of provisions for this. One is 
simply the ability to include extra files in the package, labeled with a 
particular namespace. Each OOXML package contains a "relationship graph", which 
is a separate data structure from the zip file's directory hierarchy, and is 
what OOXML uses to identify "parts" (files) within the package. In principle, 
there should be no problem with simply adding an extra part with whatever 
namespace you like, and that being preserved. However, this isn't guaranteed if 
an implementation does an import/export, since usually any extra information 
gets lost on import and is no longer there by the time export occurs.
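
As an illustration, adding such a part could look roughly like the Python
below. This is a sketch under assumptions, not a reference implementation: it
attaches the new part to the package-level relationships file (_rels/.rels),
registers a content type override for it, and uses a made-up relationship type
URI. The function name add_private_part and the example.org URI are
hypothetical; the two namespace URIs are the ones used by the Open Packaging
Conventions.

```python
import io
import itertools
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
# Hypothetical relationship type, invented for this example.
PRIVATE_REL_TYPE = "http://example.org/relationships/private-data"


def _serialize(root, default_ns):
    # Emit the given namespace as the default one (no "ns0:" prefixes).
    ET.register_namespace("", default_ns)
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


def add_private_part(package, part_name, payload):
    """Return a copy of the package (bytes) with one extra part attached."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src:
        with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
            for info in src.infolist():
                data = src.read(info.filename)
                if info.filename == "_rels/.rels":
                    rels = ET.fromstring(data)
                    used = {r.get("Id") for r in rels}
                    new_id = next(f"rId{n}" for n in itertools.count(1)
                                  if f"rId{n}" not in used)
                    # Reference the new part from the relationship graph.
                    ET.SubElement(rels, f"{{{REL_NS}}}Relationship",
                                  Id=new_id, Type=PRIVATE_REL_TYPE,
                                  Target=part_name)
                    data = _serialize(rels, REL_NS)
                elif info.filename == "[Content_Types].xml":
                    types = ET.fromstring(data)
                    ET.SubElement(types, f"{{{CT_NS}}}Override",
                                  PartName="/" + part_name,
                                  ContentType="application/xml")
                    data = _serialize(types, CT_NS)
                dst.writestr(info.filename, data)
            # Finally store the private part itself.
            dst.writestr(part_name, payload)
    return out.getvalue()
```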

I've just done a test on this in fact, to see how different implementations 
handle it. I added an extra XML file to a package, and referenced it from the 
relationships graph. Under Word 2011 and Word 2013, this file was preserved 
after modification. Under LibreOffice Writer however, the file disappeared from 
the package after a save. I suspect this is due to the file being imported into 
either ODF or LibreOffice's own internal data model, and thus the extra 
information being missing on save (if any of the LO developers are reading 
this... perhaps you can comment here).
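
A check of this kind can be automated. The following Python is a rough sketch
(not the procedure used for the test above) that compares a package before and
after a save and reports which parts and which package-level relationships
went missing. It only looks at _rels/.rels and compares zip entry names
literally, both of which are simplifying assumptions; the function names are
made up for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def root_relationships(package):
    """Return {(Type, Target)} from the package-level relationships part."""
    with zipfile.ZipFile(io.BytesIO(package)) as z:
        rels = ET.fromstring(z.read("_rels/.rels"))
    return {(r.get("Type"), r.get("Target"))
            for r in rels.iter(f"{{{REL_NS}}}Relationship")}


def part_names(package):
    """Return the set of entry names stored in the zip container."""
    with zipfile.ZipFile(io.BytesIO(package)) as z:
        return set(z.namelist())


def lost_on_round_trip(before, after):
    """Report what was present before a save but is missing afterwards."""
    return {
        "parts": part_names(before) - part_names(after),
        "relationships": root_relationships(before) - root_relationships(after),
    }
```

Both arguments are the raw bytes of a .docx file, e.g. the original file and
the copy written out by the application under test.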

Ironically the warning message LO displayed when I tried to save the file was 
'This document may contain formatting or content that cannot be saved in the 
currently selected file format "Microsoft Word 2007/2010 XML". Use the default 
ODF file format to be sure that the document is saved correctly'. In fact, in 
this instance, the exact opposite is the case - the information *could* be 
saved in OOXML (if it were not previously lost on import), but could *not* be 
saved in ODF. I think this is a good example of why bidirectional 
transformation is so important for achieving true compatibility - since it 
means you *don't* lose information on save. The fact that it works in MS Office 
is possibly more luck than anything else, since Word reads OOXML natively and 
wouldn't need to do an import.

The second way in which OOXML caters for foreign extensions is a set of XML 
elements which can be used to indicate how a consumer should treat content it 
doesn't know about. This is described in part 3 of the spec, "Markup 
Compatibility and Extensibility (MCE)". Essentially this provides a way of 
saying to a consumer "hey, I've got this extra info in a custom format, and you 
should use that if you support the particular namespace; otherwise, here's some 
fallback content you can use instead". It also lets you say to the consumer 
"just ignore elements in this namespace if you don't support it".

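To make the choice/fallback mechanism concrete, here is a small Python sketch
of how a consumer might resolve mc:AlternateContent blocks. It is illustrative
only: it assumes namespace prefixes are never rebound within the document,
drops any text around the AlternateContent element, and does not handle
mc:Ignorable. The function name resolve_alternate_content and its
supported_namespaces parameter are hypothetical; the element and attribute
names (AlternateContent, Choice, Fallback, Requires) and the namespace URI are
the ones MCE defines.

```python
import io
import xml.etree.ElementTree as ET

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"


def resolve_alternate_content(xml_bytes, supported_namespaces):
    """Replace each AlternateContent with the branch this consumer can use."""
    prefixes = {}   # prefix -> namespace URI (assumes prefixes are not rebound)
    root = None
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns", "start"))
    for event, item in events:
        if event == "start-ns":
            prefixes[item[0]] = item[1]
        elif root is None:
            root = item

    # Walk descendants before ancestors so nested blocks are resolved first.
    for parent in reversed(list(root.iter())):
        for index, child in reversed(list(enumerate(parent))):
            if child.tag != f"{{{MC_NS}}}AlternateContent":
                continue
            chosen = None
            for choice in child.findall(f"{{{MC_NS}}}Choice"):
                # Requires lists prefixes, which must map to namespaces we know.
                required = choice.get("Requires", "").split()
                if required and all(prefixes.get(p) in supported_namespaces
                                    for p in required):
                    chosen = choice
                    break
            if chosen is None:
                chosen = child.find(f"{{{MC_NS}}}Fallback")
            parent.remove(child)
            kept = list(chosen) if chosen is not None else []
            for offset, element in enumerate(kept):
                parent.insert(index + offset, element)
    return root
```

Note that a consumer written this way discards the branches it did not pick,
so they will be gone if the document is saved again.
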
Unfortunately however, I don't believe there's any guarantee that these are 
preserved either. In the case of UX Write, where there is a piece of content 
stored in multiple formats, it just throws away the ones it doesn't support 
(one of the few cases in which UX Write's .docx support is not fully 
bidirectional). This is something I should arguably fix, as potentially there 
may be useful information lost. The only instance I've seen it used in practice 
though is where there's a new, proprietary feature introduced in a later 
version of Office; e.g. in Word 2010 or later if you draw a circle in your 
document, it will (and I'm not making this up) store two versions of the circle 
- one in a special Word 2010 namespace which is not defined in the OOXML spec, and 
another representation of the circle in the older VML format (which for some 
reason mainly consists of a "o:gfxdata" attribute containing binary data 
encoded in base 64 - but hey, at least it's in XML, right? ;)

To summarise, I think that storing private/extension information in a foreign 
file format should be considered unreliable, since implementations tend to 
differ a lot on their support for this. Therefore, one should only do so if 
there's no major consequence to losing that information. It also kind of goes 
against the idea of having a standard in the first place.

[1] http://www.joelonsoftware.com/items/2008/02/19.html

--
Dr. Peter M. Kelly
kelly...@gmail.com
http://www.kellypmk.net/

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
