On 16 Aug 2014, at 5:26 am, Andrea Pescetti <pesce...@apache.org> wrote:
> On 15/08/2014 Peter Kelly wrote:
>> Those of you interested in OOXML may want to have a look at my own
>> implementation of (a subset of) the spec, which is part of a library
>> I've just made available as open source (license is ASLv2):
>> https://github.com/uxproductivity/DocFormats
>
> It's very interesting. I hope that in future it may become relevant to
> OpenOffice or to Apache at large.
>
>> The design is based on bidirectional transformation, as a way of
>> achieving non-destructive editing of foreign file formats. This permits
>> incremental implementation of a given spec without risking data loss due
>> to incomplete features, since unsupported features of a given file
>> format are left untouched on save.
>
> Does this mean that
> $ dfutil/dfutil filename.docx filename.html
> $ dfutil/dfutil filename.html filename2.docx
> should produce a "filename2.docx" that is quite similar to "filename.docx"?
> It is failing rather badly (invalid OOXML output in the second conversion,
> ZIP container clearly missing files and possible breaking order) in a simple
> test I did with a 1-page docx file.

I'm not surprised this is the first issue to come up :$ There's a *lot*
of knowledge I need to document for others; questions from you and
others are the best way to motivate me to get that written ;)

What's happening here is that when filename.html is produced in the
first step, each of its elements is given an id attribute containing a
numeric identifier that refers to a specific element in the source docx
file (specifically, the word/document.xml file within the package).
These numeric identifiers are generated during parsing, and correspond
to the position of the element in document order (so 1, 2, 3, etc.).
When you convert from HTML to .docx, dfutil uses the id attributes to
re-establish these relationships, so that it knows which elements in the
HTML file correspond to which elements in the .docx file.
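
As a rough illustration of numbering by document order (a sketch in
Python using the standard xml.etree module, not the DocFormats code; the
number_elements helper and the sample markup are made up for this
example), note how a node's number depends on everything that precedes
it:

```python
import xml.etree.ElementTree as ET

def number_elements(root):
    """Map each element to its 1-based position in document order.

    Hypothetical helper for illustration only; the real library
    assigns its identifiers while parsing word/document.xml.
    """
    return {elem: seq for seq, elem in enumerate(root.iter(), start=1)}

# Made-up stand-in for the document body.
body = ET.fromstring("<body><p>one</p><p>two</p></body>")
second_para = body[1]
before = number_elements(body)

# Insert a new paragraph ahead of the second one, then renumber.
body.insert(1, ET.Element("p"))
after = number_elements(body)

# The same node now carries a different sequence number.
print(before[second_para], after[second_para])
```

An id recorded against the first numbering therefore points at a
different node once the document has changed.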

The problem you encountered stems from the fact that this mapping is
only valid in specific circumstances - that is, when the .docx file
being updated is exactly the same as the one the HTML was generated
from. If this is not the case, then the identifier assigned to a given
node will be different whenever other nodes have been inserted before
it. So for example if you do the following:

    dfutil filename.docx filename.html
    # Modify filename.html
    dfutil filename.html filename.docx
    dfutil filename.html filename.docx

Then the third run will fail, because in the second the docx file will
have been updated based on the changes in the HTML, changing the
sequence numbers assigned to each node, so on the third run the mapping
will no longer be valid.

The conversion works on the assumption that the docx file is the same as
the original. The way that UX Write uses the library, it ensures this is
the case, but the library does not check for this (and yes, it should;
more on this below).

Your case is similar, though in this case you're creating a new docx
file, not updating an existing one. However what it actually does in
this case is to create an empty .docx file, and then "update" that based
on the HTML. In doing so, it assumes that the HTML does not contain any
mappings (that is, id attributes with the prefix "bdt"). Since the
filename.html you generated does, it tries to map these to elements in
the docx file, failing badly.

The only workaround for this at present is to manually edit the HTML
file and remove all id attributes. The quickest way to do this is with
the following command:

    sed -i '' -E 's/ id="word[0-9]+"//g' filename.html

Then, when you run dfutil, it will see that there is no mapping for any
of the elements in the HTML file, and thus avoid the problems in the
output you observed.
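
The -i '' form above is the BSD/macOS sed syntax, so on other systems
the command may need adjusting. A portable equivalent can be sketched in
Python; this is only an illustration, the strip_mapping_ids name is
invented here, and the "word" prefix is taken from the sed pattern above
rather than from the library itself:

```python
import re

def strip_mapping_ids(html, prefix="word"):
    """Remove id attributes of the form id="<prefix><digits>".

    Hypothetical helper mirroring the sed expression; the prefix is
    an assumption and may need changing to match the generated HTML.
    """
    pattern = re.compile(r'\s+id="' + re.escape(prefix) + r'[0-9]+"')
    return pattern.sub("", html)

# Made-up sample: two mapped elements and one unrelated id.
sample = '<p id="word12">Hello <b id="word13">world</b></p><a id="toc">x</a>'
cleaned = strip_mapping_ids(sample)
print(cleaned)
```

To apply it to a file, read filename.html into a string, pass it through
strip_mapping_ids, and write the result back. Ids that do not match the
prefix-plus-digits form are left alone.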

Now, onto the fix:

The library needs to have some way of checking that the HTML file being
used as part of an update operation has a mapping (id attributes) that
matches the docx file being updated (in the case of creating a new file,
this is just an empty docx file). In the event that this is not the
case, it could still do the update, but would act as if the entire
document had been replaced with a completely new one.

The solution I'll likely implement (and this should really be my first
task, given the potential for problems like the above) is this:

- Include a hash of the .docx file (or relevant parts of it) in the HTML
  file, e.g. as a meta element or as part of the prefix on all id
  attributes

- On update, re-compute the hash of the .docx file and compare it
  against the one stored in the HTML file (if any), and if there's no
  match, treat the HTML file as a complete replacement of all content

>
> What is the best channel to report issues?

--
Dr. Peter M. Kelly
Founder, UX Productivity
pe...@uxproductivity.com

http://www.uxproductivity.com/
http://www.kellypmk.net/

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)