Re: DocFormats - Open source OOXML implementation

Peter Kelly Fri, 15 Aug 2014 18:51:07 -0700

On 16 Aug 2014, at 5:26 am, Andrea Pescetti <pesce...@apache.org> wrote:

> On 15/08/2014 Peter Kelly wrote:
>> Those of you interested in OOXML may want to have a look at my own
>> implementation of (a subset of) the spec, which is part of a library
>> I've just made available as open source (license is ASLv2):
>> https://github.com/uxproductivity/DocFormats
> 
> It's very interesting. I hope that in future it may become relevant to 
> OpenOffice or to Apache at large.
> 
>> The design is based on bidirectional transformation, as a way of
>> achieving non-destructive editing of foreign file formats. This permits
>> incremental implementation of a given spec without risking data loss due
>> to incomplete features, since unsupported features of a given file
>> format are left untouched on save.
> 
> Does this mean that
> $ dfutil/dfutil filename.docx filename.html
> $ dfutil/dfutil filename.html filename2.docx
> should produce a "filename2.docx" that is quite similar to "filename.docx"? 
> It is failing rather badly (invalid OOXML output in the second conversion, 
> ZIP container clearly missing files and possible breaking order) in a simple 
> test I did with a 1-page docx file.

I'm not surprised this is the first issue to come up :$ There's a *lot* of 
knowledge I need to document for others; questions from you and others are the 
best way to motivate me to get that written ;)

What's happening here is that when the filename.html is produced in the first 
step, each of its elements contains an id attribute containing a numeric 
identifier that refers to a specific element in the source docx file 
(specifically, the word/document.xml file within the package). These numeric 
identifiers are generated during parsing, and correspond to the position of the 
element in document order (so 1, 2, 3, etc.). When you convert from HTML to 
.docx, it uses the id attributes to re-establish these relationships, so that 
it knows which elements in the HTML file correspond to which elements in the 
.docx file.
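
To make the numbering concrete, here is a minimal Python sketch (not the 
library's actual code; the function name and the toy XML are made up for 
illustration) that assigns 1, 2, 3, ... to elements in document order:

```python
import xml.etree.ElementTree as ET

def number_elements(root):
    """Map sequence number (1, 2, 3, ...) to element, in document order."""
    # root.iter() yields the root first, then its descendants depth-first,
    # which is the same as document order.
    return {seq: elem for seq, elem in enumerate(root.iter(), start=1)}

# Toy stand-in for word/document.xml; real documents are far larger.
doc = ET.fromstring("<body><p><r>Hello</r></p><p><r>World</r></p></body>")
ids = number_elements(doc)
for seq, elem in ids.items():
    print(seq, elem.tag)
```

Parsing the same file twice gives the same numbers, which is what lets the 
ids in the HTML be resolved back to elements in the docx.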

The problem you encountered stems from the fact that this mapping is only valid 
in specific circumstances - that is, when the .docx file being updated is 
exactly the same as the original. If this is not the case, then the identifier 
assigned to a given node will be different whenever other nodes have been 
inserted before it. So for example if you do the following:

dfutil filename.docx filename.html
# Modify filename.html
dfutil filename.html filename.docx
dfutil filename.html filename.docx

Then the third run will fail, because in the second run the docx file will have 
been updated based on the changes in the HTML, changing the sequence numbers 
assigned to each node, so that on the third run the mapping is no longer valid. 
The conversion works on the assumption that the docx file is the same as the 
original. The way that UX Write uses the library, it ensures this is the case, 
but the library does not check for this (and yes, it should; more on this 
below).
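
The effect can be seen in a small Python sketch (again only an illustration 
with invented names and toy XML, not what the library does internally): 
inserting one node shifts the number of every node after it, so an id 
recorded earlier now points at the wrong element.

```python
import xml.etree.ElementTree as ET

def sequence_numbers(root):
    """Number elements 1, 2, 3, ... in document order."""
    return {seq: elem for seq, elem in enumerate(root.iter(), start=1)}

original = ET.fromstring("<body><p>one</p><p>two</p></body>")
before = sequence_numbers(original)   # 1=body, 2=p(one), 3=p(two)

# Simulate an update that inserts a new paragraph at the start of the body.
new_p = ET.Element("p")
new_p.text = "zero"
original.insert(0, new_p)
after = sequence_numbers(original)    # 1=body, 2=p(zero), 3=p(one), 4=p(two)

stale_id = 3  # id stored in the HTML when it was generated
print(before[stale_id].text)  # the paragraph the HTML meant
print(after[stale_id].text)   # the paragraph the id resolves to now
```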

Your case is similar, though in this case you're creating a new docx file, not 
updating an existing one. However what it actually does in this case is to 
create an empty .docx file, and then "update" that based on the HTML. In doing 
so, it assumes that the HTML does not contain any mappings (that is, id 
attributes with the prefix "bdt"). Since the filename.html you generated does, 
it tries to map these to elements in the docx file, failing badly.

The only workaround for this at present is to manually edit the HTML file and 
remove all id attributes. The quickest way to do this is with the following 
command:

sed -i '' -E 's/ id="word[0-9]+"//g' filename.html

Then, when you run dfutil, it will see that there is no mapping for any of the 
elements in the HTML file, and thus avoid the problems in the output you 
observed.
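
Note that the sed invocation above uses the BSD/OS X form of -i. If that 
doesn't work on your system, the same substitution can be sketched in Python. 
This assumes, as the sed pattern does, that the ids look like id="word123"; 
the function name is just for illustration.

```python
import re

# Same pattern as the sed command: a space, then id="word<digits>".
MAPPING_ID = re.compile(r' id="word[0-9]+"')

def strip_mapping_ids(html):
    """Remove every mapping id attribute, leaving other attributes alone."""
    return MAPPING_ID.sub("", html)

sample = '<p id="word12">a</p><span id="word3" class="x">b</span><h1 id="toc">c</h1>'
print(strip_mapping_ids(sample))
```

To apply it to a file, read filename.html, pass the contents through 
strip_mapping_ids, and write the result back.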

Now, onto the fix:

The library needs to have some way of checking that the HTML file being used as 
part of an update operation has a mapping (id attributes) that matches the docx 
file being updated (in the case of creating a new file, this is just an empty 
docx file). In the event that this is not the case, it could still do the 
update, but would act as if the entire document had been replaced with a 
completely new one.

The solution I'll likely implement (and this should really be my first task, 
given the potential for problems like the above) is this:

- Include a hash of the .docx file (or relevant parts of it) in the HTML file, 
e.g. as a meta element or as part of the prefix on all id attributes
- On update, re-compute the hash of the .docx file and compare it against 
the one stored in the HTML file (if any), and if there's no match, treat the 
HTML file as a complete replacement of all content
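
A rough Python sketch of that check follows. None of this exists in the 
library yet: the meta element name, the choice of SHA-256, the decision to 
hash only word/document.xml, and all function names are assumptions made for 
the sake of the example.

```python
import hashlib
import io
import re
import zipfile

META_NAME = "docx-hash"  # hypothetical meta element name

def docx_hash(docx_bytes):
    """SHA-256 of word/document.xml inside the package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as pkg:
        return hashlib.sha256(pkg.read("word/document.xml")).hexdigest()

def stored_hash(html):
    """Return the hash recorded in the HTML, or None if there is none."""
    m = re.search(r'<meta name="%s" content="([0-9a-f]+)"' % META_NAME, html)
    return m.group(1) if m else None

def mapping_is_valid(docx_bytes, html):
    """True only if the HTML was generated from exactly this docx content."""
    return stored_hash(html) == docx_hash(docx_bytes)

def make_docx(body_xml):
    """Build a toy in-memory package holding only word/document.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as pkg:
        pkg.writestr("word/document.xml", body_xml)
    return buf.getvalue()

original = make_docx("<w:document>v1</w:document>")
modified = make_docx("<w:document>v2</w:document>")
html = '<html><head><meta name="%s" content="%s"></head></html>' % (
    META_NAME, docx_hash(original))

print(mapping_is_valid(original, html))             # same docx: keep the mapping
print(mapping_is_valid(modified, html))             # docx changed: full replacement
print(mapping_is_valid(original, "<html></html>"))  # no hash: full replacement
```

Hashing the document part rather than the whole ZIP file would keep the check 
stable if the package is merely re-compressed, though that choice is open.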

> 
> What is the best channel to report issues?

--
Dr. Peter M. Kelly
Founder, UX Productivity
pe...@uxproductivity.com
http://www.uxproductivity.com/
http://www.kellypmk.net/

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
