On Mon, 27 Apr 2015, Michael Nguyen wrote:
We convert MS-WORD documents to DOCX using LibreOffice (clunky), but some files are unreadable because they contain invalid UTF-8 characters in the XML that version 1.0 and 1.1 of XML do not like.

Your best long-term fix is to report the bug to Apache OpenOffice, get it fixed there, then wait for LibreOffice to accept the fix.

LibreOffice does not care, but we need to read these documents into POI. Short of disassembling the archive file and editing the appropriate XML files in the container, I was wondering if there was a way to edit the PackagePart data for the relevant bits (it's the word/document.xml this is occurring in most frequently). The PackagePart API makes it unclear how to read the XML into memory and edit, then re-write to the part.

Once you have a PackagePart, call getInputStream() to read the contents. Work your way through that, updating / fixing things as you go. Possibly use IOUtils to get the stream as a byte array. When done, call getOutputStream() and write the new contents into it, then save the overall package.

If you have invalid XML, you can't fix it at the XML level, because an XML parser will refuse to load it. You'll need to fix it at the byte level.
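
As a rough illustration of the byte-level step, here is a minimal sketch in plain Java (no POI needed to run it). It assumes the part is UTF-8 encoded and that the offending content is either malformed UTF-8 bytes or characters outside the XML 1.0 "Char" range; both are assumptions about your files, not something confirmed in this thread. The class and method names (XmlByteSanitizer, sanitize, isXml10Char) are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI: cleans the raw bytes of one XML part.
final class XmlByteSanitizer {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Assumes the input is meant to be UTF-8. Malformed byte sequences are
    // dropped by the decoder; characters XML 1.0 forbids are then filtered out.
    public static byte[] sanitize(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        String text;
        try {
            text = decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE, but the method declares it.
            throw new IllegalStateException(e);
        }
        StringBuilder clean = new StringBuilder(text.length());
        text.codePoints()
                .filter(XmlByteSanitizer::isXml10Char)
                .forEach(clean::appendCodePoint);
        return clean.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

In the flow described above, the input array would be the bytes read from the part's getInputStream() (for instance via IOUtils), and the returned array is what you would write to getOutputStream() before saving the package. Note this sketch silently deletes the bad characters rather than replacing them, and it does not touch numeric character references such as &#x1; which can be equally invalid, so check the result against a few of your real files.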

Nick

