So getInputStream() isn't working... Every value prints as ffff and the loop never terminates.

InputStream in = part.getInputStream();

char c = (char) in.read();

while(c != -1) {
    System.out.println(String.format("%04x", (int) c));
    c = (char) in.read();
}
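
A likely cause of the symptoms above, offered as a sketch rather than a confirmed diagnosis: InputStream.read() returns an int, either 0..255 or -1 at end of stream. Casting that to char turns -1 into 0xffff, which never compares equal to -1, so the loop cannot exit and prints ffff forever once the stream is exhausted. Keeping the result as an int avoids both problems. The class name ReadLoopSketch and the ByteArrayInputStream standing in for part.getInputStream() are assumptions made so the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadLoopSketch {
    // Returns one four-digit hex value per line for each byte in the stream.
    static String dumpHex(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b; // stay with int: read() yields 0..255, or -1 at end of stream
        while ((b = in.read()) != -1) {
            sb.append(String.format("%04x", b)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for part.getInputStream()
        InputStream in = new ByteArrayInputStream(new byte[] {0x3c, 0x3f});
        System.out.print(dumpHex(in));
    }
}
```

If ffff shows up from the very first read, the stream returned no data at all, which would be a separate issue from the cast.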

On Mon, Apr 27, 2015 at 11:49 AM Nick Burch <[email protected]> wrote:

> On Mon, 27 Apr 2015, Michael Nguyen wrote:
> > We convert MS-WORD documents to DOCX using LibreOffice (clunky), but
> > some files are unreadable because they contain invalid UTF-8 characters
> > in the XML that version 1.0 and 1.1 of XML do not like.
>
> Your best long term fix is to report the bug to Apache OpenOffice, get it
> fixed there, then wait for LibreOffice to accept the fix.
>
> > LibreOffice does not care, but we need to read these documents into POI.
> > Short of disassembling the archive file and editing the appropriate XML
> > files in the container, I was wondering if there was a way to edit the
> > PackagePart data for the relevant bits (it's the word/document.xml this
> > is occurring in most frequently).  The PackagePart API makes it unclear
> > how to read the XML into memory and edit, then re-write to the part.
>
> Once you have a PackagePart, call getInputStream() to read the contents.
> Work your way through that, updating / fixing things. Possibly use IOUtils
> to get the stream as a byte array. When done, call getOutputStream() and
> write the new contents into it, then save the overall package.
>
> If you have invalid XML, you can't fix it at the XML level; you'll need to
> fix it at the byte level.
>
> Nick
>
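
The byte-level approach described in the quoted reply might look roughly like the sketch below. It assumes the offending characters are ASCII control characters that XML 1.0 forbids (anything below 0x20 other than tab, LF and CR); the thread does not say which characters are actually involved, so that is a guess. In UTF-8 those byte values never occur inside a multi-byte sequence, so dropping them leaves the rest of the encoding intact. The names XmlByteScrubSketch, isForbiddenControl and scrub are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class XmlByteScrubSketch {
    // True for C0 control bytes that XML 1.0 does not allow as literal characters.
    static boolean isForbiddenControl(int b) {
        return b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
    }

    // Copies the stream into a byte array, skipping forbidden control bytes.
    static byte[] scrub(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int b; // int, not char, so that -1 is recognised as end of stream
        while ((b = in.read()) != -1) {
            if (!isForbiddenControl(b)) {
                out.write(b);
            }
        }
        return out.toByteArray();
    }
}
```

Following the steps in the reply, the result of scrub(part.getInputStream()) would then be written to the stream returned by part.getOutputStream() before saving the package; that wiring is not shown or tested here. Invalid UTF-8 sequences and numeric character references to forbidden characters are not handled by this sketch.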
