https://bz.apache.org/bugzilla/show_bug.cgi?id=60374

            Bug ID: 60374
           Summary: Extracting text from some older Word documents fails
                    with ArrayIndexOutOfBoundsException due to
                    unicode/non-unicode mismatch
           Product: POI
           Version: 3.16-dev
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HWPF
          Assignee: dev@poi.apache.org
          Reporter: dominik.stad...@gmx.at
  Target Milestone: ---

Created attachment 34447
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34447&action=edit
Sample file

The regression testing at
http://people.apache.org/~centic/poi_regression/reportsAll/ reports the
exception below for some files.

It seems the text pieces in these files are stored as non-Unicode, but the class
PieceDescriptor sets unicode = true. If I set unicode = false manually, text
extraction works for these documents as well.
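
To illustrate why a wrong flag can end in an out-of-bounds access, here is a
minimal sketch. It is not POI's actual code: the class PieceEncodingMismatch
and the method decodePiece are hypothetical, and ISO-8859-1 is only an assumed
stand-in for whatever 8-bit code page a real file uses. The point is that a
piece marked as Unicode is read with two bytes per character, so an 8-bit piece
read that way asks for twice as many bytes as the stream holds.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a text-piece reader; not POI's actual code.
class PieceEncodingMismatch {

    /** Decodes one text piece of charCount characters starting at byte offset start. */
    static String decodePiece(byte[] stream, int start, int charCount, boolean unicode) {
        // A Unicode piece takes two bytes per character, a non-Unicode piece one.
        int byteCount = unicode ? charCount * 2 : charCount;
        byte[] buf = new byte[byteCount];
        // Throws an IndexOutOfBoundsException if start + byteCount runs past the stream.
        System.arraycopy(stream, start, buf, 0, byteCount);
        // Assumption: ISO-8859-1 for 8-bit text; real files may use another code page.
        Charset cs = unicode ? StandardCharsets.UTF_16LE : StandardCharsets.ISO_8859_1;
        return new String(buf, cs);
    }

    public static void main(String[] args) {
        // An 11-character piece stored with one byte per character.
        byte[] stream = "Sample text".getBytes(StandardCharsets.ISO_8859_1);

        // Correct flag: reads 11 bytes.
        System.out.println(decodePiece(stream, 0, 11, false));

        try {
            // Wrong flag: asks for 22 bytes from an 11-byte stream.
            decodePiece(stream, 0, 11, true);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("Out of bounds with unicode = true");
        }
    }
}
```

With unicode = false the piece decodes cleanly; with unicode = true the copy
runs past the end of the stream, which matches the kind of failure reported
below.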


    public void testException() throws IOException, OpenXML4JException, XmlException {
        final POITextExtractor extractor = ExtractorFactory.createExtractor(
                POIDataSamples.getDocumentInstance().openResourceAsStream(
                        "cn.orthodox.www_divenbog_APRIL_30-APRIL.DOC"));

        // Check it gives text without error
        System.out.println(extractor.getText());

        extractor.close();
    }



java.lang.IllegalArgumentException: Error creating Scratchpad Extractor
        at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:197)
        at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:119)
        at o.a.p.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:276)
        at o.a.p.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:129)
        at o.a.p.stress.AbstractFileHandler.handleExtractingInternal(AbstractFileHandler.java:81)
        at o.a.p.stress.AbstractFileHandler.handleExtracting(AbstractFileHandler.java:60)
        at org.dstadler.commoncrawl.FileHandlingRunnable.run(FileHandlingRunnable.java:62)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor4560.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:192)
        ... 12 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
        at o.a.p.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:109)
        at o.a.p.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
        at o.a.p.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:68)
        at o.a.p.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:74)
        at o.a.p.extractor.OLE2ScratchpadExtractorFactory.createExtractor(OLE2ScratchpadExtractorFactory.java:62)
        ... 16 more
