Hong-Thai,
Thank you for running these tests. I suspect (mea culpa) that the increase
in PDF runtime exception failures was caused by PDFBOX-1803/TIKA-1233, which
was not fixed before 1.5 was cut.
I recently made major modifications to the metadata extraction components of
the PDFParser (TIKA-1232 and TIKA-1252). If you have time, would you mind
rerunning these tests with trunk on your test corpus? I'd be interested to see
if the temporary fix to TIKA-1233 lowers the number of PDF runtime exception
failures, and I'd be very interested to see if there are any surprises caused
by TIKA-1232 and TIKA-1252.
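In case it helps when comparing the runs, here is a minimal sketch (Python, standard library only) that tallies the per-format failure counts from a report laid out like the one quoted below and prints the change between two versions. The function names `parse_report` and `diff_versions` are my own placeholders, not part of Tika, and the sketch assumes the report is available as plain text with the same `=== TIKA x.y ===` and `- ext (n)` lines.

```python
import re

# Section header, e.g. "====== TIKA 1.4 ======"
SECTION = re.compile(r"^=+\s*TIKA\s+(\S+)\s*=+$")
# Per-format summary line, e.g. "- pdf (7)"
FORMAT = re.compile(r"^-\s*(\w+)\s*\((\d+)\)$")


def parse_report(text):
    """Return {version: {format: failed_count}} for a report in the assumed layout."""
    totals = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        section = SECTION.match(line)
        if section:
            current = totals.setdefault(section.group(1), {})
            continue
        fmt = FORMAT.match(line)
        if fmt and current is not None:
            current[fmt.group(1)] = int(fmt.group(2))
    return totals


def diff_versions(totals, old, new):
    """Return {format: new_count - old_count}; a format missing from a run counts as 0."""
    formats = sorted(set(totals[old]) | set(totals[new]))
    return {f: totals[new].get(f, 0) - totals[old].get(f, 0) for f in formats}


# A shortened excerpt of the report, for illustration only.
sample = """\
========================== TIKA 1.4 ========================================
- pdf (7)
* (1)
- rtf (13)
========================== TIKA 1.5 ========================================
- pdf (16)
- xlsx (28)
- rtf (39)
"""

totals = parse_report(sample)
deltas = diff_versions(totals, "1.4", "1.5")
for fmt, delta in deltas.items():
    print(f"{fmt}: {delta:+d}")
```

Feeding it the full report from a trunk run alongside the 1.5 one should make it easy to spot which formats moved; the per-exception sub-counts (the `* (n)` lines) are deliberately ignored here.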
Thank you!
Best,
Tim
-----Original Message-----
From: Hong-Thai Nguyen [mailto:[email protected]]
Sent: Monday, March 03, 2014 8:19 AM
To: [email protected]
Subject: Tika 1.5 vs 1.4 testing
Hi all,
I've run both versions on the same corpus. Here's the comparison:
||Tika||POI||PDFbox||Failed docs||
|1.4|3.9|1.8.1|92|
|1.5|3.10-beta2|1.8.4|182|
========================== TIKA 1.4 ========================================
- pdf (7)
* (1)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@4d39a96c
* (3)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
org.apache.tika.parser.ParserDecorator$1@4d39a96c
* (3)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unable to extract PDF content
- pptx (8)
* (7)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Error creating OOXML extractor
* (1)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@4db190a5
- doc (2)
* (2)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- ppt (40)
* (39)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
* (1)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- xls (9)
* (7)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
* (2)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- dwg (4)
* (4)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing version:
AC1014
- odp (2)
* (2)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
org.apache.tika.parser.ParserDecorator$1@7286f080
- rtf (13)
* (13)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@455a7af4
- pps (5)
* (5)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
========================== TIKA 1.5 ========================================
- pdf (16)
* (10)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@1e59efa5
* (3)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
org.apache.tika.parser.ParserDecorator$1@1e59efa5
* (3)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unable to extract PDF content
- pptx (19)
* (7)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Error creating OOXML extractor
* (12)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@2b195ebd
- doc (11)
* (9)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@7b796022
* (2)
com.polyspot.document.converter.ConversionException: org.xml.sax.SAXException:
Namespace http://www.w3.org/1999/xhtml not declared
- ppt (47)
* (46)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@7b796022
* (1)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
org.apache.tika.parser.ParserDecorator$1@7b796022
- xls (9)
* (7)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@7b796022
* (2)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
org.apache.tika.parser.ParserDecorator$1@7b796022
- xlsx (28)
* (28)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@2b195ebd
- dwg (4)
* (4)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing version:
AC1014
- odp (2)
* (2)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
org.apache.tika.parser.ParserDecorator$1@3dc15f75
- rtf (39)
* (35)
com.polyspot.document.converter.ConversionException: org.xml.sax.SAXException:
Namespace http://www.w3.org/1999/xhtml not declared
* (4)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@101e1163
- pps (7)
* (7)
com.polyspot.document.converter.ConversionException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.ParserDecorator$1@7b796022
Regards,
Hong-Thai