Hong-Thai,
  Thank you for running these tests.  I suspect (mea culpa) that the increase 
in PDF runtime exception failures was caused by PDFBOX-1803/TIKA-1233, which 
was not fixed before 1.5 was cut.
  I recently made major modifications to the metadata extraction components of 
the PDFParser (TIKA-1232 and TIKA-1252).  If you have time, would you mind 
rerunning these tests with trunk on your test corpus?  I'd be interested to see 
if the temporary fix to TIKA-1233 lowers the number of PDF runtime exception 
failures, and I'd be very interested to see if there are any surprises caused 
by 1232 and 1252.
  Thank you!

         Best,

               Tim


-----Original Message-----
From: Hong-Thai Nguyen [mailto:[email protected]] 
Sent: Monday, March 03, 2014 8:19 AM
To: [email protected]
Subject: Tika 1.5 vs 1.4 testing

Hi all,

I've run both versions on the same corpus. Here's the comparison:
||Tika||POI||PDFBox||Failed docs||
|1.4|3.9|1.8.1|92|
|1.5|3.10-beta2|1.8.4|182|

========================== TIKA 1.4 ========================================
                - pdf (7)
                               * (1) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@4d39a96c
                               * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@4d39a96c
                               * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unable to extract PDF content
                - pptx (8)
                               * (7) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Error creating OOXML extractor
                               * (1) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@4db190a5
                - doc (2)
                               * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
                - ppt (40)
                               * (39) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
                               * (1) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
                - xls (9)
                               * (7) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
                               * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
                - dwg (4)
                               * (4) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing version: 
AC1014
                - odp (2)
                               * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@7286f080
                - rtf (13)
                               * (13) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@455a7af4
                - pps (5)
                               * (5) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2

========================== TIKA 1.5 ========================================
                - pdf (16)
                               * (10) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@1e59efa5
                               * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@1e59efa5
                               * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unable to extract PDF content
                - pptx (19)
                               * (7) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Error creating OOXML extractor
                               * (12) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@2b195ebd
                - doc (11)
                               * (9) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@7b796022
                               * (2) 
com.polyspot.document.converter.ConversionException: org.xml.sax.SAXException: 
Namespace http://www.w3.org/1999/xhtml not declared
                - ppt (47)
                               * (46) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@7b796022
                               * (1) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@7b796022
                - xls (9)
                               * (7) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@7b796022
                               * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@7b796022
                - xlsx (28)
                               * (28) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@2b195ebd
                - dwg (4)
                               * (4) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing version: 
AC1014
                - odp (2)
                               * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@3dc15f75
                - rtf (39)
                               * (35) 
com.polyspot.document.converter.ConversionException: org.xml.sax.SAXException: 
Namespace http://www.w3.org/1999/xhtml not declared
                               * (4) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@101e1163
                - pps (7)
                               * (7) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@7b796022
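
To make the per-format change easier to see, here is a small Python sketch that computes the difference between the two listings above. It is not part of the test harness used for these runs; the function and variable names are illustrative, and the counts are copied by hand from the listings. Note that the 1.4 listing sums to 90 rather than the 92 in the table, so the two remaining failures are presumably in formats not listed here.

```python
# Per-format failure counts, copied by hand from the two listings above.
failures_1_4 = {"pdf": 7, "pptx": 8, "doc": 2, "ppt": 40, "xls": 9,
                "dwg": 4, "odp": 2, "rtf": 13, "pps": 5}
failures_1_5 = {"pdf": 16, "pptx": 19, "doc": 11, "ppt": 47, "xls": 9,
                "xlsx": 28, "dwg": 4, "odp": 2, "rtf": 39, "pps": 7}


def failure_deltas(old, new):
    """Return {format: new_count - old_count}; a missing format counts as 0."""
    formats = set(old) | set(new)
    return {fmt: new.get(fmt, 0) - old.get(fmt, 0) for fmt in formats}


deltas = failure_deltas(failures_1_4, failures_1_5)

# Print the formats with the largest increase first.
for fmt, delta in sorted(deltas.items(), key=lambda kv: (-kv[1], kv[0])):
    before = failures_1_4.get(fmt, 0)
    after = failures_1_5.get(fmt, 0)
    print(f"{fmt:5s} {before:3d} -> {after:3d}  ({delta:+d})")
```

By this tally the largest increases are in xlsx (0 to 28) and rtf (13 to 39), with pdf going from 7 to 16.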


Regards,

Hong-Thai
