Hello,

I think there's a bug in the ExtractingRequestHandler (Tika parser). Some Tika exceptions are not caught, and the handler returns a status of 0, indicating no problem with that content.

I had a look at the code (Solr 5.1, ExtractingDocumentLoader:221): only TikaException is caught and rethrown as a SolrException.
The problem is still present in Solr 5.5.
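
To illustrate the pattern described above, here is a minimal sketch. It uses hypothetical stand-in types (ParseException, HandlerException, DocumentParser), not the actual Solr or Tika classes, and where exactly the other exception gets swallowed is an assumption of the sketch. loadNarrow reports only the parser-specific exception as an error, while loadStrict shows the behaviour one might expect instead.

```java
import java.io.IOException;

public class ExtractStatusSketch {

    /** Hypothetical stand-in for a parser-specific exception such as TikaException. */
    static class ParseException extends Exception {
        ParseException(String message) { super(message); }
    }

    /** Hypothetical stand-in for the exception reported back to the client. */
    static class HandlerException extends RuntimeException {
        HandlerException(String message, Throwable cause) { super(message, cause); }
    }

    /** Hypothetical parser interface: parsing may fail with either exception type. */
    interface DocumentParser {
        void parse(String path) throws IOException, ParseException;
    }

    /** Only the parser-specific exception is turned into an error. */
    static int loadNarrow(DocumentParser parser, String path) {
        try {
            parser.parse(path);
        } catch (ParseException e) {
            throw new HandlerException("parse failed: " + path, e);
        } catch (IOException e) {
            // Assumption: the I/O failure is only logged, so the caller still gets 0.
            System.err.println("ignored: " + e);
        }
        return 0;
    }

    /** Any failure while parsing is reported back to the caller. */
    static int loadStrict(DocumentParser parser, String path) {
        try {
            parser.parse(path);
        } catch (ParseException | IOException e) {
            throw new HandlerException("parse failed: " + path, e);
        }
        return 0;
    }
}
```

With a parser that throws java.io.EOFException, loadNarrow still returns 0, whereas loadStrict throws HandlerException.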

Here are the two stack traces:

java.io.IOException:
ERROR - 2016-06-10 14:12:03.932; [ centreinffo] org.apache.pdfbox.filter.FlateFilter; FlateFilter: stop reading corrupt stream due to a DataFormatException
INFO - 2016-06-10 14:12:03.940; [ centreinffo] org.apache.solr.update.processor.LogUpdateProcessor; [centreinffo] webapp=/solr path=/update/extract params={fmap.content=contenuDocument&uprefix=tika_&literal.pk=document_Régionsetformation_280&wt=javabin&stream.file=/var/local/ci-services/documents/document_Régionsetformation_280&version=2} {add=[document_Régionsetformation_280 (1536759351407017984)]} 0 74
and java.io.EOFException:
ERROR - 2016-06-10 14:10:49.246; [ centreinffo] org.apache.fontbox.ttf.TrueTypeFont; An error occured when reading table hmtx
java.io.EOFException
    at org.apache.fontbox.ttf.MemoryTTFDataStream.readSignedShort(MemoryTTFDataStream.java:139)
    at org.apache.fontbox.ttf.HorizontalMetricsTable.initData(HorizontalMetricsTable.java:62)
    at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
    at org.apache.fontbox.ttf.TrueTypeFont.getHorizontalMetrics(TrueTypeFont.java:204)
    at org.apache.fontbox.ttf.TrueTypeFont.getAdvanceWidth(TrueTypeFont.java:346)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:677)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:411)
    at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:221)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
   ...
INFO - 2016-06-10 14:10:50.207; [ centreinffo] org.apache.solr.update.processor.LogUpdateProcessor; [centreinffo] webapp=/solr path=/update/extract params={fmap.content=contenuDocument&uprefix=tika_&literal.pk=document_Régionsetformation_600&wt=javabin&stream.file=/var/local/ci-services/documents/document_Régionsetformation_600&version=2} {add=[document_Régionsetformation_600 (1536759274020012032)]} 0 2061

Regards,
Gilbert Boyreau
