Hello,
I think there is a bug in the ExtractingRequestHandler (Tika
parser).
Some Tika exceptions are not caught, and the handler returns a status of 0,
indicating no problem with that content.
I took a look at the code (Solr 5.1, ExtractingDocumentLoader:221): only
TikaException is caught and rethrown as a SolrException.
The problem is still present in Solr 5.5.
Here are the two stack traces:
java.io.IOException:
ERROR - 2016-06-10 14:12:03.932; [ centreinffo]
org.apache.pdfbox.filter.FlateFilter; FlateFilter: stop reading
corrupt stream due to a DataFormatException
INFO - 2016-06-10 14:12:03.940; [ centreinffo]
org.apache.solr.update.processor.LogUpdateProcessor; [centreinffo]
webapp=/solr path=/update/extract
params={fmap.content=contenuDocument&uprefix=tika_&literal.pk=document_Régionsetformation_280&wt=javabin&stream.file=/var/local/ci-services/documents/document_Régionsetformation_280&version=2}
{add=[document_Régionsetformation_280 (1536759351407017984)]} 0 74
and java.io.EOFException:
ERROR - 2016-06-10 14:10:49.246; [ centreinffo]
org.apache.fontbox.ttf.TrueTypeFont; An error occured when reading
table hmtx
java.io.EOFException
at org.apache.fontbox.ttf.MemoryTTFDataStream.readSignedShort(MemoryTTFDataStream.java:139)
at org.apache.fontbox.ttf.HorizontalMetricsTable.initData(HorizontalMetricsTable.java:62)
at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
at org.apache.fontbox.ttf.TrueTypeFont.getHorizontalMetrics(TrueTypeFont.java:204)
at org.apache.fontbox.ttf.TrueTypeFont.getAdvanceWidth(TrueTypeFont.java:346)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:677)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:411)
at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:221)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
...
INFO - 2016-06-10 14:10:50.207; [ centreinffo]
org.apache.solr.update.processor.LogUpdateProcessor; [centreinffo]
webapp=/solr path=/update/extract
params={fmap.content=contenuDocument&uprefix=tika_&literal.pk=document_Régionsetformation_600&wt=javabin&stream.file=/var/local/ci-services/documents/document_Régionsetformation_600&version=2}
{add=[document_Régionsetformation_600 (1536759274020012032)]} 0 2061
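To make the symptom concrete, here is a minimal, self-contained sketch of the behaviour I mean. It is not the actual Solr or Tika code: ParseFailure, Parser, narrowStatus and strictStatus are hypothetical stand-ins, the 500 status is an assumed error code, and where the I/O error is really swallowed is not established by the logs above. The sketch only shows how an error that is logged but not propagated leaves the status at 0, and how a broader handler would report it.

```java
import java.io.EOFException;
import java.io.IOException;

public class ExtractStatusSketch {

    /** Hypothetical stand-in for a parser-specific exception such as TikaException. */
    public static class ParseFailure extends Exception {
        public ParseFailure(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for the parse call made by the document loader. */
    public interface Parser {
        void parse() throws ParseFailure, IOException;
    }

    /** Only ParseFailure changes the status; an I/O error is logged and dropped. */
    public static int narrowStatus(Parser parser) {
        try {
            parser.parse();
        } catch (ParseFailure e) {
            return 500; // assumed error status for this sketch
        } catch (IOException e) {
            // Assumption: the error is logged somewhere, but never reported to the caller.
            System.err.println("ERROR - " + e);
        }
        return 0; // the caller sees "no problem"
    }

    /** Both kinds of failure are reported to the caller. */
    public static int strictStatus(Parser parser) {
        try {
            parser.parse();
            return 0;
        } catch (ParseFailure | IOException e) {
            return 500; // assumed error status for this sketch
        }
    }

    public static void main(String[] args) {
        // Simulates a corrupt document whose font table ends too early.
        Parser truncatedFont = () -> {
            throw new EOFException("table hmtx ended early");
        };
        System.out.println("narrow handling: status=" + narrowStatus(truncatedFont));
        System.out.println("strict handling: status=" + strictStatus(truncatedFont));
    }
}
```

With the narrow handling the client cannot tell a corrupt document from a good one, which is what I see with /update/extract.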
Regards,
Gilbert Boyreau