[ https://issues.apache.org/jira/browse/TIKA-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-2579:
------------------------------
    Summary: Update to PDFBox 2.0.9 when available  (was: Update to PDFBox 2.0.9)

> Update to PDFBox 2.0.9 when available
> -------------------------------------
>
>                 Key: TIKA-2579
>                 URL: https://issues.apache.org/jira/browse/TIKA-2579
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.17
>            Reporter: David Pilato
>            Priority: Major
>
> Hey team
>
> We got this report in elasticsearch ingest attachment project:
> [https://github.com/elastic/elasticsearch/issues/27198]
> Basically when a font is not available PDFBox is throwing an exception like
> {{2017/10/31 00:01:13.348 [WARN ] [elasticsearch[test][bulk][T#3]] [FontManager] Font not found: TimesNewRomanPS-BoldMT
> 2017/10/31 00:01:13.413 [ERROR] [elasticsearch[test][bulk][T#3]] [TrueTypeFont] An error occured when reading table cmap
> java.io.IOException: CMap subtype 14 not yet implemented
>     at org.apache.fontbox.ttf.CMAPEncodingEntry.processSubtype14(CMAPEncodingEntry.java:304)
>     at org.apache.fontbox.ttf.CMAPEncodingEntry.initSubtable(CMAPEncodingEntry.java:114)
>     at org.apache.fontbox.ttf.CMAPTable.initData(CMAPTable.java:100)
>     at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
>     at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128)
>     at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80)
>     at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109)
>     at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
>     at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84)
>     at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632)
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673)
>     at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
>     at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:533)
>     at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
>     at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:458)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.Tika.parseToString(Tika.java:537)}}
> This might have been solved by PDFParser with
> https://issues.apache.org/jira/browse/PDFBOX-3997 which is available in
> PDFBox 2.0.9 but Tika 1.17 is still using 2.0.8. See related issue
> https://issues.apache.org/jira/browse/PDFBOX-3985. Unclear if that will
> actually fix the problem reported but FWIW upgrading to 2.0.9 of PDFBox could
> be useful.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)