[ https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758867#comment-15758867 ]
David Pilato edited comment on TIKA-2208 at 12/18/16 1:50 PM:
--------------------------------------------------------------

So we now have a regression in the Elasticsearch tests. Our tests check that the Tika test files are parsed correctly. For that we are using a subset of https://github.com/apache/tika/tree/master/tika-parsers/src/test/resources/test-documents

Before we excluded {{x-tika-ooxml}}, we were able to parse the {{testPPT.potm}} file. After applying the exclusion, the document comes back empty.

Before the change, the following text was extracted:

{code}
Attachment Test Rajiv This is a test file data with the same content as every other file being tested for tika content parsing. This has been developed by Rajiv Kumar Nistala. Different words to test against Quest Hello Watershed Avalanche Black Panther Mystery Banking Investment
{code}

I think I'm just going to add the missing libraries, as I don't think I can exclude only the Visio content, right?

> Catch missing libraries
> -----------------------
>
>                 Key: TIKA-2208
>                 URL: https://issues.apache.org/jira/browse/TIKA-2208
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: David Pilato
>
> Hi there
> We have decided to remove support for some formats when using Tika to extract
> text and metadata.
> We defined our list of Parsers:
> {code:java}
> private static final Parser[] PARSERS = new Parser[] {
>     // documents
>     new org.apache.tika.parser.html.HtmlParser(),
>     new org.apache.tika.parser.rtf.RTFParser(),
>     new org.apache.tika.parser.pdf.PDFParser(),
>     new org.apache.tika.parser.txt.TXTParser(),
>     new org.apache.tika.parser.microsoft.OfficeParser(),
>     new org.apache.tika.parser.microsoft.OldExcelParser(),
>     new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(),
>     new org.apache.tika.parser.odf.OpenDocumentParser(),
>     new org.apache.tika.parser.iwork.IWorkPackageParser(),
>     new org.apache.tika.parser.xml.DcXMLParser(),
>     new org.apache.tika.parser.epub.EpubParser(),
> };
> private static final AutoDetectParser PARSER_INSTANCE = new AutoDetectParser(PARSERS);
> private static final Tika TIKA_INSTANCE = new Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);
> {code}
> But when an MS Office Word document embeds another, unsupported document
> (like a Visio schema), a {{NoClassDefFoundError}} is raised.
> Would it be possible to catch such a case and throw a {{TikaException}}
> instead, so that it behaves as an Exception and not as a Throwable?
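
The behaviour requested in the issue description (a missing parser library surfacing as a checked exception rather than an {{Error}}) could be approximated on the caller's side roughly as follows. This is only a sketch under stated assumptions, not Tika's code: {{ExtractionException}} is a hypothetical stand-in for {{TikaException}}, and {{SafeExtract}} and its {{Callable}}-based {{extract}} method are made-up names, chosen so that the example runs without Tika on the classpath.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Tika.
final class SafeExtract {

    /** Hypothetical stand-in for org.apache.tika.exception.TikaException. */
    static class ExtractionException extends Exception {
        ExtractionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs the given extraction and converts a NoClassDefFoundError
     * (for example, a parser dependency that was excluded from the build)
     * into a checked exception that ordinary error handling can catch.
     */
    static String extract(Callable<String> extraction) throws ExtractionException {
        try {
            return extraction.call();
        } catch (NoClassDefFoundError e) {
            // An Error is not caught by "catch (Exception e)", so handle it explicitly.
            throw new ExtractionException("Missing parser dependency: " + e.getMessage(), e);
        } catch (ExtractionException e) {
            throw e; // already the right type, do not wrap twice
        } catch (Exception e) {
            throw new ExtractionException("Extraction failed", e);
        }
    }

    public static void main(String[] args) {
        try {
            // Simulates an embedded document whose parser classes are not on the classpath.
            extract(() -> {
                throw new NoClassDefFoundError("hypothetical/MissingVisioParser");
            });
        } catch (ExtractionException e) {
            System.out.println("Handled as an exception: " + e.getMessage());
        }
    }
}
```

With Tika itself, the lambda would presumably wrap the real parse call. Catching only {{NoClassDefFoundError}}, rather than every {{Throwable}}, keeps errors such as {{OutOfMemoryError}} propagating unchanged.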