[
https://issues.apache.org/jira/browse/SOLR-11142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eric Pugh resolved SOLR-11142.
------------------------------
Resolution: Won't Fix
In Solr 10 we are leveraging either Tika Server (running in its own separate
server process) or maybe Tika Pipes (again, running in a separate JVM).
Please revalidate your issue against Solr 10 with one of those options, and if
the issue is still present, I am happy to work with you on a fix using the new
approach for Tika.
> NotOLE2FileException when adding MSG files with attachments
> -----------------------------------------------------------
>
> Key: SOLR-11142
> URL: https://issues.apache.org/jira/browse/SOLR-11142
> Project: Solr
> Issue Type: Bug
> Components: contrib - Solr Cell (Tika extraction)
> Affects Versions: 5.5.1, 6.6.5, 7.4
> Environment: Not platform related
> Reporter: Olivier Masseau
> Priority: Major
> Labels: doc, msg, office, parser, tika, word
> Attachments: test.msg
>
>
> When adding MSG files which have attachments we systematically get this error:
> {code:java}
> ERROR (qtp1013423070-16) [ x:default] o.a.s.s.HttpSolrCall null:org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x0A1A0A0D474E5089, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
>     at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:162)
>     at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:112)
>     at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:302)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:111)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:103)
>     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:129)
>     at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:238)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:2082)
>     at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:651)
>     at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:458)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:229)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:184)
>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>     at org.eclipse.jetty.server.Server.handle(Server.java:499)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>     at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
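For what it is worth, the two signatures in the error message can be decoded by hand. The value is the first eight bytes of the file read as a little-endian long, so writing it back out in little-endian order recovers the bytes on disk. A minimal sketch in plain Java; the class and method names are made up for illustration and are not Solr, Tika, or POI API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class HeaderSignature {
    // Turn the long printed in the exception back into the eight bytes on disk.
    static byte[] toFileBytes(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    // Format bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values copied from the NotOLE2FileException message.
        System.out.println("read:     " + toHex(toFileBytes(0x0A1A0A0D474E5089L)));
        System.out.println("expected: " + toHex(toFileBytes(0xE11AB1A1E011CFD0L)));
    }
}
```

The "read" value comes out as 89 50 4E 47 0D 0A 1A 0A, which is the PNG file signature, so the attachment that failed here was most likely a PNG image handed to the OLE2 parser. The "expected" value is the OLE2 magic number D0 CF 11 E0 A1 B1 1A E1.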
> After inspecting the Solr code, the problem seems to come from here.
> In the ExtractingDocumentLoader class we have:
> {code:java}
> context.set(Parser.class, parser);
> {code}
> In our case the parser is an instance of OfficeParser.
> When processing an MSG file, the OutlookExtractor class is used by the
> OfficeParser.
> To process the attachments of the MSG file, the OutlookExtractor calls the
> ParsingEmbeddedDocumentExtractor.
> To parse an attachment, the ParsingEmbeddedDocumentExtractor uses the
> DelegatingParser.
> The DelegatingParser determines the parser to use by just looking at the
> parser set in the context.
> {code:java}
> protected Parser getDelegateParser(ParseContext context) {
>     return context.get(Parser.class, EmptyParser.INSTANCE);
> }
> {code}
> So in our case this means that every attachment will be processed with the
> OfficeParser, even if the attachment is not an MS Office document.
> To make it work correctly, an AutoDetectParser should be set in the context
> when working with MSG files:
> {code:java}
> context.set(Parser.class, new AutoDetectParser());
> {code}
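The delegation described in the report can be modelled without Tika on the classpath. The sketch below is only an illustration under stated assumptions: `Context`, `Parser`, `OFFICE_ONLY`, `DETECTING` and `parseEmbedded` are hypothetical stand-ins for ParseContext, Parser, OfficeParser, AutoDetectParser and the embedded-document path, not the real classes. It shows that whatever parser is stored in the context ends up handling every attachment.

```java
import java.util.HashMap;
import java.util.Map;

public class DelegateLookupSketch {
    // Stand-in for a Tika Parser: takes the attachment's leading bytes, returns a label.
    interface Parser {
        String parse(byte[] header);
    }

    // Stand-in for ParseContext: a map from class to instance.
    static class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    // True when the bytes start with the OLE2 magic number D0 CF 11 E0.
    static boolean isOle2(byte[] h) {
        return h.length >= 4
                && (h[0] & 0xFF) == 0xD0 && (h[1] & 0xFF) == 0xCF
                && (h[2] & 0xFF) == 0x11 && (h[3] & 0xFF) == 0xE0;
    }

    // Only understands OLE2 input, like the OfficeParser in the trace.
    static final Parser OFFICE_ONLY = header -> {
        if (!isOle2(header)) {
            throw new IllegalStateException("Invalid header signature");
        }
        return "office";
    };

    // Chooses a handler from the content, as a detecting parser would.
    static final Parser DETECTING = header -> isOle2(header) ? "office" : "other";

    // Fallback used when nothing is registered in the context.
    static final Parser EMPTY = header -> "";

    // Mirrors getDelegateParser: whatever is in the context handles the attachment.
    static String parseEmbedded(Context context, byte[] header) {
        return context.get(Parser.class, EMPTY).parse(header);
    }

    public static void main(String[] args) {
        byte[] png = {(byte) 0x89, 0x50, 0x4E, 0x47};
        Context context = new Context();

        // The reported setup: the outer office parser is also the delegate.
        context.set(Parser.class, OFFICE_ONLY);
        try {
            parseEmbedded(context, png);
        } catch (IllegalStateException e) {
            System.out.println("office-only delegate failed: " + e.getMessage());
        }

        // The proposed setup: a detecting parser is the delegate.
        context.set(Parser.class, DETECTING);
        System.out.println("detecting delegate returned: " + parseEmbedded(context, png));
    }
}
```

With the office-only delegate the PNG-like attachment is rejected, while the detecting delegate routes it elsewhere, which is the behaviour the reporter expects from setting an AutoDetectParser in the context.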