[
https://issues.apache.org/jira/browse/TIKA-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452722#comment-17452722
]
matcha007 commented on TIKA-3526:
---------------------------------
This bug may be caused by the Office version. When I opened the Word and Excel
files with a zip tool, I found that the embedded docx, pptx and xlsx files are
stored under the name "package". The code, however, looks for "Package".
!image-2021-12-03-11-04-38-478.png|width=532,height=98!
!image-2021-12-03-11-05-51-182.png|width=531,height=89!
!image-2021-12-03-11-06-44-697.png|width=526,height=138!
!image-2021-12-03-11-07-33-659.png|width=522,height=105!
debug AbstractPOIFSExtractor.class:111
!image-2021-12-03-11-11-29-649.png|width=513,height=200!
debug AbstractOOXMLExtractor.class:231
!image-2021-12-03-11-15-51-328.png|width=512,height=267!
Therefore, I temporarily worked around the bug with the following changes.
{code:java}
package org.apache.tika.parser.microsoft;

abstract class AbstractPOIFSExtractor {
    ......
    protected void handleEmbeddedOfficeDoc(DirectoryEntry dir, String resourceName,
            XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException {
        Ole10Native ole;
        // Accept both spellings: the entry may be stored as "package".
        Entry ooxml = dir.hasEntry("Package") ? dir.getEntry("Package")
                : (dir.hasEntry("package") ? dir.getEntry("package") : null);
        if (null != ooxml) {
            ......
        }
    }
    ......
}
{code}
{code:java}
package org.apache.tika.parser.microsoft.ooxml;

public abstract class AbstractOOXMLExtractor implements OOXMLExtractor {
    ......
    private void handleEmbeddedOLE(PackagePart part, ContentHandler handler, String rel,
            Metadata parentMetadata) throws IOException, SAXException {
        ......
        // Also accept a lower-case "package" entry.
        if ((root.hasEntry("\u0001Ole") && root.hasEntry("\u0001CompObj")
                && (root.hasEntry("CONTENTS") || root.hasEntry("Package")))
                || root.hasEntry("package")) {
            ......
        }
        ......
    }
}
{code}
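The two patches above each add the second spelling by hand. A more general
option might be a case-insensitive lookup of the entry name. The following is
only a minimal sketch under stated assumptions: EntryNameLookup and
findIgnoreCase are hypothetical names, not part of Tika or POI, and the entry
names are assumed to be available as plain strings.

```java
import java.util.Optional;

// Hypothetical helper (not part of Tika or POI): finds the stored spelling
// of an entry name without caring about upper or lower case.
final class EntryNameLookup {
    private EntryNameLookup() {
    }

    /**
     * Returns the first name in entryNames that equals wanted when case is
     * ignored, or an empty Optional if there is no such name.
     */
    static Optional<String> findIgnoreCase(Iterable<String> entryNames, String wanted) {
        for (String name : entryNames) {
            if (name.equalsIgnoreCase(wanted)) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }
}
```

With the entry names "\u0001Ole", "\u0001CompObj" and "package", a lookup for
"Package" returns the stored name "package", which could then be passed to
getEntry instead of a hard-coded spelling.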
> i cant extract content from attachments in the document
> -------------------------------------------------------
>
> Key: TIKA-3526
> URL: https://issues.apache.org/jira/browse/TIKA-3526
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.20
> Reporter: matcha007
> Priority: Major
> Attachments: TIKA-3526.pptx, embedded attachment.doc, embedded
> attachment.docx, embedded attachment.ppt, embedded attachment.pptx, embedded
> attachment.xls, embedded attachment.xlsx, image-2021-12-03-11-04-38-478.png,
> image-2021-12-03-11-05-51-182.png, image-2021-12-03-11-06-44-697.png,
> image-2021-12-03-11-07-33-659.png, image-2021-12-03-11-11-29-649.png,
> image-2021-12-03-11-15-51-328.png
>
>
> Office documents can contain other Office documents as attachments. Can the
> contents of those attachments be extracted? Our results are shown in the
> table below (rows: attachment type; columns: type of the containing document).
>
> | |doc|docx|xls|xlsx|ppt|pptx|
> |txt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pdf|(/)|(/)|(/)|(/)|(x)|(/)|
> |xml|(/)|(/)|(/)|(/)|(x)|(/)|
> |doc|(/)|(/)|(/)|(/)|(x)|(/)|
> |docx|(x)|(/)|(/)|(/)|(x)|(/)|
> |xls|(/)|(/)|(/)|(/)|(x)|(/)|
> |xlsx|(/)|(/)|(x)|(x)|(x)|(x)|
> |ppt|(/)|(/)|(/)|(/)|(x)|(/)|
> |pptx|(/)|(/)|(/)|(/)|(x)|(/)|
>
> 1. If we are using the API incorrectly, please show us the correct usage.
> {code:java}
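> File file = new File("XX");
> Parser parser = new OfficeParser();
> ParseContext context = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(HttpHeaders.CONTENT_ENCODING, "GB18030");
> metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
> // inputStream and handler were not shown in the report; assumed here.
> ContentHandler handler = new BodyContentHandler();
> try (InputStream inputStream = new FileInputStream(file)) {
>     parser.parse(inputStream, handler, metadata, context);
> }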
> {code}
>
> 2. We use Tika 1.20. We have also tried the latest version, 2.0, and the
> problem still exists.
>
> 3. If this is indeed a gap in the current version, please fix it in a
> later version.
>