[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519774#comment-17519774 ]
Tim Allison commented on TIKA-3711:
-----------------------------------
In reviewing the commit above, I found quite a few places where outputHtml
should have been false. I've fixed those.
The general need remained, though: users should be able to turn off the
reporting of embedded file names in the handler's content. I've now made the
EmbeddedDocumentExtractor configurable from tika-config.xml.
Here is an example configuration that turns that reporting off:
{code}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
  </parsers>
  <autoDetectParserConfig>
    <embeddedDocumentExtractorFactory class="org.apache.tika.extractor.ParsingEmbeddedDocumentExtractorFactory">
      <params>
        <writeFileNameToContent>false</writeFileNameToContent>
      </params>
    </embeddedDocumentExtractorFactory>
  </autoDetectParserConfig>
</properties>
{code}
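Before pointing Tika at a file like this, it can help to confirm that the XML is well-formed and that the flag reads as intended. The sketch below does only that, using the JDK's built-in DOM parser. It does not call Tika at all, and the class name TikaConfigCheck and the method readWriteFileNameToContent are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: a sanity check for a tika-config.xml fragment.
// It only inspects the XML; it does not load or run Tika.
public class TikaConfigCheck {

    // Returns the trimmed text of the first <writeFileNameToContent> element,
    // or null if the config does not contain one.
    static String readWriteFileNameToContent(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("writeFileNameToContent");
        if (nodes.getLength() == 0) {
            return null;
        }
        return nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Same structure as the configuration shown in the comment above.
        String xml = String.join("\n",
            "<properties>",
            "  <parsers>",
            "    <parser class=\"org.apache.tika.parser.DefaultParser\"/>",
            "  </parsers>",
            "  <autoDetectParserConfig>",
            "    <embeddedDocumentExtractorFactory class=\"org.apache.tika.extractor.ParsingEmbeddedDocumentExtractorFactory\">",
            "      <params>",
            "        <writeFileNameToContent>false</writeFileNameToContent>",
            "      </params>",
            "    </embeddedDocumentExtractorFactory>",
            "  </autoDetectParserConfig>",
            "</properties>");

        System.out.println("writeFileNameToContent = " + readWriteFileNameToContent(xml));
    }
}
```

A malformed file makes the DOM parser throw a SAXException, so a typo such as an unclosed tag shows up here rather than as a Tika startup error. A null return means the element is missing, in which case Tika would presumably fall back to its default behavior.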
> Image file names included in parsed Word Document text
> ------------------------------------------------------
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.3.0
> Reporter: Sam Stephens
> Priority: Major
> Attachments: word-doc-with-image-from-word-365.docx,
> word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>