[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519774#comment-17519774
 ] 

Tim Allison commented on TIKA-3711:
-----------------------------------

While reviewing the commit above, I found quite a few places where outputHtml 
should have been false.  I've fixed those. 

The general need still remained, though: users should be able to turn off the 
reporting of embedded file names in the handler's content.  I've now made the 
EmbeddedDocumentExtractor configurable from tika-config.xml.

This is an example of how to do that:
{code}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
  </parsers>
  <autoDetectParserConfig>
    <embeddedDocumentExtractorFactory class="org.apache.tika.extractor.ParsingEmbeddedDocumentExtractorFactory">
      <params>
        <writeFileNameToContent>false</writeFileNameToContent>
      </params>
    </embeddedDocumentExtractorFactory>
  </autoDetectParserConfig>
</properties>
{code}
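
Before handing a file like this to Tika, it can be worth confirming that it is 
well-formed and that the flag sits where the example above puts it. The 
following is only a minimal sketch using the JDK's built-in XML and XPath 
support. It does not exercise Tika itself, and the class and method names 
(TikaConfigCheck, readWriteFileNameFlag) are hypothetical, chosen for 
illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: reads the flag from a config string.
public class TikaConfigCheck {

    // Location of the flag, following the element nesting in the example above.
    static final String FLAG_PATH =
            "/properties/autoDetectParserConfig/embeddedDocumentExtractorFactory"
            + "/params/writeFileNameToContent";

    // Returns the trimmed text of the flag element, or "" if it is absent.
    // Throws if the XML is not well-formed.
    static String readWriteFileNameFlag(String configXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(configXml)));
        return XPathFactory.newInstance().newXPath().evaluate(FLAG_PATH, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // Same structure as the tika-config.xml example above.
        String xml =
                "<properties>"
                + "<parsers><parser class=\"org.apache.tika.parser.DefaultParser\"/></parsers>"
                + "<autoDetectParserConfig>"
                + "<embeddedDocumentExtractorFactory"
                + " class=\"org.apache.tika.extractor.ParsingEmbeddedDocumentExtractorFactory\">"
                + "<params><writeFileNameToContent>false</writeFileNameToContent></params>"
                + "</embeddedDocumentExtractorFactory>"
                + "</autoDetectParserConfig>"
                + "</properties>";
        System.out.println("writeFileNameToContent = " + readWriteFileNameFlag(xml));
    }
}
```

An empty result means the element was not found at that path, which usually 
points to a misplaced or misspelled tag in the config file.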


> Image file names included in parsed Word Document text
> ------------------------------------------------------
>
>                 Key: TIKA-3711
>                 URL: https://issues.apache.org/jira/browse/TIKA-3711
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.3.0
>            Reporter: Sam Stephens
>            Priority: Major
>         Attachments: word-doc-with-image-from-word-365.docx, 
> word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it 
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it 
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
