[ 
https://issues.apache.org/jira/browse/TIKA-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527822#comment-17527822
 ] 

Ross Johnson commented on TIKA-3732:
------------------------------------

I took a quick look at the attached file in a hex editor and can confirm that 
it is indeed an RTF file despite the file extension being .DOC. It appears that 
Tika is detecting the type correctly.
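
For anyone who wants to repeat the check without a hex editor, here is a
minimal sketch of the same idea in plain Java. It relies on two well-known
signatures: RTF files begin with the ASCII bytes "{\rtf", while binary Word
.doc files are OLE2 containers that begin with D0 CF 11 E0 A1 B1 1A E1. The
class and method names (RtfSniffer, looksLikeRtf, looksLikeOle2) are made up
for illustration and are not part of Tika, and this is not how Tika's own
detector is implemented.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: inspects the first bytes of a file.
public class RtfSniffer {
    // RTF documents start with the ASCII bytes "{\rtf".
    private static final byte[] RTF_MAGIC =
            "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    // OLE2 compound files (binary .doc, .xls, .ppt) start with these 8 bytes.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if 'header' begins with every byte of 'magic'.
    private static boolean startsWith(byte[] header, byte[] magic) {
        return header.length >= magic.length
                && Arrays.equals(Arrays.copyOf(header, magic.length), magic);
    }

    static boolean looksLikeRtf(byte[] header) {
        return startsWith(header, RTF_MAGIC);
    }

    static boolean looksLikeOle2(byte[] header) {
        return startsWith(header, OLE2_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the attachment name from the issue; pass a path to override.
        Path file = Paths.get(args.length > 0 ? args[0] : "example.DOC");
        byte[] header;
        try (InputStream in = Files.newInputStream(file)) {
            header = in.readNBytes(OLE2_MAGIC.length); // requires Java 11+
        }
        System.out.println(file + " starts with RTF signature:  " + looksLikeRtf(header));
        System.out.println(file + " starts with OLE2 signature: " + looksLikeOle2(header));
    }
}
```

A file that reports the RTF signature is RTF regardless of its extension,
which is consistent with a content-based detector returning an RTF media type
for it.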

> Word doc MediaType detected as RTF
> ----------------------------------
>
>                 Key: TIKA-3732
>                 URL: https://issues.apache.org/jira/browse/TIKA-3732
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 2.2.1
>            Reporter: Caleb Postlethwait
>            Priority: Major
>         Attachments: example.DOC
>
>
> When executing Detector.detect(InputStream input, Metadata metadata) on a 
> particular Word document, we're getting back a MediaType of RTF which has 
> some downstream effects for us.
> Here's the relevant bit of code:
> TikaConfig config = TikaConfigFactory.getTikaConfig();
> Detector detector = config.getDetector();
> Metadata metadata = new Metadata();
> stream = TikaInputStream.get(fis = new FileInputStream(paths));
> metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, paths);
> MediaType mediaType = detector.detect(stream, metadata);
> Attaching the file that we came across this issue on.



