Tika CLI --detect returns incorrect content-type for files with altered
extensions
----------------------------------------------------------------------------------
Key: TIKA-786
URL: https://issues.apache.org/jira/browse/TIKA-786
Project: Tika
Issue Type: Bug
Components: cli
Affects Versions: 1.1
Environment: Windows
Reporter: John Mastarone
Priority: Minor
From a discussion on the user mailing list on Nov. 11, 2011, where the
following was requested as a new bug:

Tika CLI returns incorrect content type information when called with --detect
for files whose extensions have been changed (and nothing else). MS Word (.doc)
documents renamed to .xls or .ppt are incorrectly detected as Excel or
PowerPoint documents, whereas the --metadata option determines the content
type correctly (as application/msword), based on the actual contents of these
misnamed files. The same occurs with other types of MS Office 2003 documents,
and could possibly occur with a wide range of document types.

To quote Nick B., from the user mailing list: "If you look at the
TestMediaTypes class you'll see what you can get with just the mime magic and
filenames, and then there's TestContainerAwareDetector which shows the correct
detection happening by using the extra detectors available".
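
To illustrate why magic bytes plus the filename are not enough here, a minimal
standalone sketch follows. It is not Tika's implementation and uses no Tika
classes. The class DetectSketch and all of its methods are hypothetical names
made up for this example. It assumes that legacy .doc, .xls and .ppt files
share the same OLE2 signature, so only a look inside the container (here a
naive byte scan for the directory entry names "WordDocument", "Workbook" and
"PowerPoint Document", which is a simplification of real container parsing)
can tell them apart.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

public class DetectSketch {
    // OLE2 compound document signature, shared by legacy .doc, .xls and .ppt files.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Name-only guess: this is the step a renamed file can fool.
    static String byExtension(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".doc")) return "application/msword";
        if (lower.endsWith(".xls")) return "application/vnd.ms-excel";
        if (lower.endsWith(".ppt")) return "application/vnd.ms-powerpoint";
        return "application/octet-stream";
    }

    // True if the data starts with the OLE2 signature.
    static boolean isOle2(byte[] data) {
        return data.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(data, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Naive scan for a UTF-16LE directory entry name (assumption: a real
    // detector would parse the container's directory instead of scanning).
    static boolean containsEntry(byte[] data, String entryName) {
        byte[] needle = entryName.getBytes(StandardCharsets.UTF_16LE);
        outer:
        for (int i = 0; i + needle.length <= data.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) continue outer;
            }
            return true;
        }
        return false;
    }

    // Content-aware guess: the container contents win over the filename.
    static String byContent(byte[] data, String name) {
        if (!isOle2(data)) return byExtension(name);
        if (containsEntry(data, "WordDocument")) return "application/msword";
        if (containsEntry(data, "Workbook")) return "application/vnd.ms-excel";
        if (containsEntry(data, "PowerPoint Document")) return "application/vnd.ms-powerpoint";
        return "application/x-tika-msoffice"; // generic OLE2 fallback
    }

    // Synthetic stand-in for a Word file: signature plus one entry name.
    static byte[] sampleWordBytes() {
        byte[] entry = "WordDocument".getBytes(StandardCharsets.UTF_16LE);
        byte[] data = new byte[512 + entry.length];
        System.arraycopy(OLE2_MAGIC, 0, data, 0, OLE2_MAGIC.length);
        System.arraycopy(entry, 0, data, 512, entry.length);
        return data;
    }

    public static void main(String[] args) {
        byte[] data = sampleWordBytes();
        // A Word document renamed to .xls (hypothetical filename).
        System.out.println(byExtension("report.xls"));     // application/vnd.ms-excel
        System.out.println(byContent(data, "report.xls")); // application/msword
    }
}
```

The name-only path reproduces the symptom reported for --detect, while the
content-aware path gives the answer reported for --metadata.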