Tika CLI --detect returns incorrect content-type for files with altered 
extensions
----------------------------------------------------------------------------------

                 Key: TIKA-786
                 URL: https://issues.apache.org/jira/browse/TIKA-786
             Project: Tika
          Issue Type: Bug
          Components: cli
    Affects Versions: 1.1
         Environment: Windows
            Reporter: John Mastarone
            Priority: Minor


>From a discussion on the user mailing list on Nov. 11, 2011, where the
>following was requested as a new bug: Tika CLI returns incorrect content
>type information when called with --detect on files whose extensions have
>been changed (and nothing else). MS Word (.doc) documents renamed to .xls
>or .ppt are incorrectly detected as Excel or PowerPoint documents, whereas
>the --metadata option determines the content type correctly (as
>application/msword) from the actual contents of these misnamed files. The
>same occurs with other types of MS Office 2003 documents, and could
>possibly occur with a wide range of document types.
>
>To quote Nick B. from the user mailing list: "If you look at the
>TestMediaTypes class you'll see what you can get with just the mime magic and
>filenames, and then there's TestContainerAwareDetector which shows the correct
>detection happening by using the extra detectors available".
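
The difference between the two detection paths described above can be illustrated with a small standalone sketch. It does not use Tika and is not Tika's implementation: the class DetectionSketch, its methods, and the extension table are hypothetical, and the set of container entry names is passed in by hand rather than parsed from a file. It assumes that .doc, .xls and .ppt files all share the OLE2 signature, so magic bytes alone cannot tell them apart, and that the generic OLE2 type is named application/x-tika-msoffice.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Tika.
class DetectionSketch {
    // OLE2 compound-document signature shared by .doc, .xls and .ppt files.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Assumed name of the generic "some OLE2 file" type.
    static final String GENERIC_OLE2 = "application/x-tika-msoffice";

    // Hypothetical extension table used to refine the generic type.
    static final Map<String, String> BY_EXTENSION = Map.of(
        "doc", "application/msword",
        "xls", "application/vnd.ms-excel",
        "ppt", "application/vnd.ms-powerpoint");

    static boolean hasOle2Magic(byte[] header) {
        return header.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Magic plus file name: the extension decides the specific type,
    // so a renamed file is reported with the wrong type.
    static String detectByMagicAndName(byte[] header, String fileName) {
        if (!hasOle2Magic(header)) {
            return "application/octet-stream";
        }
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, GENERIC_OLE2);
    }

    // Container-aware: the entries stored inside the OLE2 container
    // decide the type, and the file name is ignored.
    static String detectByContainerEntries(byte[] header, Set<String> entryNames) {
        if (!hasOle2Magic(header)) {
            return "application/octet-stream";
        }
        if (entryNames.contains("WordDocument")) {
            return "application/msword";
        }
        if (entryNames.contains("Workbook") || entryNames.contains("Book")) {
            return "application/vnd.ms-excel";
        }
        if (entryNames.contains("PowerPoint Document")) {
            return "application/vnd.ms-powerpoint";
        }
        return GENERIC_OLE2;
    }

    public static void main(String[] args) {
        // A Word document that has been renamed to report.xls.
        Set<String> wordEntries = Set.of("WordDocument", "1Table");
        System.out.println(detectByMagicAndName(OLE2_MAGIC, "report.xls"));
        System.out.println(detectByContainerEntries(OLE2_MAGIC, wordEntries));
    }
}
```

In this sketch the first call follows the misleading extension, while the second reports application/msword because it looks at what the container actually holds. That mirrors the gap between --detect and --metadata reported in this issue.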

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
