[ 
https://issues.apache.org/jira/browse/TIKA-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075006#comment-16075006
 ] 

Luis Filipe Nassif commented on TIKA-2419:
------------------------------------------

Hi Nick,

I solved the original issue of eml(x) being detected as HTML by increasing the 
magic priority of eml(x) instead of decreasing the HTML priority. That may be a 
simpler approach here. 
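
To illustrate the idea, here is a minimal sketch of priority-based magic matching. It is a toy model, not Tika's actual detector: the `MagicRule` class, the `detect` function, the byte patterns, and the eml priorities of 30 and 60 are all hypothetical. Only the HTML priority of 40 comes from the issue description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MagicRule:
    """Hypothetical magic rule: matches if any needle occurs in the file head."""
    mime_type: str
    priority: int
    needles: tuple  # lowercase byte strings


def detect(data, rules, default="application/octet-stream", window=1024):
    """Return the MIME type of the highest-priority rule that matches."""
    head = data[:window].lower()
    matches = [r for r in rules if any(n in head for n in r.needles)]
    if not matches:
        return default
    return max(matches, key=lambda r: r.priority).mime_type


# HTML priority of 40 is taken from the issue text; the rest is assumed.
HTML = MagicRule("text/html", 40, (b"<html",))
EML_LOW = MagicRule("message/rfc822", 30, (b"from:", b"received:"))
EML_HIGH = MagicRule("message/rfc822", 60, (b"from:", b"received:"))

# An email whose body contains HTML, and a plain HTML page.
EMAIL = b"From: a@example.org\r\nSubject: hi\r\n\r\n<html><body>hi</body></html>"
PAGE = b"<html><head><title>t</title></head><body>hi</body></html>"

if __name__ == "__main__":
    print(detect(EMAIL, [HTML, EML_LOW]))   # HTML outranks the email rule
    print(detect(EMAIL, [HTML, EML_HIGH]))  # email rule now outranks HTML
    print(detect(PAGE, [HTML, EML_HIGH]))   # plain HTML is unaffected
```

In this model, raising the eml(x) priority fixes the email case while leaving HTML above or below XML as desired, since the HTML priority itself is never touched.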

> Try HTML mime magic on broken XML files
> ---------------------------------------
>
>                 Key: TIKA-2419
>                 URL: https://issues.apache.org/jira/browse/TIKA-2419
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.15
>            Reporter: Nick Burch
>
> As noticed from the latest Common Crawl work, some URL-hosted HTML files are 
> being detected as text/plain and then specialised to a programming-language 
> type based on their URL extension.
> This is caused by broken XML in the HTML, and by us having dropped the magic 
> priority of HTML to 40 (below that of XML), to avoid it matching for 
> other types that contain HTML, such as emails. Because these files have broken 
> XML (e.g. an empty encoding on the xml tag), the XML root extractor doesn't run, 
> and they fall back to text/plain and are then specialised by filename.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
