[ https://issues.apache.org/jira/browse/TIKA-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075006#comment-16075006 ]
Luis Filipe Nassif commented on TIKA-2419:
------------------------------------------

Hi Nick,

I solved the original issue of eml(x) being detected as HTML by increasing the magic priority of eml(x) instead of decreasing the HTML priority. Maybe that is a simpler approach here.

> Try HTML mime magic on broken XML files
> ---------------------------------------
>
>                 Key: TIKA-2419
>                 URL: https://issues.apache.org/jira/browse/TIKA-2419
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.15
>            Reporter: Nick Burch
>
> As noticed from the latest common crawl work, some url-hosted HTML files are
> being detected as text/plain then specialised out to their programming
> language url extension.
> This is caused by broken XML in the HTML, and by us having dropped the magic
> priority of HTML to 40 (below that of XML), to avoid it matching for
> HTML-containing other types like emails. Because these files have broken XML
> (eg an empty encoding on the xml tag), the XML root extractor doesn't run,
> and they get downmixed to text/plain then specialised by filename.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)