[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085883#comment-16085883 ]
Luis Filipe Nassif commented on TIKA-2042:
------------------------------------------

See TIKA-879. Looks like widening the magic search helped to detect more emls in the test corpus. [~talli...@apache.org] do you remember if that resulted in lots of false positives?

> MBOX file detected wrongly as text/html
> ---------------------------------------
>
>                 Key: TIKA-2042
>                 URL: https://issues.apache.org/jira/browse/TIKA-2042
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.13
>         Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the time of this writing
>            Reporter: Vjeran Marcinko
>             Fix For: 1.14
>
>         Attachments: clojure.mbox, mbox_email_section.txt, mbox_header.txt
>
>
> MBOX file doesn't get recognized via "magic detection" mechanism as "application/mbox", but wrongly as "text/html".
> Workaround for this in Tika 1.13 is achieved by placing following in custom-mimetypes.xml, as suggested on mailing list (priority has to be larger than message/rfc822):
> <mime-type type="application/mbox">
>   <magic priority="70">
>     <match value="From " type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.mbox"/>
> </mime-type>
> Sample MBOX file is attached.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
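
For illustration only, the offset-0 string match in the quoted workaround (`value="From "`, `offset="0"`) can be sketched in plain Java with no Tika dependency. This is not Tika's detector code: the class name `MboxMagicSketch` and the method `startsWithMagic` are hypothetical, and the sketch assumes the first bytes of the file are already in memory. It only shows what "the file starts with `From ` followed by a space" means at the byte level.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not Tika's implementation: mimics a single
// <match value="From " type="string" offset="0"/> rule.
public class MboxMagicSketch {
    // The five bytes 'F','r','o','m',' ' (note the trailing space).
    private static final byte[] MAGIC = "From ".getBytes(StandardCharsets.US_ASCII);

    // Returns true when the buffer begins with the magic bytes at offset 0.
    static boolean startsWithMagic(byte[] head) {
        if (head.length < MAGIC.length) {
            return false; // too short to contain the magic string
        }
        return Arrays.equals(Arrays.copyOfRange(head, 0, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) {
        // Made-up sample inputs for the sketch.
        byte[] mbox = "From alice@example.org Thu Jul 13 10:00:00 2017\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] header = "From: alice@example.org\n".getBytes(StandardCharsets.US_ASCII);
        byte[] html = "<html><body>From here on</body></html>"
                .getBytes(StandardCharsets.US_ASCII);

        System.out.println(startsWithMagic(mbox));   // mbox separator line
        System.out.println(startsWithMagic(header)); // header "From:" has ':' where the space is expected
        System.out.println(startsWithMagic(html));   // magic not at offset 0
    }
}
```

A prefix test like this is deliberately narrow: a header line beginning with `From:` does not match because the fifth byte is a colon, not a space. In Tika itself the outcome also depends on rule priority relative to other types, as the quoted description notes.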