[ 
https://issues.apache.org/jira/browse/TIKA-4351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17922279#comment-17922279
 ] 

Subbu commented on TIKA-4351:
-----------------------------

[~tallison]: I am interested in picking this up. Should _forName_ enforce the 
restriction from [RFC 
2045|https://datatracker.ietf.org/doc/html/rfc2045#section-5.1] (allow only 
US-ASCII), or the more restrictive rules of [RFC 
6838|https://datatracker.ietf.org/doc/html/rfc6838#section-4.2], as the original 
thread points out?
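
To make the difference between the two options concrete, here is a minimal 
sketch of both checks as plain regular expressions. It is not Tika code: the 
class and method names (MimeNameCheck, isValidRfc2045, isValidRfc6838) are 
hypothetical, it only looks at the bare "type/subtype" string without 
parameters, and it assumes my reading of the "token" rule in RFC 2045 and the 
"restricted-name" rule in RFC 6838.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: compares the two candidate grammars.
final class MimeNameCheck {

    // RFC 2045 "token": printable US-ASCII except SPACE, CTLs and tspecials
    // ( ) < > @ , ; : \ " / [ ] ? =
    private static final String TOKEN_2045 = "[!#$%&'*+\\-.0-9A-Z^_`a-z{|}~]+";

    // RFC 6838 "restricted-name": starts with a letter or digit, followed by
    // at most 126 characters from a smaller set
    private static final String NAME_6838 =
            "[A-Za-z0-9][A-Za-z0-9!#$&\\-^_.+]{0,126}";

    private static final Pattern RFC2045 =
            Pattern.compile(TOKEN_2045 + "/" + TOKEN_2045);
    private static final Pattern RFC6838 =
            Pattern.compile(NAME_6838 + "/" + NAME_6838);

    // True if the whole string is "token/token" in the RFC 2045 sense.
    static boolean isValidRfc2045(String name) {
        return RFC2045.matcher(name).matches();
    }

    // True if type and subtype are both RFC 6838 restricted names.
    static boolean isValidRfc6838(String name) {
        return RFC6838.matcher(name).matches();
    }

    public static void main(String[] args) {
        // "foo/b\u00e4r" is the non-ASCII example from the issue description.
        String[] samples = {"text/html", "foo/b\u00e4r", "foo/-bar", "foo/b%r"};
        for (String s : samples) {
            System.out.println(s + "  rfc2045=" + isValidRfc2045(s)
                    + "  rfc6838=" + isValidRfc6838(s));
        }
    }
}
```

Under these assumptions both checks reject the non-ASCII name, but only the RFC 
6838 check also rejects a subtype that starts with "-" or contains "%".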

> More restrictive MIME type validation
> -------------------------------------
>
>                 Key: TIKA-4351
>                 URL: https://issues.apache.org/jira/browse/TIKA-4351
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, mime
>    Affects Versions: 3.0.0
>            Reporter: Sebastian Nagel
>            Priority: Major
>
> Background:
> - [~tallison] started a [discussion on the Common Crawl user 
> group|https://groups.google.com/g/common-crawl/c/0FANtRcJOts/m/q5KtncIcBgAJ] 
> about strange and obviously erroneous "identified" MIME types in Common Crawl 
> data which were identified in Nutch using Tika's magic detector. See 
> [o.a.nutch.util.MimeUtil#autoResolveContentType|https://github.com/apache/nutch/blob/e1b8dbe909b0f8c181dcb5ee0e7e072f27f82cbb/src/java/org/apache/nutch/util/MimeUtil.java#L153]
>  for the source code.
> - the issue is tracked on Nutch's side in NUTCH-3089
> - however, implementing a complex MIME type validation seems out of Nutch's 
> scope and is probably better done and maintained by Tika
> While looking at more examples, digging deeper and trying to improve the 
> detection code in Nutch, I came up with the following points regarding the 
> validation of the MIME type in 
> [MimeTypes#forName|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#forName(java.lang.String)].
>  The method is used by both Nutch and Tika (in 
> [MimeTypes#detect(...)|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#detect(java.io.InputStream,org.apache.tika.metadata.Metadata)]):
> - "forName" accepts non-ASCII Unicode characters as part of the MIME type 
> ({{foo/bär}}). This is not covered by [RFC 
> 2045|https://datatracker.ietf.org/doc/html/rfc2045#section-5.1], which allows 
> only US-ASCII characters. Of course, one might argue that the HTTP header 
> parser should already filter out such headers, but ...
> - the grammar in RFC 2045 is interpreted leniently, that is, a type or subtype 
> may include the allowed characters in any order
>   -- (sub)types not registered at IANA are accepted even if they do not start 
> with "x-" / "X-" / "x."
>   -- [RFC 6838|https://datatracker.ietf.org/doc/html/rfc6838#section-4.2] is 
> more restrictive, e.g.,
>      --- (sub)types are required to start with a letter or number
>      --- fewer non-letter/number characters are allowed
> - Nutch passes the Content-Type HTTP header value and the URL as metadata 
> hints to MimeTypes.detect(inputstream, metadata). This helped to improve the 
> detection, especially for types which are subclasses of application/zip. At 
> least in the past, this was necessary to handle various Office document 
> types.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
