[ 
https://issues.apache.org/jira/browse/TIKA-4351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated TIKA-4351:
----------------------------------
    Description: 
Background:
- [~tallison] started a [discussion on the Common Crawl user 
group|https://groups.google.com/g/common-crawl/c/0FANtRcJOts/m/q5KtncIcBgAJ] 
about strange and obviously erroneous "identified" MIME types in Common Crawl 
data which were identified in Nutch using Tika's magic detector. See 
[o.a.nutch.util.MimeUtil#autoResolveContentType|https://github.com/apache/nutch/blob/e1b8dbe909b0f8c181dcb5ee0e7e072f27f82cbb/src/java/org/apache/nutch/util/MimeUtil.java#L153]
 for the source code.
- the issue is tracked on the Nutch side in NUTCH-3089
- however, implementing a complex MIME type validation seems out of Nutch's 
scope and is probably better done and maintained by Tika

While looking at more examples, digging deeper and trying to improve the 
detection code in Nutch, I came up with the following points regarding the 
validation of the MIME type in 
[MimeTypes#forName|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#forName(java.lang.String)].
 The method is used by both Nutch and Tika (in 
[MimeTypes#detect(...)|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#detect(java.io.InputStream,org.apache.tika.metadata.Metadata)]):

- "forName" accepts non-ASCII Unicode characters as part of the MIME type 
({{foo/bär}}). This is not covered by [RFC 
2045|https://datatracker.ietf.org/doc/html/rfc2045#section-5.1], which allows 
only US-ASCII characters. Of course, one might argue that the HTTP header 
parser should already filter out such headers, but ...

- the grammar in RFC 2045 is interpreted loosely, that is, a type or subtype 
may include the allowed characters in any order
  -- (sub)types not registered at IANA are accepted even if they do not start 
with "x-" / "X-" / "x."
  -- [RFC 6838|https://datatracker.ietf.org/doc/html/rfc6838#section-4.2] is 
more restrictive, e.g.,
     --- (sub)types are required to start with a letter or number
     --- fewer non-letter/number characters are allowed

- Nutch passes the Content-Type HTTP header value and the URL as metadata hints 
to MimeTypes.detect(inputstream, metadata). This helped to improve the 
detection, especially for types which are subclasses of application/zip. At 
least in the past, this was necessary to handle various Office document types.
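
To make the RFC 6838 point above more concrete, here is a minimal sketch of a 
stricter check based on the restricted-name grammar in RFC 6838, section 4.2. 
It is not Tika or Nutch code: the class name RestrictedMimeNameCheck and the 
method isValid are hypothetical, and the sketch assumes that parameters such 
as "; charset=..." have already been stripped from the value.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or Nutch.
final class RestrictedMimeNameCheck {

    // restricted-name from RFC 6838, section 4.2: the first character is a
    // letter or digit, followed by up to 126 letters, digits or one of
    // ! # $ & - ^ _ . + (so at most 127 characters, US-ASCII only).
    private static final Pattern RESTRICTED_NAME =
            Pattern.compile("[A-Za-z0-9][A-Za-z0-9!#$&\\-^_.+]{0,126}");

    // Returns true if the value is exactly "type/subtype" and both parts
    // match restricted-name. Assumes parameters were removed beforehand.
    static boolean isValid(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        int slash = mimeType.indexOf('/');
        if (slash < 0 || slash != mimeType.lastIndexOf('/')) {
            return false; // no slash, or more than one
        }
        String type = mimeType.substring(0, slash);
        String subtype = mimeType.substring(slash + 1);
        return RESTRICTED_NAME.matcher(type).matches()
                && RESTRICTED_NAME.matcher(subtype).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "application/zip", "application/xhtml+xml", "foo/bär", "-foo/bar"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

With this pattern, {{foo/bär}} is rejected because "ä" is outside US-ASCII, 
and {{-foo/bar}} is rejected because the type does not start with a letter or 
digit. Whether such a check belongs in "forName" itself or in a separate 
validation step is left open here.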

  was:
Background:
- [~tallison] started a [discussion on the Common Crawl user 
group|https://groups.google.com/g/common-crawl/c/0FANtRcJOts/m/q5KtncIcBgAJ] 
about strange and obviously erroneous "identified" MIME types in Common Crawl 
data which were identified in Nutch using Tika's magic detector. See 
[o.a.nutch.util.MimeUtil#autoResolveContentType|https://github.com/apache/nutch/blob/e1b8dbe909b0f8c181dcb5ee0e7e072f27f82cbb/src/java/org/apache/nutch/util/MimeUtil.java#L153]
 for the source code.
- the issue is tracked on Nutch's site in NUTCH-3089
- however, implementing a complex MIME type validation seems out of Nutch's 
scope and is eventually better done and maintained by Tika

While looking at more examples, digging deeper and trying to improve the 
detection code in Nutch, I came up with the following points regarding the 
validation of the MIME type in 
[MimeTypes#forName|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#forName(java.lang.String)].
 The method is used both from Nutch and Tika (in 
[MimeTypes#detect(...)|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#detect(java.io.InputStream,org.apache.tika.metadata.Metadata)]):

- "forName" accepts non-ASCII Unicode characters as part of the MIME type 
({{foo/bär}}) - not covered by [RFC 
2045|https://datatracker.ietf.org/doc/html/rfc2045#section-5.1] which allows 
only US_ASCII characters. Of course, one might argue, that already the HTTP 
header parser should filter such headers away, but ...

- the grammar in RFC 2045 is lazy interpreted, that is a type or subtype may 
include the allowed characters in any order
  - (sub)types not registered at IANA are accepted even if they do not start 
with "x-" / "X-" / "x."
  - [RFC 6838|https://datatracker.ietf.org/doc/html/rfc6838#section-4.2] is 
more restrictive, e.g., it requires that (sub)types start with a letter or 
number

- Nutch passes the Content-Type HTTP header value and the URL as metadata hints 
to MimeTypes.detect(inputstream, metadata). This helped to improve the 
detection especially for types which are subclasses of application/zip. At 
least, in the past, this was necessary to handle various Office document types.


> More restrictive MIME type validation
> -------------------------------------
>
>                 Key: TIKA-4351
>                 URL: https://issues.apache.org/jira/browse/TIKA-4351
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, mime
>    Affects Versions: 3.0.0
>            Reporter: Sebastian Nagel
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)