[ https://issues.apache.org/jira/browse/TIKA-4351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898794#comment-17898794 ]
Sebastian Nagel commented on TIKA-4351:
---------------------------------------

No, I'm not working on anything. But I'm happy to help with writing regular expressions or with testing any PR or draft against a long list of Content-Type headers.

I'm also not sure to what extent a stricter validation is possible. All MIME types in the IANA registry follow a clear pattern. Then there are the `x-` types, but even most of these seem to follow, to a great extent, the pattern used in the registry (except for the x- prefix). However, it's a broad field and there are many non-standardized file formats.

> More restrictive MIME type validation
> -------------------------------------
>
>                 Key: TIKA-4351
>                 URL: https://issues.apache.org/jira/browse/TIKA-4351
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, mime
>    Affects Versions: 3.0.0
>            Reporter: Sebastian Nagel
>            Priority: Major
>
> Background:
> - [~tallison] started a [discussion on the Common Crawl user
> group|https://groups.google.com/g/common-crawl/c/0FANtRcJOts/m/q5KtncIcBgAJ]
> about strange and obviously erroneous "identified" MIME types in Common Crawl
> data which were identified in Nutch using Tika's magic detector. See
> [o.a.nutch.util.MimeUtil#autoResolveContentType|https://github.com/apache/nutch/blob/e1b8dbe909b0f8c181dcb5ee0e7e072f27f82cbb/src/java/org/apache/nutch/util/MimeUtil.java#L153]
> for the source code.
> - the issue is tracked on Nutch's side in NUTCH-3089
> - however, implementing a complex MIME type validation seems out of Nutch's
> scope and is likely better implemented and maintained in Tika
>
> While looking at more examples, digging deeper and trying to improve the
> detection code in Nutch, I came up with the following points regarding the
> validation of the MIME type in
> [MimeTypes#forName|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#forName(java.lang.String)].
> The method is used by both Nutch and Tika (in
> [MimeTypes#detect(...)|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#detect(java.io.InputStream,org.apache.tika.metadata.Metadata)]):
> - "forName" accepts non-ASCII Unicode characters as part of the MIME type
> ({{foo/bär}}). This is not covered by [RFC
> 2045|https://datatracker.ietf.org/doc/html/rfc2045#section-5.1], which allows
> only US-ASCII characters. Of course, one might argue that the HTTP header
> parser should already filter out such headers, but ...
> - the grammar in RFC 2045 is interpreted loosely, that is, a type or subtype
> may include the allowed characters in any order
> -- (sub)types not registered at IANA are accepted even if they do not start
> with "x-" / "X-" / "x."
> -- [RFC 6838|https://datatracker.ietf.org/doc/html/rfc6838#section-4.2] is
> more restrictive, e.g.,
> --- (sub)types are required to start with a letter or number
> --- fewer non-letter/number characters are allowed
> - Nutch passes the Content-Type HTTP header value and the URL as metadata
> hints to MimeTypes.detect(inputstream, metadata). This helped to improve the
> detection, especially for types which are subclasses of application/zip. At
> least in the past, this was necessary to handle various Office document
> types.
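
To make the RFC 6838 points above concrete, here is a minimal sketch of what a stricter check could look like. It is not Tika's implementation and not a proposed patch: the class and method names (StrictMimeTypeCheck, isRestrictedName) are hypothetical, and the pattern is one reading of the restricted-name grammar in RFC 6838, section 4.2 (first character a letter or digit, at most 127 characters per name, a reduced set of punctuation). Stripping parameters after ';' instead of validating them is an assumption made to keep the example short.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, for illustration only; not part of Tika or Nutch. */
final class StrictMimeTypeCheck {

    // restricted-name per RFC 6838, section 4.2:
    // one ALPHA / DIGIT, followed by up to 126 characters from
    // ALPHA / DIGIT / "!" / "#" / "$" / "&" / "-" / "^" / "_" / "." / "+"
    private static final String RESTRICTED_NAME =
            "[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]{0,126}";

    private static final Pattern TYPE_SUBTYPE =
            Pattern.compile(RESTRICTED_NAME + "/" + RESTRICTED_NAME);

    private StrictMimeTypeCheck() {
    }

    /**
     * Returns true if the type/subtype part of the given value matches the
     * restricted-name grammar. Parameters after ';' are ignored
     * (assumption: they are validated elsewhere, if at all).
     */
    static boolean isRestrictedName(String mediaType) {
        if (mediaType == null) {
            return false;
        }
        int semicolon = mediaType.indexOf(';');
        String base = semicolon >= 0 ? mediaType.substring(0, semicolon) : mediaType;
        return TYPE_SUBTYPE.matcher(base.trim()).matches();
    }
}
```

With this sketch, isRestrictedName("application/zip") and isRestrictedName("text/html; charset=utf-8") return true, while isRestrictedName("foo/bär") and isRestrictedName("-foo/bar") return false. Note that it only checks the syntax; it does not consult the IANA registry and does not treat "x-" / "x." prefixes specially.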