Hi,

2015-01-02 11:27 GMT-05:00 Grant Ingersoll <gsing...@apache.org>:
> I'm prototyping a new parser (to be donated) for a file type that already
> has a parser. This parser will only be applicable for certain sub types of
> that file type. How is this best handled in an auto-detection scenario?
> Are there hints we can give the MIME detector?

I see two ways to handle this:

1. The "do the right thing" approach: Tika knows how to handle media type hierarchies and optional type parameters when matching the detected media type to the appropriate parser. So you could either define an extra media type and mark it as a subtype of the more generic type (like application/java-archive is to application/zip), or add extra type parameters that carry more detailed type information (like text/plain;charset=utf-8 is to text/plain). You can then define your new parser to only accept files of that specific subtype or parameter. Once type detection can correctly detect such files, your parser will automatically be used to parse them.

2. The "worse is better" option: The above option requires you to define a new subtype or a parameter and to modify the type detection mechanism to correctly detect such files. To avoid the extra work, you could simply mark your new parser as being able to handle all files of the more generic type, and then in your parser include a fallback option to call the original Tika parser when encountering a file the new parser can't handle.

BR,

Jukka Zitting
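
As a rough illustration of the two options above, here is a minimal, self-contained Java sketch. It does not use Tika's real classes: `SimpleParser`, `Registry`, and `FallbackParser` are hypothetical stand-ins invented for this example, and the string-based "parsing" is only a placeholder. The `Registry` models option 1 (walk from the detected type up through its supertypes until a parser is found), and `FallbackParser` models option 2 (claim the generic type, then delegate to the original parser when the specialized one does not apply).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

public class ParserSelectionSketch {

    /** Hypothetical stand-in for a parser; this is not Tika's Parser interface. */
    public interface SimpleParser {
        String parse(String content);
    }

    /** Option 1: parsers are registered per media type, and subtypes know their supertype. */
    public static class Registry {
        private final Map<String, String> supertypes = new HashMap<>();
        private final Map<String, SimpleParser> parsers = new HashMap<>();

        public void declareSubtype(String subtype, String supertype) {
            supertypes.put(subtype, supertype);
        }

        public void register(String mediaType, SimpleParser parser) {
            parsers.put(mediaType, parser);
        }

        /** Walks from the detected type up the supertype chain until a parser is found. */
        public SimpleParser lookup(String detectedType) {
            for (String type = detectedType; type != null; type = supertypes.get(type)) {
                SimpleParser parser = parsers.get(type);
                if (parser != null) {
                    return parser;
                }
            }
            throw new IllegalArgumentException("No parser for " + detectedType);
        }
    }

    /** Option 2: accept everything of the generic type, fall back to the original parser. */
    public static class FallbackParser implements SimpleParser {
        private final Predicate<String> canHandle;
        private final SimpleParser specialized;
        private final SimpleParser original;

        public FallbackParser(Predicate<String> canHandle,
                              SimpleParser specialized,
                              SimpleParser original) {
            this.canHandle = canHandle;
            this.specialized = specialized;
            this.original = original;
        }

        @Override
        public String parse(String content) {
            // Use the specialized parser only when it recognizes the input.
            return canHandle.test(content)
                    ? specialized.parse(content)
                    : original.parse(content);
        }
    }

    public static void main(String[] args) {
        SimpleParser generic = content -> "generic:" + content;
        SimpleParser special = content -> "special:" + content;

        // Option 1: the subtype gets its own parser; other zip files use the generic one.
        Registry registry = new Registry();
        registry.declareSubtype("application/java-archive", "application/zip");
        registry.register("application/zip", generic);
        registry.register("application/java-archive", special);
        System.out.println(registry.lookup("application/java-archive").parse("a"));
        System.out.println(registry.lookup("application/zip").parse("b"));

        // Option 2: one parser registered for the generic type, with a fallback inside.
        // The startsWith("X") check is a placeholder for a real content inspection.
        SimpleParser combined = new FallbackParser(c -> c.startsWith("X"), special, generic);
        System.out.println(combined.parse("Xfile"));
        System.out.println(combined.parse("other"));
    }
}
```

The trade-off mirrors the one described above: the registry approach keeps each parser simple but depends on detection reporting the more specific type, while the fallback wrapper needs no detection changes but puts the "can I handle this?" check inside the parser itself.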