Hi,

2015-01-02 11:27 GMT-05:00 Grant Ingersoll <gsing...@apache.org>:
> I'm prototyping a new parser (to be donated) for a file type that already
> has a parser.  This parser will only be applicable for certain sub types of
> that file type.  How is this best handled in an auto-detection scenario?
> Are there hints we can give the MIME detector?

I see two ways to handle this:

1. The "do the right thing" approach: Tika knows how to handle media
type hierarchies and optional type parameters when matching the
detected media type to the appropriate parser. So you could either
define an extra media type and mark it as a subtype of the more
generic type (like application/java-archive is to application/zip) or
add extra type parameters to add more detailed type information (like
text/plain;charset=utf-8 is to text/plain). You can then define your
new parser to only accept files of that specific subtype or parameter.
Once type detection can correctly detect such files, your parser will
automatically be used to parse them.

2. The "worse is better" option: The above option requires you to
define a new subtype or a parameter and to modify the type detection
mechanism to correctly detect such files. To avoid the extra work, you
could simply mark your new parser as being able to handle all files of
the more generic type, and then in your parser include a fallback
option to call the original Tika parser when encountering a file the
new parser can't handle.
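
For illustration, here is a rough plain-Java sketch of both options. It
is not Tika code and does not use Tika's API: the class names, method
names and string-based media types are all hypothetical, and the lookup
only models the general idea of falling back from a specific type to its
parameter-free form and then to its supertype (option 1), and of a
parser that delegates to the original one when it can't handle the
input (option 2).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Option 1 (hypothetical model): pick a parser by walking from the
// detected type to more generic types until one has a parser registered.
class SubtypeDispatchSketch {
    // child type -> parent type, e.g. application/java-archive -> application/zip
    private final Map<String, String> superTypes = new HashMap<>();
    // media type -> name of the parser registered for exactly that type
    private final Map<String, String> parsers = new HashMap<>();

    void addSuperType(String type, String parent) { superTypes.put(type, parent); }

    void register(String type, String parserName) { parsers.put(type, parserName); }

    // Strip parameters such as ";charset=utf-8" to get the base type.
    static String baseType(String type) {
        int semicolon = type.indexOf(';');
        return (semicolon < 0 ? type : type.substring(0, semicolon)).trim();
    }

    // Try the exact type, then the type without parameters, then each supertype.
    // Assumes the supertype map contains no cycles.
    String findParser(String detected) {
        String type = detected;
        while (type != null) {
            String parser = parsers.get(type);
            if (parser != null) {
                return parser;
            }
            String base = baseType(type);
            type = base.equals(type) ? superTypes.get(type) : base;
        }
        return null; // no parser registered anywhere up the chain
    }
}

// Option 2 (hypothetical model): the new parser claims the generic type
// and hands anything it can't handle to the original parser.
class FallbackParserSketch {
    private final Predicate<byte[]> canHandle;          // check for the special sub type
    private final Function<byte[], String> specialized; // the new parser
    private final Function<byte[], String> original;    // the existing generic parser

    FallbackParserSketch(Predicate<byte[]> canHandle,
                         Function<byte[], String> specialized,
                         Function<byte[], String> original) {
        this.canHandle = canHandle;
        this.specialized = specialized;
        this.original = original;
    }

    String parse(byte[] content) {
        return canHandle.test(content) ? specialized.apply(content)
                                       : original.apply(content);
    }
}
```

With the first class, registering a parser for application/zip and
marking application/java-archive as its subtype means a detected
application/java-archive file gets the zip parser until a more specific
one is registered. The second class takes a byte array only to keep the
sketch short; a real parser reads from a stream, so the "can I handle
this?" check would have to peek at the input without consuming it before
delegating.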

BR,

Jukka Zitting
