Hi,

2015-01-02 16:37 GMT-05:00 Grant Ingersoll <gsing...@apache.org>:
> I think the problem is that the file types in question are not discernible
> by anything other than the actual content, with the big problem being this
> is an expensive operation.

Right, then approach 2 might work better, or Tyler's suggestion to just
modify the existing parser.

> I'll poke around here a bit and see if anything stands out.

A related point is the way the POI container detector uses the
TikaInputStream.get/setOpenContainer() mechanism [1] to pass the results
of any early heavy lifting from type detection to the parsing phase [2].

[1] https://tika.apache.org/1.6/api/org/apache/tika/io/TikaInputStream.html#getOpenContainer()
[2] https://github.com/apache/tika/blob/1.6/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java#L385

BR,

Jukka Zitting
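
PS. To illustrate the idea, here is a rough, self-contained sketch of the pattern. It does not use Tika itself: ContainerAwareStream is a hypothetical stand-in for TikaInputStream, and ParsedContainer, open(), detect() and parse() are made-up names for this example only. The toy "prefix:body" format is likewise an assumption. The point is just that the detector stores the result of the expensive open on the stream, and the parser reuses it instead of opening the content a second time.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class OpenContainerSketch {

    /** Hypothetical stand-in for TikaInputStream: a stream that can carry an opened container. */
    static class ContainerAwareStream extends FilterInputStream {
        private Object openContainer;

        ContainerAwareStream(InputStream in) {
            super(in);
        }

        Object getOpenContainer() {
            return openContainer;
        }

        void setOpenContainer(Object container) {
            this.openContainer = container;
        }
    }

    /** Hypothetical result of the expensive content inspection. */
    static class ParsedContainer {
        final String type;
        final String body;

        ParsedContainer(String type, String body) {
            this.type = type;
            this.body = body;
        }
    }

    /** Counts how often the expensive step actually runs. */
    static int expensiveOpens = 0;

    /** The "heavy lifting": reads the whole stream and splits a toy "prefix:body" format. */
    static ParsedContainer open(InputStream in) throws IOException {
        expensiveOpens++;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        String all = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        int sep = all.indexOf(':');
        String type = sep < 0 ? "application/octet-stream" : "application/x-" + all.substring(0, sep);
        String body = sep < 0 ? all : all.substring(sep + 1);
        return new ParsedContainer(type, body);
    }

    /** Detection opens the container and leaves it on the stream for later phases. */
    static String detect(ContainerAwareStream stream) throws IOException {
        ParsedContainer container = open(stream);
        stream.setOpenContainer(container);
        return container.type;
    }

    /** Parsing reuses the container from detection if present, else opens it itself. */
    static String parse(ContainerAwareStream stream) throws IOException {
        Object cached = stream.getOpenContainer();
        ParsedContainer container = (cached instanceof ParsedContainer)
                ? (ParsedContainer) cached
                : open(stream);
        return container.body;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "demo:hello".getBytes(StandardCharsets.UTF_8);
        ContainerAwareStream stream = new ContainerAwareStream(new ByteArrayInputStream(data));
        String type = detect(stream);
        String text = parse(stream);
        System.out.println(type + " -> " + text + " (opens: " + expensiveOpens + ")");
    }
}
```

With this arrangement the expensive open runs once per document even though both detection and parsing need its result, which is the property that matters when content inspection is the costly part.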