Hi,

2015-01-02 16:37 GMT-05:00 Grant Ingersoll <gsing...@apache.org>:
> I think the problem is that the file types in question are not discernible
> by anything other than the actual content, with the big problem being this
> is an expensive operation.

Right, then approach 2 might work better, or Tyler's suggestion to
just modify the existing parser.

> I'll poke around here a bit and see if anything stands out.

A related point is the way the POI container detector uses the
TikaInputStream.get/setOpenContainer() mechanism [1] to pass the
results of any early heavy lifting from type detection to the parsing
phase [2].

[1] https://tika.apache.org/1.6/api/org/apache/tika/io/TikaInputStream.html#getOpenContainer()
[2] https://github.com/apache/tika/blob/1.6/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java#L385
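
To make the idea concrete, here is a rough, self-contained sketch of the
pattern in plain Java. It deliberately does not use Tika: CachingStream is
a hypothetical stand-in for TikaInputStream, openContainer() is a made-up
placeholder for the expensive step, and the comma-separated "container"
format is a toy invented for the example. Only the get/setOpenContainer()
naming mirrors the real API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class OpenContainerSketch {

    /** Hypothetical stand-in for TikaInputStream: bytes plus a slot for an opened container. */
    static final class CachingStream {
        private final byte[] data;
        private Object openContainer; // null until someone stores a parsed structure here

        CachingStream(byte[] data) { this.data = data; }

        byte[] data() { return data; }
        Object getOpenContainer() { return openContainer; }
        void setOpenContainer(Object container) { this.openContainer = container; }
    }

    /** Counts how often the expensive step actually runs. */
    static int expensiveOpens = 0;

    /** Placeholder for the costly work; here it just splits a toy comma-separated entry list. */
    static List<String> openContainer(byte[] data) {
        expensiveOpens++;
        return Arrays.asList(new String(data, StandardCharsets.UTF_8).split(","));
    }

    /** Reuses the container stored on the stream, or opens it once and stores it. */
    static List<String> containerOf(CachingStream stream) {
        Object cached = stream.getOpenContainer();
        if (cached instanceof List) {
            @SuppressWarnings("unchecked")
            List<String> entries = (List<String>) cached;
            return entries;
        }
        List<String> entries = openContainer(stream.data());
        stream.setOpenContainer(entries);
        return entries;
    }

    /** Detection phase: needs the container contents to decide on a type. */
    static String detect(CachingStream stream) {
        return containerOf(stream).contains("WordDocument")
                ? "application/msword"
                : "application/octet-stream";
    }

    /** Parsing phase: picks up the container the detector already opened. */
    static String parse(CachingStream stream) {
        return String.join(" ", containerOf(stream));
    }

    public static void main(String[] args) {
        CachingStream stream = new CachingStream(
                "WordDocument,SummaryInformation".getBytes(StandardCharsets.UTF_8));
        System.out.println(detect(stream));
        System.out.println(parse(stream));
        System.out.println("expensive opens: " + expensiveOpens);
    }
}
```

The point is only that detection and parsing share one stream object, so
whichever phase runs first pays for opening the container and the other
one reuses the result. Whether this fits the file types in question
depends on how much of the detection work the parser can actually reuse.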

BR,

Jukka Zitting
