On Sat, 14 Dec 2013, Albretch Mueller wrote:
In sections 7.2 (pg. 115) ... of "Tika in Action", they discuss that topic in very general terms and mention that Tika currently uses n-grams but may change the underlying algorithm in the future.
I think it's based on tri-grams, with some code originally from Nutch, but I'm not certain. There has certainly been talk of using some more recent code, quite possibly with a wider range of gram sizes (is that the right term?), but it's not an area of the codebase I'm all that strong on.
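
For anyone unsure what "tri-grams" means here, the following is a minimal sketch of counting overlapping three-character windows in a string. It is only an illustration of the general idea and is not taken from the Tika or Nutch code; the class name TrigramSketch and the method trigrams are hypothetical, and lower-casing the input is an assumption made to keep the example simple.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: NOT Tika's or Nutch's implementation.
class TrigramSketch {

    // Count every overlapping 3-character window in the lower-cased text.
    // Inputs shorter than 3 characters yield an empty map.
    static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the tri-gram counts for a short sample word.
        System.out.println(trigrams("banana"));
    }
}
```

A detector built on this idea would typically compare such a count profile against reference profiles built from text in known languages. Supporting a wider range of gram sizes would amount to replacing the fixed 3 with a parameter.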
Nick
