On Wed, 16 Jan 2013, Cedric Meury wrote:
> A) Why does Tika only support 3-gram profiles? In the code, the legacy
> format is even referenced in comments (LanguageProfileBuilder):
It looks like the code already had that change wherever it came from.
Sadly, there's no issue number on the commit:
r1147277 | oleg | 2011-07-15 19:48:36 +0100 (Fri, 15 Jul 2011) | 1 line
added ngram profiler and its tests, also added an optinton to the
TikaCLI.java for lang.profile creation and its test
So I can't be sure. Maybe someone can remember back to July 2011? Maybe
someone could hunt it out in the list archive? :)
> B) I am not a linguistics expert, is there a way to convert the legacy
> profiles into 3-gram files expected by Tika 1.2?
I think you can probably turn 4-grams into 3-grams, but you'll need to be
very careful with the statistics: each 4-gram contains two overlapping
3-grams, so naively counting both will double-count. 1-grams and 2-grams
can't become 3-grams, as you don't have enough information to work out how
to combine them.
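
To make the 4-gram case a bit more concrete, here is a minimal Java sketch
of one way it could be done. It is not Tika code: the class and method
names (FourGramMarginalizer, toTrigramCounts) are made up for illustration,
and it assumes the legacy profile has already been parsed into a map of
4-gram to raw count, which may not hold if the legacy file only keeps the
top-ranked n-grams. The idea is to sum the counts of every 4-gram that
shares the same first three characters. Only the leading 3-gram is used,
because the trailing one is also the leading 3-gram of the next 4-gram in
the original text, and counting both would double-count.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Apache Tika.
final class FourGramMarginalizer {

    // Sums the counts of all 4-grams that share the same first three
    // characters. Assumes the values are raw occurrence counts.
    static Map<String, Long> toTrigramCounts(Map<String, Long> fourGramCounts) {
        Map<String, Long> trigrams = new TreeMap<>();
        for (Map.Entry<String, Long> e : fourGramCounts.entrySet()) {
            String gram = e.getKey();
            if (gram.length() != 4) {
                // Anything that is not a 4-gram (1-grams, 2-grams, ...)
                // is skipped; it cannot be converted this way.
                continue;
            }
            trigrams.merge(gram.substring(0, 3), e.getValue(), Long::sum);
        }
        return trigrams;
    }

    public static void main(String[] args) {
        // Made-up sample counts, not taken from any real profile.
        Map<String, Long> legacy = new TreeMap<>();
        legacy.put("then", 3L);
        legacy.put("they", 2L);
        legacy.put("hend", 1L);
        System.out.println(toTrigramCounts(legacy));
    }
}
```

Note that this misses the very last 3-gram of each training text (it is
never the start of a 4-gram), and String.length() counts UTF-16 code units,
so characters outside the BMP would need extra care.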
Nick