On Wed, 16 Jan 2013, Cedric Meury wrote:
A) Why does Tika only support 3-gram profiles? In the code, the legacy
format is even referenced in comments (LanguageProfileBuilder):

It looks like that change had already been made wherever the code came from. Sadly, there's no issue number with the commit:

r1147277 | oleg | 2011-07-15 19:48:36 +0100 (Fri, 15 Jul 2011) | 1 line
  added ngram profiler and its tests, also added an optinton to the
  TikaCLI.java for lang.profile creation and its test

So I can't be sure. Maybe someone can remember back to July 2011? Maybe someone could hunt it out in the list archive? :)

B) I am not a linguistics expert, is there a way to convert the legacy profiles into 3-gram files expected by Tika 1.2?

I think you can probably turn 4-grams into 3-grams, but you'll need to be very careful with the statistics. 1-grams and 2-grams can't become 3-grams, as you don't have enough information to work out how to combine them.
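
As a rough illustration of the statistics issue, here is a minimal Python sketch. It assumes the 4-gram profile is just a mapping from 4-character strings to raw counts; the function names are hypothetical and this is not Tika's code or its profile file format. Each 3-gram in a text is the prefix of the 4-gram starting at the same position, so summing 4-gram counts by prefix recovers the 3-gram counts, apart from the very last 3-gram of each text. Adding the suffixes as well would count almost every 3-gram twice.

```python
from collections import Counter


def trigrams_from_fourgrams(fourgram_counts):
    """Derive 3-gram counts by adding each 4-gram's count to its 3-character prefix."""
    trigram_counts = Counter()
    for gram, count in fourgram_counts.items():
        if len(gram) != 4:
            raise ValueError("expected a 4-gram, got %r" % (gram,))
        # Only the prefix is used: the suffix is already covered as the
        # prefix of the 4-gram that starts one character later.
        trigram_counts[gram[:3]] += count
    return trigram_counts


def count_ngrams(text, n):
    """Count every character n-gram of length n in the text (no padding)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


text = "banana bandana"
derived = trigrams_from_fourgrams(count_ngrams(text, 4))
direct = count_ngrams(text, 3)

# The derived counts are short by exactly the final 3-gram of the text.
missing = direct - derived
print(missing)  # Counter({'ana': 1})
```

If the legacy profile only kept the most frequent 4-grams, the derived 3-gram counts will also be underestimates, so the result is an approximation rather than what a fresh 3-gram build over the original corpus would give.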

Nick
