Dear all

I am trying to migrate from Nutch 0.9 to Tika 1.2. My current n-gram
profiles contain 1- to 4-grams and therefore cannot be read by Tika, which
only supports 3-gram profile files. I have two questions:

A) Why does Tika only support 3-gram profiles? The comments in the code
(LanguageProfileBuilder) even reference the old 1-to-4 limits:
    /** The minimum length allowed for a ngram. */
    final static int ABSOLUTE_MIN_NGRAM_LENGTH = 3; /* was 1 */
    /** The maximum length allowed for a ngram. */
    final static int ABSOLUTE_MAX_NGRAM_LENGTH = 3; /* was 4 */

B) I am not a linguistics expert. Is there a way to convert the legacy
profiles into the 3-gram files expected by Tika 1.2?
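
The only approach I can think of is to drop every entry that is not exactly
three characters long. Below is a minimal sketch of that idea, not a tested
conversion. It assumes the legacy files hold one "ngram count" pair per line,
separated by a single space, with "#" marking comment lines. The class and
method names (NgramProfileFilter, keepTrigrams) are my own and not part of
Nutch or Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or Tika.
final class NgramProfileFilter {

    /** The only n-gram length that Tika 1.2 accepts, per LanguageProfileBuilder. */
    static final int NGRAM_LENGTH = 3;

    /**
     * Keeps only the "ngram count" lines whose n-gram is exactly three
     * characters long. Blank lines and "#" comment lines are dropped.
     * Assumption: a single space separates the n-gram from its count.
     */
    static List<String> keepTrigrams(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            // The position of the first space equals the n-gram length.
            if (line.indexOf(' ') == NGRAM_LENGTH) {
                kept.add(line);
            }
        }
        return kept;
    }

    /** Usage: java NgramProfileFilter legacy.ngp converted.ngp */
    public static void main(String[] args) throws IOException {
        List<String> lines =
                Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), keepTrigrams(lines), StandardCharsets.UTF_8);
    }
}
```

What I cannot judge is whether the result is still a good profile: the
legacy files presumably keep only the most frequent n-grams across all four
lengths, so the filtered file would contain fewer 3-grams than a profile
built from scratch, and the counts may not be comparable.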

Best
Cedric