If you vectorized your training data with seq2sparse, you need to reuse
the df-count and the dictionary that it produced for the training set.
Tokenize each new document with a Lucene analyzer (ideally the same one
you used for training) and count the term frequency of every term that
appears in the dictionary. Then use the TFIDF class:
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/TFIDF.java
passing it each term frequency together with the corresponding df-count
from the training set to get the TF-IDF weights.
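
To make the steps concrete, here is a rough, dependency-free sketch. It
is not Mahout code: the class TfIdfSketch and its methods are made up
for illustration, the regex split is only a stand-in for a Lucene
analyzer, and the weight formula assumes that TFIDF delegates to
Lucene's classic similarity (sqrt(tf) * (1 + ln(numDocs / (df + 1)))).
Please check that against the TFIDF source linked above. The dictionary
and df-count maps are assumed to be already loaded from the seq2sparse
output.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TfIdfSketch {

    // Assumed formula (Lucene classic similarity); verify against Mahout's TFIDF class.
    public static double weight(int tf, int df, int numDocs) {
        return Math.sqrt(tf) * (Math.log(numDocs / (double) (df + 1)) + 1.0);
    }

    // Builds a sparse TF-IDF vector (term id -> weight) for one unseen document,
    // using only the dictionary and df-counts from the training set.
    public static Map<Integer, Double> vectorize(String text,
                                                 Map<String, Integer> dictionary,
                                                 Map<Integer, Integer> dfCounts,
                                                 int numDocs) {
        // Count term frequencies; tokens missing from the dictionary are skipped.
        Map<Integer, Integer> termFreqs = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            Integer termId = dictionary.get(token);
            if (termId != null) {
                termFreqs.merge(termId, 1, Integer::sum);
            }
        }
        // Weight each term frequency with the training-set document frequency.
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : termFreqs.entrySet()) {
            int df = dfCounts.getOrDefault(e.getKey(), 0);
            vector.put(e.getKey(), weight(e.getValue(), df, numDocs));
        }
        return vector;
    }

    public static void main(String[] args) {
        // Toy training-set statistics, invented for this example.
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("topic", 0);
        dictionary.put("model", 1);
        Map<Integer, Integer> dfCounts = new HashMap<>();
        dfCounts.put(0, 3);
        dfCounts.put(1, 9);
        int numDocs = 10; // number of documents in the training set

        System.out.println(vectorize("Topic model of a topic", dictionary, dfCounts, numDocs));
    }
}
```

The resulting term id -> weight map is what you would then copy into a
sparse vector for the topic inference step.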
On 03/17/2015 04:46 AM, mw wrote:
Hello,
I am running LDA on a training set to create a topic model.
To calculate p(topic|document) on unseen data, I need to import the
inverse document frequency from the training set.
Is there a way to do that in Mahout?
Best,
Max