If you vectorized your training data with seq2sparse, you'll need the df-count and dictionary files from the training set. Tokenize the new document with a Lucene analyzer (the same one used for the training set, so the tokens match the dictionary) and count the term frequency of every term that appears in the dictionary. You can then use the TFIDF class:

https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/TFIDF.java

passing in each term's df-count from the training set to apply the TF-IDF transformation.
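
A rough sketch of those steps in plain Java, without the Mahout or Lucene dependencies, in case it helps. Everything named here (NewDocTfidf, weight, vectorize, and the map layouts) is made up for illustration and is not Mahout API. It assumes you have already loaded the dictionary (term -> id) and df-count (id -> document frequency) into maps, that numDocs is the number of training documents, and that the document has already been tokenized. The weight formula is my assumption of what the classic Lucene similarity computes; in real code you would call TFIDF.calculate instead.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout.
final class NewDocTfidf {

    // Assumed weighting: sqrt(tf) * (ln(numDocs / (df + 1)) + 1).
    // Replace with Mahout's TFIDF.calculate(...) in a real pipeline.
    static double weight(int tf, int df, int numDocs) {
        return Math.sqrt(tf) * (Math.log(numDocs / (double) (df + 1)) + 1.0);
    }

    // Builds a sparse TF-IDF vector (term id -> weight) for one unseen document,
    // using only the dictionary and df-counts from the training set.
    static Map<Integer, Double> vectorize(List<String> tokens,
                                          Map<String, Integer> dictionary,
                                          Map<Integer, Integer> dfCounts,
                                          int numDocs) {
        // Count term frequencies, skipping tokens the training set never saw.
        Map<Integer, Integer> termFreqs = new HashMap<>();
        for (String token : tokens) {
            Integer termId = dictionary.get(token);
            if (termId != null) {
                termFreqs.merge(termId, 1, Integer::sum);
            }
        }

        // Weight each term frequency with the training-set document frequency.
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Integer, Integer> entry : termFreqs.entrySet()) {
            int df = dfCounts.getOrDefault(entry.getKey(), 0);
            vector.put(entry.getKey(), weight(entry.getValue(), df, numDocs));
        }
        return vector;
    }
}
```

Tokens that are not in the training dictionary are simply dropped, since the topic model has no entry for them anyway.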



On 03/17/2015 04:46 AM, mw wrote:
Hello,

I am running LDA on a training set to create a topic model.
To calculate p(topic|document) on unseen data, I need to import the inverse document frequency from the training set.
Is there a way to do that in Mahout?

Best,
Max
