Hello everyone,

I'm working on a project where I'm trying to extract topics from news 
articles. I have a dataset of around 500,000 articles. Here are the steps 
I'm following:

1. First, I do some preprocessing: I use Behemoth to annotate the 
documents and filter out the non-English ones.
2. Then I run Mahout's sparse-vector command to generate TF-IDF vectors. 
The problem is that the TF-IDF vector for a document contains far more 
terms than the corresponding TF vector. Moreover, some terms in the 
TF-IDF vector don't appear in that document at all. Is this correct 
behaviour, or is there something wrong with my approach?
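For reference, here is my understanding of why this looks wrong to me: TF-IDF only reweights a document's own term counts, so a term with zero TF should also have zero TF-IDF, and the two vectors should have the same set of nonzero entries. A minimal sketch in plain Python (independent of Mahout, using the basic log(N/df) IDF rather than whatever smoothing Mahout applies internally):

```python
import math
from collections import Counter

# toy corpus standing in for the news articles
docs = [
    "the cat sat on the mat".split(),
    "the dog ate my homework".split(),
    "my cat chased the dog".split(),
]

n = len(docs)
# document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tf_vector(doc):
    # raw term counts for one document
    return dict(Counter(doc))

def tfidf_vector(doc):
    # reweight the same counts by idf = log(N / df);
    # a term absent from the document has tf = 0, so it
    # cannot gain a nonzero TF-IDF weight
    return {t: c * math.log(n / df[t]) for t, c in tf_vector(doc).items()}

for doc in docs:
    # sanity check: identical nonzero support in TF and TF-IDF
    assert set(tfidf_vector(doc)) == set(tf_vector(doc))
```

So if the Mahout output really does contain terms that never occur in the document, I'd expect the cause to be somewhere in my pipeline rather than in TF-IDF itself.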

Thanks in advance!

Ani
