Hello everyone, I'm working on a project where I'm trying to extract topics from news articles. My dataset has around 500,000 articles. Here are the steps I'm following:
1. First, I do some preprocessing. For this I'm using Behemoth to annotate the documents and get rid of non-English documents.
2. Then I run Mahout's sparse vector command to generate TF-IDF vectors.

The problem is that the TF-IDF vector for a document contains far more words than the TF vector for the same document. Moreover, some words/terms in the TF-IDF vector never appeared in that specific document at all.

Is this correct behaviour, or is there something wrong with my approach?

Thanks in advance!
Ani
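
P.S. To make the expectation behind my question concrete, here is a minimal Python sketch of the textbook TF-IDF definition (weight = term count times log(N / document frequency)). It is not Mahout's code, and the toy documents and the helper names `tf_vector` and `tfidf_vector` are made up for illustration. Under this definition a term with zero count gets zero weight, so the TF-IDF vector should never contain a term that is missing from the TF vector of the same document.

```python
import math
from collections import Counter

def tf_vector(tokens):
    # Raw term counts for one document (hypothetical helper).
    return dict(Counter(tokens))

def tfidf_vector(tf, doc_freq, num_docs):
    # Textbook weighting: count * log(N / df). This is an assumption
    # for illustration, not necessarily the formula Mahout applies.
    return {
        term: count * math.log(num_docs / doc_freq[term])
        for term, count in tf.items()
    }

# Toy corpus, invented for this example.
docs = [
    "stocks fall as markets react".split(),
    "markets rally as stocks recover".split(),
    "team wins final match".split(),
]

# Document frequency: number of documents each term occurs in.
doc_freq = Counter(term for doc in docs for term in set(doc))

tf = [tf_vector(doc) for doc in docs]
tfidf = [tfidf_vector(vec, doc_freq, len(docs)) for vec in tf]

# Every term in a TF-IDF vector also occurs in the matching TF vector.
for tf_vec, tfidf_vec in zip(tf, tfidf):
    assert set(tfidf_vec) <= set(tf_vec)
```

If the same check fails on the real vectors, my guess is that the TF and TF-IDF outputs are not keyed by the same dictionary or document IDs, but I have not confirmed that.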
