Hi Lucene users,

This might seem a little vague to people just using Lucene, but I am trying to see whether I can use Lucene for my specific problem.
I am trying to build a classification solution, in which I need to index each *structured* document under its category during a training phase, and then look up a suitable category for a new document at runtime. I have a naive algorithm ready: it generates TF-IDF vectors from the document, with custom boost values for each field, and computes cosine similarity on the fly for the document to be classified.

My problem:
- I need to do this classification in 5 different languages
- The set of target categories is not large, so I don't strictly need an inverted index, but it does not hurt

Where does Lucene fit in?
- Lucene gives me a standard interface for processing various languages (Tokenizers/Analyzers under org.apache.lucene.analysis)
- Lucene gives me persistence of my index over the corpus

I want to decide between the following two approaches:
1. Use Lucene directly, and build my algorithm on top of it
2. Just use the language-specific classes from Lucene, and continue to build on my own algorithm (a rough sketch of this is below my signature)

I am sure many of you have hit this scenario. What do you recommend?

Regards,
Jeetu

ps: I am not on the list, so please cc me on replies
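
pps: To make approach 2 concrete, here is a minimal sketch of what I have in mind. It assumes a recent Lucene TokenStream API (CharTermAttribute etc.; older versions use Token/TermAttribute instead), and the class and method names are just placeholders of mine. The per-language Analyzer would be swapped in from org.apache.lucene.analysis, and the IDF/field-boost weighting step is left out here:

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CosineClassifierSketch {

    // Tokenize text with a (language-specific) Analyzer and count raw term frequencies.
    // IDF and per-field boosts would be applied afterwards to turn this into a weighted vector.
    static Map<String, Integer> termFrequencies(Analyzer analyzer, String field, String text)
            throws IOException {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            String t = term.toString();
            Integer n = tf.get(t);
            tf.put(t, n == null ? 1 : n + 1);
        }
        ts.end();
        ts.close();
        return tf;
    }

    // Cosine similarity between two already-weighted term vectors
    // (document vector vs. category centroid vector).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

In approach 1 the same thing would instead be done by indexing the training documents (with a stored category field) and letting Lucene's scoring rank candidate categories for the incoming document.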