Hi Jeetu,

Whether or not it makes sense to use Lucene as your data matrix depends a bit on your requirements. There is a Bayesian classifier available in the issue tracker <http://issues.apache.org/jira/browse/LUCENE-1039> that might be helpful, although it does need a little refactoring in order to handle more than one field as the class value.

The biggest problem with naive classifiers, in my opinion, is speed on a large data set. If that is a problem for you and your data set is not too large, then InstantiatedIndex might be a good fit. And if that is not enough, I would take a look at libSVM. You could also take a look at Weka, which contains quite a few classifiers. The problem with Weka is that your data set is rather limited by the amount of RAM in your machine, while a naive classifier on top of a Lucene index can handle very large data sets. You could of course also use Weka to do some feature selection and then only use its output in your naive classifier that reads from Lucene. That would speed things up, and you can recalculate the feature selection at any time if your data set changes.
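If it helps, here is a rough, untested sketch of the kind of naive (Bayes-style) classifier that can sit on top of a Lucene index. It is not the LUCENE-1039 patch itself; it assumes the Lucene 2.x-era API, and that each category's training text has been concatenated into a single document with a stored "category" field and a "text" field indexed with term vectors (Field.TermVector.YES). Class and field names are just placeholders.

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.Directory;

/**
 * Toy multinomial naive Bayes classifier that reads its term statistics
 * straight from a Lucene index (one document per category).
 */
public class LuceneNaiveBayes {

    /** category -> (term -> count) */
    private final Map<String, Map<String, Integer>> counts =
            new HashMap<String, Map<String, Integer>>();
    /** category -> total number of tokens in that category */
    private final Map<String, Integer> totals = new HashMap<String, Integer>();
    private int vocabularySize;

    public LuceneNaiveBayes(Directory dir) throws IOException {
        IndexReader reader = IndexReader.open(dir);
        try {
            Set<String> vocabulary = new HashSet<String>();
            for (int doc = 0; doc < reader.maxDoc(); doc++) {
                if (reader.isDeleted(doc)) continue;
                String category = reader.document(doc).get("category");
                TermFreqVector tfv = reader.getTermFreqVector(doc, "text");
                if (category == null || tfv == null) continue;

                Map<String, Integer> termCounts = new HashMap<String, Integer>();
                int total = 0;
                String[] terms = tfv.getTerms();
                int[] freqs = tfv.getTermFrequencies();
                for (int i = 0; i < terms.length; i++) {
                    termCounts.put(terms[i], freqs[i]);
                    vocabulary.add(terms[i]);
                    total += freqs[i];
                }
                counts.put(category, termCounts);
                totals.put(category, total);
            }
            vocabularySize = vocabulary.size();
        } finally {
            reader.close();
        }
    }

    /** Picks the category with the highest Laplace-smoothed log likelihood. */
    public String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            int total = totals.get(e.getKey());
            double score = 0;
            for (String token : tokens) {
                Integer c = e.getValue().get(token);
                int count = (c == null) ? 0 : c;
                score += Math.log((count + 1.0) / (total + vocabularySize));
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}

The tokens you pass to classify() should come from the same Analyzer you used at indexing time, and class priors are ignored here for brevity.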

You should also check out Apache Mahout: <http://lucene.apache.org/mahout>.

I hope this helps.


      karl

On 19 May 2009, at 02:55, Jeetendra Mirchandani wrote:

Hi Lucene users,

This might seem a little vague to people just using Lucene. I am trying to
see if I can use Lucene for my specific problem.

I am trying to build a classification solution, wherein I need to index each *structured* document into its category in the training phase, and look up a
suitable category for a document at runtime.

I have a naive algorithm ready that generates TF-IDF vectors from the
document, with custom boost values for each field in the document, and
computes cosine similarity on the fly for the document to be classified.
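(For reference, a bare-bones sketch of that cosine-similarity step, assuming sparse term-to-weight maps whose weights are the field-boosted TF-IDF values computed elsewhere; the names are only illustrative:)

import java.util.Map;

/**
 * Cosine similarity between two sparse TF-IDF vectors represented as
 * term -> weight maps.
 */
public class CosineSimilarity {

    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // iterate over the smaller vector for the dot product
        Map<String, Double> small = a.size() < b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double dot = 0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}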

My problem:
- Do this classification in 5 different languages
- The set of target categories is not large, so I don't necessarily need an
inverted index, but it does not hurt

Where does Lucene fit in?

- Lucene gives me a standard interface for processing various languages
(Tokenizers/Analyzers under org.apache.lucene.analysis)
- Lucene gives me persistence of my index over the corpus
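(A small sketch of point 1, assuming one Analyzer per language; the language codes and analyzer choices are only examples, and the non-English analyzers ship in Lucene's contrib-analyzers jar:)

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

/**
 * One Analyzer per language, all behind the common
 * org.apache.lucene.analysis.Analyzer interface.
 * Note: the no-arg constructors are the Lucene 2.x forms; later
 * versions take a Version argument.
 */
public class AnalyzerRegistry {

    private final Map<String, Analyzer> analyzers = new HashMap<String, Analyzer>();

    public AnalyzerRegistry() {
        analyzers.put("en", new StandardAnalyzer());
        analyzers.put("de", new GermanAnalyzer());
        analyzers.put("fr", new FrenchAnalyzer());
    }

    /** Falls back to the English analyzer for unknown languages. */
    public Analyzer forLanguage(String languageCode) {
        Analyzer analyzer = analyzers.get(languageCode);
        return analyzer != null ? analyzer : analyzers.get("en");
    }
}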

I want to decide between the following two approaches:
1. Use Lucene directly, and build my algorithm on top of it
2. Just use the language-specific classes from Lucene, and continue to
build on my own algorithm

I'm sure many of you have hit this scenario. What do you
recommend?

Regards,
Jeetu

ps: I am not on the list, so please cc me on the replies

