Hi Jeetu,

Whether or not it makes sense to use Lucene as your data matrix depends a bit on your requirements. There is a Bayesian classifier available in the issue tracker <http://issues.apache.org/jira/browse/LUCENE-1039> that might be helpful, although it does need a little refactoring in order to handle more than one field as the class value.

The biggest problem with naive classifiers, in my opinion, is speed on a large data set. If that is a problem for you and your data set is not too large, then InstantiatedIndex might be a good fit. And if that is not enough, I would take a look at libSVM. You could also take a look at Weka, which contains quite a few classifiers. The problem with Weka is that your data set is rather limited by the amount of RAM in your machine, while a naive classifier on top of a Lucene index can handle very large data sets. You could of course also use Weka to do some feature selection and then only use its output in your naive classifier that reads from Lucene. That would speed things up, and you can recalculate the feature selection at any time if your data set changes.
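If it helps, here is a rough, untested sketch of the kind of naive (Bayes-style) classifier that can sit on top of a Lucene index. It is not the LUCENE-1039 patch itself; it assumes the Lucene 2.x-era API, and that each category's training text has been concatenated into a single document with a stored "category" field and a "text" field indexed with term vectors (Field.TermVector.YES). Class and field names are just placeholders.

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.Directory;

/**
 * Toy multinomial naive Bayes classifier that reads its term statistics
 * straight from a Lucene index (one document per category).
 */
public class LuceneNaiveBayes {

    /** category -> (term -> count) */
    private final Map<String, Map<String, Integer>> counts =
            new HashMap<String, Map<String, Integer>>();
    /** category -> total number of tokens in that category */
    private final Map<String, Integer> totals = new HashMap<String, Integer>();
    private int vocabularySize;

    public LuceneNaiveBayes(Directory dir) throws IOException {
        IndexReader reader = IndexReader.open(dir);
        try {
            Set<String> vocabulary = new HashSet<String>();
            for (int doc = 0; doc < reader.maxDoc(); doc++) {
                if (reader.isDeleted(doc)) continue;
                String category = reader.document(doc).get("category");
                TermFreqVector tfv = reader.getTermFreqVector(doc, "text");
                if (category == null || tfv == null) continue;

                Map<String, Integer> termCounts = new HashMap<String, Integer>();
                int total = 0;
                String[] terms = tfv.getTerms();
                int[] freqs = tfv.getTermFrequencies();
                for (int i = 0; i < terms.length; i++) {
                    termCounts.put(terms[i], freqs[i]);
                    vocabulary.add(terms[i]);
                    total += freqs[i];
                }
                counts.put(category, termCounts);
                totals.put(category, total);
            }
            vocabularySize = vocabulary.size();
        } finally {
            reader.close();
        }
    }

    /** Picks the category with the highest Laplace-smoothed log likelihood. */
    public String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            int total = totals.get(e.getKey());
            double score = 0;
            for (String token : tokens) {
                Integer c = e.getValue().get(token);
                int count = (c == null) ? 0 : c;
                score += Math.log((count + 1.0) / (total + vocabularySize));
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}

The tokens you pass to classify() should come from the same Analyzer you used at indexing time, and class priors are ignored here for brevity.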

You should also check out Apache Mahout: <http://lucene.apache.org/mahout>.

I hope this helps.


      karl

On 19 May 2009, at 02:55, Jeetendra Mirchandani wrote:

Hi Lucene users,

This might seem a little vague to people just using Lucene. I am trying to
see if I can use Lucene for my specific problem.

I am trying to build a classification solution, wherein I need to index each *structured* document into its category in the training phase, and look up a
suitable category for a document at runtime.

I have a naive algorithm ready that generates TF-IDF vectors from the
document, with custom boost values for each field in the document, and
computes cosine similarity on the fly for the document to be classified.
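(For reference, a bare-bones sketch of that cosine-similarity step, assuming sparse term-to-weight maps whose weights are the field-boosted TF-IDF values computed elsewhere; the names are only illustrative:)

import java.util.Map;

/**
 * Cosine similarity between two sparse TF-IDF vectors represented as
 * term -> weight maps.
 */
public class CosineSimilarity {

    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // iterate over the smaller vector for the dot product
        Map<String, Double> small = a.size() < b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double dot = 0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}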

My problem:
- Do this classification in 5 different languages
- The set of target categories is not large, so I don't necessarily need an
inverted index, but it does not hurt

Where does Lucene fit in?

- Lucene gives me a standard interface for processing various languages
(Tokenizers/Analyzers under org.apache.lucene.analysis)
- Lucene gives me persistence of my index over the corpus
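(A small sketch of point 1, assuming one Analyzer per language; the language codes and analyzer choices are only examples, and the non-English analyzers ship in Lucene's contrib-analyzers jar:)

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

/**
 * One Analyzer per language, all behind the common
 * org.apache.lucene.analysis.Analyzer interface.
 * Note: the no-arg constructors are the Lucene 2.x forms; later
 * versions take a Version argument.
 */
public class AnalyzerRegistry {

    private final Map<String, Analyzer> analyzers = new HashMap<String, Analyzer>();

    public AnalyzerRegistry() {
        analyzers.put("en", new StandardAnalyzer());
        analyzers.put("de", new GermanAnalyzer());
        analyzers.put("fr", new FrenchAnalyzer());
    }

    /** Falls back to the English analyzer for unknown languages. */
    public Analyzer forLanguage(String languageCode) {
        Analyzer analyzer = analyzers.get(languageCode);
        return analyzer != null ? analyzer : analyzers.get("en");
    }
}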

I want to decide between the following two approaches:
1. Use Lucene directly, and build my algorithm on top of it
2. Just use the language-specific classes from Lucene, and continue to
build on my own algorithm

I'm sure many of you have hit this scenario. What do you
recommend?

Regards,
Jeetu

ps: I am not on the list, so please cc me on the replies

