Hi Lucene users,

This might seem a little vague to people just using Lucene, but I am trying to see whether I can use Lucene for my specific problem.
I am trying to build a classification solution, in which I need to index each *structured* document under its category during a training phase, and then look up a suitable category for a new document at runtime. I have a naive algorithm ready: it generates TF-IDF vectors from the document, with custom boost values for each field, and computes cosine similarity on the fly for the document to be classified.

My problem:
- I need to do this classification in 5 different languages
- The set of target categories is not large, so I don't strictly need an inverted index, but it does not hurt

Where does Lucene fit in?
- Lucene gives me a standard interface for processing various languages (Tokenizers/Analyzers under org.apache.lucene.analysis)
- Lucene gives me persistence of my index over the corpus

I want to decide between the following two approaches:
1. Use Lucene directly, and build my algorithm on top of it
2. Just use the language-specific classes from Lucene, and continue to build on my own algorithm (a rough sketch of this is below my signature)

I am sure many of you have hit this scenario. What do you recommend?

Regards,
Jeetu

ps: I am not on the list, so please cc me on replies
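
pps: To make approach 2 concrete, here is a minimal sketch of what I have in mind. It assumes a recent Lucene TokenStream API (CharTermAttribute etc.; older versions use Token/TermAttribute instead), and the class and method names are just placeholders of mine. The per-language Analyzer would be swapped in from org.apache.lucene.analysis, and the IDF/field-boost weighting step is left out here:

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CosineClassifierSketch {

    // Tokenize text with a (language-specific) Analyzer and count raw term frequencies.
    // IDF and per-field boosts would be applied afterwards to turn this into a weighted vector.
    static Map<String, Integer> termFrequencies(Analyzer analyzer, String field, String text)
            throws IOException {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            String t = term.toString();
            Integer n = tf.get(t);
            tf.put(t, n == null ? 1 : n + 1);
        }
        ts.end();
        ts.close();
        return tf;
    }

    // Cosine similarity between two already-weighted term vectors
    // (document vector vs. category centroid vector).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

In approach 1 the same thing would instead be done by indexing the training documents (with a stored category field) and letting Lucene's scoring rank candidate categories for the incoming document.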