Clustering is an intensive task. Carrot2 is an excellent framework that
clusters documents and even labels the clusters; it takes up to two
seconds to cluster 100 search-result snippets.
If you are going to cluster entire documents you'll have to wait longer.
Lorenzo
On 11/23/05, Supr
It depends on the kind of implementation you are thinking of.
You can use Lucene to create the inputs to LSI, and then use them in
your own system. I've written that code and it works, for both searches and
clustering.
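The message doesn't include that code, so as a rough illustration only: LSI reduces a term-document matrix to a low-rank latent space via a truncated SVD. The plain-Java sketch below (no Lucene dependency; the toy matrix and class name are mine, not from the thread) approximates the top right-singular vector by power iteration on A^T A, which is the simplest possible one-dimensional LSI projection of the documents.

```java
import java.util.Arrays;

public class LsiSketch {
    // Power iteration on A^T A to approximate the top right-singular
    // vector of a term-document matrix (rows = terms, columns = docs).
    static double[] topRightSingularVector(double[][] a, int iters) {
        int docs = a[0].length;
        double[] v = new double[docs];
        Arrays.fill(v, 1.0 / Math.sqrt(docs));     // start from a uniform vector
        for (int it = 0; it < iters; it++) {
            double[] u = new double[a.length];     // u = A v
            for (int i = 0; i < a.length; i++)
                for (int j = 0; j < docs; j++)
                    u[i] += a[i][j] * v[j];
            double[] w = new double[docs];         // w = A^T u
            for (int j = 0; j < docs; j++)
                for (int i = 0; i < a.length; i++)
                    w[j] += a[i][j] * u[i];
            double norm = 0;                       // normalize to unit length
            for (double x : w) norm += x * x;
            norm = Math.sqrt(norm);
            for (int j = 0; j < docs; j++) v[j] = w[j] / norm;
        }
        return v;
    }

    public static void main(String[] args) {
        // Toy matrix: docs 0,1 share one vocabulary, docs 2,3 another,
        // and the first block carries more weight.
        double[][] a = {
            {2, 2, 0, 0},
            {2, 2, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 1, 1},
        };
        double[] v = topRightSingularVector(a, 50);
        System.out.printf("%.3f %.3f %.3f %.3f%n", v[0], v[1], v[2], v[3]);
        // prints 0.707 0.707 0.000 0.000
    }
}
```

The projection cleanly separates the two document groups, which is exactly the property clustering over the latent space relies on; a real implementation would keep several singular vectors, not just the first.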
But if you are figuring out an LSI-enhanced Lucene search system (based on a
spec
Some months ago I created an index from the Reuters collection. I converted
the SGML files to XML using a tool that I found somewhere on the net
(just Google for it), then I parsed the files to create the index, using a
standard DOM parser. If you have problems parsing the SGML files I think you
--
Lorenzo Viscanti
Hi, I'm trying to modify the StandardTokenizer in order to get a
good tokenization for my needs.
Basically I would like to split a token in two when I find an apostrophe. I
think I should modify the StandardTokenizer.jj file to do that, but I'm in
trouble while changing the grammar. Can someone help?
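The proper fix does indeed go through StandardTokenizer.jj (regenerating the tokenizer with JavaCC), but the intended behavior can be sketched in plain Java. The snippet below (class name and regex are mine, purely illustrative) treats the apostrophe, along with any other non-alphanumeric character, as a token boundary, so "dell'arte" comes out as two tokens:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ApostropheSplit {
    // Lowercase the text and split on any run of characters that is
    // neither a Unicode letter nor a digit, so an apostrophe always
    // ends the current token.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("[^\\p{L}\\p{N}]+"))
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("L'Orchestra dell'arte"));
        // prints [l, orchestra, dell, arte]
    }
}
```

Note that the grammar change in StandardTokenizer.jj would be more surgical: it only needs to drop the apostrophe from the rule that currently keeps words like "don't" as a single token, leaving the rest of the tokenization untouched.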
I use my own LSI implementation based on Lucene for text clustering.
I've done some tests, but I do believe that integrating LSI into the Lucene
search subsystem (i.e. creating something like an LSISimilarity) is not an
easy task.
I start by analyzing the documents using Lucene, and then extract tf-idf values
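The tf-idf step itself is straightforward to sketch. This standalone toy version (my own illustration, not Lorenzo's code; in practice the term and document frequencies would be read from the Lucene index rather than recomputed) uses raw term frequency and idf = ln(N / df):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdf {
    // tf-idf with raw term counts and idf = ln(N / df), where N is the
    // number of documents and df the number of documents containing the term.
    static Map<String, Double> tfidf(List<String> doc, List<List<String>> corpus) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : doc) tf.merge(t, 1, Integer::sum);
        Map<String, Double> weights = new HashMap<>();
        int n = corpus.size();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            long df = corpus.stream().filter(d -> d.contains(e.getKey())).count();
            weights.put(e.getKey(), e.getValue() * Math.log((double) n / df));
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
            List.of("lucene", "search", "index"),
            List.of("lucene", "clustering"),
            List.of("reuters", "corpus"));
        // "clustering" appears in one doc of three, "lucene" in two,
        // so "clustering" gets the higher weight.
        System.out.println(tfidf(corpus.get(1), corpus));
    }
}
```

These per-document weight vectors are the natural columns of the term-document matrix that LSI (or any cosine-similarity clustering) then consumes.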