Re: Search clustering question

2005-11-24 Thread Lorenzo Viscanti
Clustering is a computationally intensive task. Carrot2 is an excellent framework that clusters documents and even labels them, and it takes up to two seconds to cluster 100 search-result snippets. If you are going to cluster entire documents you'll have to wait longer. Lorenzo On 11/23/05, Supr

Re: Lucene + LSI

2005-11-30 Thread Lorenzo Viscanti
It depends on the kind of implementation you are thinking of. You can use Lucene to create the inputs to the LSI, and then use them in your own system. I've written that code and it works, for both search and clustering. But if you are thinking of an LSI-enhanced Lucene search system (based on a spec

Re: Reuters

2006-04-21 Thread Lorenzo Viscanti
Some months ago I created an index from the Reuters collection. I converted the SGML files to XML using a tool that I found somewhere on the net (just Google for it), then I parsed the files with a standard DOM parser to create the index. If you have problems parsing the SGML files I think you
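
The DOM-parsing step described in the message above could look roughly like the sketch below. It is only an illustration in Python, not the code used for the original index: the XML layout (a hypothetical COLLECTION root wrapping REUTERS elements with a NEWID attribute and TITLE/BODY children) is an assumption about what the unnamed SGML-to-XML tool produces, and the sample text is made up.

```python
from xml.dom import minidom

# Made-up sample; the layout is an assumption about the converted files.
SAMPLE = """<COLLECTION>
<REUTERS NEWID="1"><TEXT><TITLE>Cocoa review</TITLE><BODY>Showers continued in Bahia.</BODY></TEXT></REUTERS>
<REUTERS NEWID="2"><TEXT><TITLE>Grain stocks</TITLE><BODY>Stocks rose this week.</BODY></TEXT></REUTERS>
</COLLECTION>"""


def text_of(parent, tag):
    """Return the text of the first <tag> element under parent, or '' if absent."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes:
        return ""
    parts = [n.data for n in nodes[0].childNodes if n.nodeType == n.TEXT_NODE]
    return "".join(parts).strip()


def parse_articles(xml_text):
    """Parse the XML with a DOM parser and return one dict per REUTERS element."""
    dom = minidom.parseString(xml_text)
    return [
        {
            "id": el.getAttribute("NEWID"),
            "title": text_of(el, "TITLE"),
            "body": text_of(el, "BODY"),
        }
        for el in dom.getElementsByTagName("REUTERS")
    ]


articles = parse_articles(SAMPLE)
```

Each resulting dict holds the fields one would then hand to the indexer, for example one indexed document per article with separate title and body fields.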

Re: Difference between minMergeDocs and mergeFactor

2005-05-09 Thread Lorenzo Viscanti

StandardTokenizer

2005-09-27 Thread Lorenzo Viscanti
Hi, I'm trying to modify the StandardTokenizer in order to get a good tokenization for my needs. Basically I would like to separate two tokens when I find an apostrophe. I think I should modify the StandardTokenizer.jj file to do that, but I'm having trouble changing the grammar. Can some
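
To make the desired behaviour concrete, here is a small sketch of the tokenization the message asks for, with every apostrophe acting as a token boundary. It is plain Python with a regular expression, not a change to StandardTokenizer.jj, and the function name is hypothetical; it only shows the expected output, which may help when checking a modified grammar.

```python
import re


def tokenize_split_apostrophes(text):
    """Return lowercased alphanumeric tokens; apostrophes act as separators."""
    # [^\W_]+ matches runs of letters and digits only, so ASCII and
    # typographic apostrophes (like any other punctuation) end a token.
    return [token.lower() for token in re.findall(r"[^\W_]+", text)]


tokens = tokenize_split_apostrophes("L'albero dell'amico")
```

With this rule, an elided article such as "L'" becomes its own token, which can then be dropped by a stop-word filter if desired.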

Re: Regarding Lucene and LSI

2005-10-07 Thread Lorenzo Viscanti
I use my own LSI implementation based on Lucene for text clustering. I've done some tests, but I do believe that integrating LSI into the Lucene search subsystem (i.e. creating something like LSISimilarity) is not an easy task. I start analyzing the documents using Lucene, and then extract tfidf va
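
The step of extracting tf-idf values as input for LSI can be sketched as follows. This is an assumption-laden illustration in Python, not the implementation mentioned in the message: it takes documents that are already tokenized (in the message, Lucene does the analysis), uses one common weighting variant (raw term frequency times log(N/df)), and the function name is hypothetical. The resulting document-by-term matrix is the kind of input an SVD-based LSI step would consume.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the tf-idf weight of vocab[j] in docs[i].

    Weighting is an assumed variant: raw term count times log(N / document frequency).
    """
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*docs))
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    rows = [[c[term] * idf[term] for term in vocab] for c in counts]
    return vocab, rows


# Made-up, already-tokenized documents.
docs = [
    ["lucene", "search", "index"],
    ["lucene", "clustering"],
    ["clustering", "lsi", "lsi"],
]
vocab, rows = tfidf_matrix(docs)
```

Terms that occur in every document get weight zero under this variant, so they contribute nothing to the later decomposition.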