Re: Remove/Filter emails from a TokenStream?

2013-06-12 Thread Gucko Gucko
Hello, I figured out how to solve this. I just added stopTypes.add("<EMAIL>"); On Wed, Jun 12, 2013 at 8:39 PM, Gucko Gucko wrote: > Hello all, > > is there a filter I can use to remove emails from a TokenStream? > > so far I'm using this to remove numbers, URLs, and I
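The fix described above can be sketched as follows. This is a minimal, hedged example assuming Lucene 4.3 on the classpath: UAX29URLEmailTokenizer tags each token with a type string such as "<EMAIL>", "<URL>", or "<NUM>", and TypeTokenFilter drops every token whose type appears in the stop set. The input text and class name here are illustrative, not from the original thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.TypeTokenFilter;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class EmailFilterSketch {
    public static void main(String[] args) throws IOException {
        String text = "Contact alice@example.com or see http://example.com item 42";
        // UAX29URLEmailTokenizer tags tokens with types such as <EMAIL>, <URL>, <NUM>
        UAX29URLEmailTokenizer tokenizer =
            new UAX29URLEmailTokenizer(Version.LUCENE_43, new StringReader(text));
        Set<String> stopTypes = new HashSet<String>();
        stopTypes.add("<EMAIL>"); // drop e-mail addresses
        stopTypes.add("<URL>");   // drop URLs
        stopTypes.add("<NUM>");   // drop numbers
        // true = keep position increments over the removed tokens (4.3 constructor)
        TokenStream ts = new TypeTokenFilter(true, tokenizer, stopTypes);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}
```

With the stop set above, only the plain words ("Contact", "or", "see", "item") survive filtering.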

Remove/Filter emails from a TokenStream?

2013-06-12 Thread Gucko Gucko
Hello all, is there a filter I can use to remove emails from a TokenStream? so far I'm using this to remove numbers, URLs, and I would like to remove emails too: Tokenizer tokenizer = new UAX29URLEmailTokenizer(Version.LUCENE_43, new StringReader(text)); Set stopTypes = new HashSet(); st

Re: Exception while creating a Tokenizer

2013-06-12 Thread Gucko Gucko
ath carefully and make sure > all JAR files of Lucene have the same version and no duplicate JARs with > different versions are in it! > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > >

Exception while creating a Tokenizer

2013-06-12 Thread Gucko Gucko
Hello all, I'm trying the following code (trying to play with Tokenizers in order to create my own Analyzer) but I'm getting an exception: public class TokenizerTest { public static void main(String[] args) throws IOException { String text = "A #revolution http://hi.com in t...@test.com softwa
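The reply in this thread (from Uwe Schindler, above) attributes the exception to mixed Lucene JAR versions on the classpath rather than to the code itself. For reference, a minimal sketch of the lifecycle such a test class needs is below, assuming Lucene 4.3; forgetting reset() before the first incrementToken() is another common source of exceptions with TokenStreams. The input string is abbreviated from the original post.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public class TokenizerTest {
    public static void main(String[] args) throws IOException {
        String text = "A #revolution http://hi.com in software";
        UAX29URLEmailTokenizer tokenizer =
            new UAX29URLEmailTokenizer(Version.LUCENE_43, new StringReader(text));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
        tokenizer.reset();  // mandatory before the first incrementToken()
        while (tokenizer.incrementToken()) {
            // print each token and its type, e.g. "<ALPHANUM>" or "<URL>"
            System.out.println(term.toString() + " : " + type.type());
        }
        tokenizer.end();    // record final offset state
        tokenizer.close();
    }
}
```

If this still throws, the advice from the thread applies: verify that every Lucene JAR (core, analyzers-common, etc.) is the same version and that no duplicates are on the classpath.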

How to get the most frequent words for a set of documents in Lucene?

2013-06-09 Thread Gucko Gucko
Hello all, I'm trying to cluster documents that were indexed using Lucene 4.3. The clustering algorithm produces a set of clusters, where each cluster contains the most similar documents (I only store their docIDs in each cluster). What I want is to get the most frequent words for each clu
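One way to approach the question above, sketched under the assumption that the field was indexed with term vectors enabled (FieldType.setStoreTermVectors(true)): for each docID in a cluster, read its term vector via IndexReader.getTermVector and sum the per-document frequencies, then sort. The class and method names here are hypothetical, not from the thread.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class ClusterTopWords {
    // Sum term frequencies over the docIDs of one cluster and return the top n.
    static List<Map.Entry<String, Long>> topWords(IndexReader reader,
                                                  List<Integer> clusterDocIds,
                                                  String field,
                                                  int n) throws IOException {
        Map<String, Long> counts = new HashMap<String, Long>();
        for (int docId : clusterDocIds) {
            Terms vector = reader.getTermVector(docId, field);
            if (vector == null) continue; // no term vector stored for this doc
            TermsEnum te = vector.iterator(null); // 4.3 signature takes a reuse arg
            BytesRef term;
            while ((term = te.next()) != null) {
                String word = term.utf8ToString();
                Long prev = counts.get(word);
                // within a single-doc term vector, totalTermFreq() is the
                // frequency of the term in that document
                counts.put(word, (prev == null ? 0L : prev) + te.totalTermFreq());
            }
        }
        List<Map.Entry<String, Long>> entries =
            new ArrayList<Map.Entry<String, Long>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Long>>() {
            public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
                return b.getValue().compareTo(a.getValue()); // descending by count
            }
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }
}
```

If term vectors were not stored at index time, the documents would have to be re-indexed, or the raw text re-analyzed per document, to get per-cluster counts.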