Lucene Tokenizer + Merge terms

2009-08-17 Thread joe_coder
I am using a custom analyzer:
    public TokenStream tokenStream(String fieldName, Reader reader) {
        StandardTokenizer tokenStream = new StandardTokenizer(reader);
        tokenStream.setMaxTokenLength(maxTokenLength);
        TokenStream result = new ASCIIFoldingFilter(tokenStream);

Personalized Search

2009-08-14 Thread joe_coder
Let's say we have 4 users: U1, U2, U3 and U4. Each user has a title and a set of documents created by him/her. Using this info, we can come up with a term vector (interest vector) containing the top terms that appeared in his/her docs, along with their frequencies. So conceptually, we get
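A minimal Java sketch of how such an interest vector might be assembled, assuming each of a user's documents has already been reduced to a map of term counts. The class name `InterestVectorSketch`, the method `topTerms`, and the cutoff `k` are hypothetical and not taken from the thread; ties are broken alphabetically only to make the result deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class InterestVectorSketch {
    // Sums per-document term counts for one user and keeps the k most frequent terms.
    static Map<String, Integer> topTerms(List<Map<String, Integer>> docs, int k) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            doc.forEach((term, count) -> total.merge(term, count, Integer::sum));
        }
        // Sort by descending frequency, then alphabetically for a stable order.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(total.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // LinkedHashMap preserves the ranked order of the retained terms.
        Map<String, Integer> top = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(k, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }
}
```

Calling `topTerms` once per user (U1 through U4) would give one ranked term-to-frequency map per user under these assumptions.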

Re: Term Extraction

2009-08-13 Thread joe_coder
"token: "+t)); t = ts.next(); } But I would need to enhance it to include: splitting on hyphens, semicolons, etc.; stemming (Porter); synonyms. Thanks joe_coder wrote: > > Grant, thanks for responding. > > My issue is that I am not planning to use lucene ( as I d

Re: Term Extraction

2009-08-13 Thread joe_coder
alize that I would need to do some preprocessing to remove stopwords, stem words and also check for synonyms. So wondering if there is already such code present in lucene ( or any other project ) that I can use directly. Thanks! Grant Ingersoll-6 wrote: > > > On Aug 13, 2009, at 7:40

Term Extraction

2009-08-13 Thread joe_coder
I was wondering if there is any way to use the Lucene API directly to extract terms from a given string. My requirement is that I have a text document for which I need a term frequency vector (after stemming, stopword removal, and synonym checks). The result needs to be the terms and their frequencies. I
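For illustration, here is a rough sketch of the requested output (terms with frequencies) written against the Java standard library only, not the Lucene API. The class `TermFrequencySketch`, its stopword list, and its synonym map are hypothetical placeholders. Stemming is deliberately left out: a Porter stemmer would slot in where the synonym lookup happens.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class TermFrequencySketch {
    // Hypothetical stopword list and synonym map, for illustration only.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and", "to", "in"));
    static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("automobile", "car");
    }

    // Lowercases, splits on any non-alphanumeric run (so hyphens and
    // semicolons separate terms), drops stopwords, maps synonyms to a
    // canonical term, and counts what remains.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || STOPWORDS.contains(raw)) {
                continue;
            }
            String term = SYNONYMS.getOrDefault(raw, raw);
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("The car is fast; the automobile is well-built"));
    }
}
```

The regular expression only handles ASCII letters and digits, so accented text would need folding first, as the `ASCIIFoldingFilter` mentioned in the later thread does within Lucene.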